Apache OpenOffice (AOO) Bugzilla – Issue 41246
Unicode U+034F (Combining Grapheme Joiner) not interpreted correctly
Last modified: 2013-08-07 15:00:01 UTC
U+034F is a control character which should be invisible, and by default should be ignored when searching and sorting. For example, the text "C\u034Fhester" should render identically to "Chester", and a search for "Chest" should find it. [*]

At the moment, however, the presence of U+034F corrupts the rendering of a line of text, and the search described above fails. In certain languages, this prevents the searching and algorithmic sorting of texts.

[*] From Section 15.2 of the Unicode Standard 4.0: "In language-sensitive collation and searching, the combining grapheme joiner should be ignored [...] in the default collation. [...] For rendering, the combining grapheme joiner is invisible."
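The expected search and sort behaviour described above can be illustrated with a minimal Python sketch. It is not OpenOffice code: the helper names `strip_cgj`, `find_ignoring_cgj` and `sort_key` are hypothetical, and simply removing the CGJ only approximates "ignored in the default collation" (a full implementation of the Unicode Collation Algorithm treats the CGJ more subtly).

```python
import unicodedata

CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# The CGJ is a nonspacing mark (category Mn) with combining class 0.
print(unicodedata.name(CGJ), unicodedata.category(CGJ), unicodedata.combining(CGJ))


def strip_cgj(text: str) -> str:
    """Hypothetical helper: drop every CGJ so it cannot affect comparisons."""
    return text.replace(CGJ, "")


def find_ignoring_cgj(haystack: str, needle: str) -> bool:
    """Substring search that ignores the CGJ in both arguments."""
    return strip_cgj(needle) in strip_cgj(haystack)


def sort_key(text: str) -> str:
    """Simplified sort key: compare strings as if the CGJ were absent."""
    return strip_cgj(text)


text = "C" + CGJ + "hester"

# A naive search fails, which is the behaviour reported in this issue.
print("Chest" in text)

# Ignoring the CGJ makes the search succeed, as the report expects.
print(find_ignoring_cgj(text, "Chest"))

# Sorting with the key treats "C<CGJ>hester" exactly like "Chester".
print(sorted(["Chester", text, "Bangor"], key=sort_key))
```

The sort is stable, so strings that differ only by a CGJ keep their original relative order.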
Reassigned to us. Can you please take a look at this?
set to NEW
@submitter/maxweber: could you please evaluate whether the problem still persists in a current milestone? Thanks.
us -> max
what?
>>@submitter/maxweber: could you pls. evaluate whether the problem still persists in a current milestone. Thx.
>>what?

I reassigned the issue to myself to show that I am taking on the task of reproducing it.
reassign max -> us
us@maxweber: to avoid further confusion, could you please let me know what your findings are? Thanks.
us, divec: I cannot find the CGJ via Insert -> Special Character on my SuSE 9.3 with the TTF 'symbol'. Is there any other TTF that would be helpful for reproducing this issue?
Created attachment 26776: plain text file containing CGJ U+034F (in UTF-8)
Thanks for looking at this! I've attached a text file containing the CGJ. You have to load it in OOo with filetype "encoded text" and then choose the "UTF-8" character set. If you see a capital I with an acute accent, it has loaded with the wrong character set.

The file contains the text "Ban\u034Fgor -> Ban<CGJ>gor" (i.e. the first backslash is a real backslash, and the actual CGJ comes after "->"; sorry, I should've made the example simpler). When loaded, "Ban<CGJ>gor" *should* render as "Bangor" (i.e. the CGJ should be invisible). Searching for "Bangor" should succeed.

I've just tried it with m103 on Linux. It renders as "Ban[]gor" (i.e. with a square in place of the CGJ), and searching for "Bangor" fails. In other words, the CGJ is being treated like a "normal" printable character which is not in the font, instead of as a control character. Actually, I've just noticed that's not quite true, because if you move the cursor over "Ban<CGJ>gor", it will not fall between the "n" and the CGJ. So the CGJ is presumably being recognised as modifying the "n", but without the correct behaviour and semantics.

By the way, when I tried this originally, using m65 on Windows, the CGJ did not display but the text of the whole line became corrupted. Should this be tried on Windows with a more recent milestone?
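For anyone who wants to check the attachment's encoding outside OOo, the following Python sketch shows the bytes involved. It assumes the "wrong character set" is ISO-8859-1 (Latin-1); the comment above does not say which one was actually used. The variable names are illustrative only.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

sample = "Ban" + CGJ + "gor"

# In UTF-8 the CGJ occupies two bytes: 0xCD 0x8F.
data = sample.encode("utf-8")
print(data)

# Decoding with the correct character set restores the original text.
assert data.decode("utf-8") == sample

# Assumption: the wrong character set is ISO-8859-1 (Latin-1).
# Byte 0xCD then decodes to U+00CD, a capital I with an acute accent,
# and byte 0x8F becomes an invisible C1 control character.
wrong = data.decode("latin-1")
print(repr(wrong))
```

Seeing the capital I with acute in place of the CGJ is therefore a quick sign that the file was not read as UTF-8.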
@divec: thanks for the explanation and example document. Your evaluation still holds true for e.g. m106, which makes me think that we simply do not support this control character (at least not in Writer). Other control characters, e.g. the BOM (Byte Order Mark), do work though. Additionally, could you provide a typical usage scenario for this control character, e.g. in a word processor application, or is it just a matter of complying with the Unicode spec? Thanks.

US->HDU/SSA: is this something we want to support in vcl, or do we need a decision from the UserEx group first?
Since we try to support the latest Unicode standard, supporting U+034F (new since Unicode 4?) doesn't need special approval by UX, but they need to work on it. This issue should be split up into three sub-issues:
- displaying U+034F with "show non-printable characters" enabled needs to be defined => UX
- sorting/searching of text containing U+034F => ER
- not showing U+034F as a "NotDef" box => HDU
have to reassign issue.
Feature design overrides other issues. ES->Requirements: Please consider splitting this enhancement into 3 parts, as HDU stated in his comment from Thu Jun 2 00:25:18 -0700 2005. Reassigned.