41246 – Unicode U+034F (Combining Grapheme Joiner) not interpreted correctly

Summary: Unicode U+034F (Combining Grapheme Joiner) not interpreted correctly

Status:	CONFIRMED

Alias:	None

Product:	Internationalization
Classification:	Code
Component:	code
Version:	680m71
Hardware:	All All

Importance:	P3 Trivial
Target Milestone:	---
Assignee:	AOO issues mailing list
QA Contact:

URL:
Keywords:	oooqa

Depends on:
Blocks:

Reported:	2005-01-25 03:21 UTC by david
Modified:	2013-08-07 15:00 UTC
CC List:	4 users

See Also:
Issue Type:	ENHANCEMENT
Latest Confirmation in:	---
Developer Difficulty:	---

Attachments
plain text file containing CGJ U+034F (in UTF-8) (25 bytes, text/plain) 2005-05-31 22:28 UTC, david	no flags

Description david 2005-01-25 03:21:08 UTC

U+034F is a control character which should be invisible, and by default should
be ignored when searching and sorting.  For example, the text "C\u034Fhester"
should render identically to "Chester", and a search for "Chest" should find it.
 [*]

But at the moment, the presence of U+034F corrupts the rendering of a line of
text, and the search described above fails.  In certain languages, this prevents
the searching and algorithmic sorting of texts.

[*] From Section 15.2 of the Unicode Standard 4.0: "In language-sensitive
collation and searching, the combining grapheme joiner should be ignored [...]
in the default collation.  [...]  For rendering, the combining grapheme joiner
is invisible."
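
The expected search and sort behaviour can be sketched in a few lines of Python. This is only an illustration of the semantics quoted above, not how OOo implements searching or collation; the helper names `strip_cgj` and `find_ignoring_cgj` are hypothetical, and the town list is made up for the example.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER


def strip_cgj(text: str) -> str:
    """Drop every CGJ so that comparison only sees the visible letters."""
    return text.replace(CGJ, "")


def find_ignoring_cgj(haystack: str, needle: str) -> int:
    """Index of needle in the CGJ-stripped haystack, or -1 if absent."""
    return strip_cgj(haystack).find(strip_cgj(needle))


sample = "C\u034fhester"

# A naive code-point search fails because the CGJ sits between "C" and "h".
print(sample.find("Chest"))                # -1
# Ignoring the CGJ, the same search succeeds.
print(find_ignoring_cgj(sample, "Chest"))  # 0

# Naive sorting pushes the CGJ entry to the end (U+034F sorts above "o");
# using the stripped text as the sort key restores the expected order.
towns = ["Conwy", "C\u034fhester", "Cardiff"]
print([strip_cgj(t) for t in sorted(towns)])
print([strip_cgj(t) for t in sorted(towns, key=strip_cgj)])
```

Note that the returned index refers to the stripped text; a real implementation would have to map it back to a position in the original string.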

Comment 1 jack.warchold 2005-02-16 14:03:35 UTC

reassigned to us

can you please take a look at this?

Comment 2 flibby05 2005-03-12 20:07:46 UTC

set to NEW

Comment 3 ulf.stroehler 2005-05-30 13:14:39 UTC

@submitter/maxweber: could you pls. evaluate whether the problem still persists
in a current milestone. Thx.

Comment 4 flibby05 2005-05-30 13:33:58 UTC

us -> max

Comment 5 ulf.stroehler 2005-05-30 14:48:03 UTC

what?

Comment 6 flibby05 2005-05-30 18:19:31 UTC

>>@submitter/maxweber: could you pls. evaluate whether the problem still persists
>>in a current milestone. Thx.

>>what?
i reassigned the issue to me to express that i take it as my task to reproduce it.

Comment 7 flibby05 2005-05-30 18:22:49 UTC

reassign max -> us

Comment 8 ulf.stroehler 2005-05-31 18:19:25 UTC

us@maxweber: to avoid further confusion; could you pls. let me know what your
findings are. Thanks.

Comment 9 flibby05 2005-05-31 19:16:42 UTC

us, divec:
i cannot find CGJ via insert -> special character on my SuSE 9.3 with ttf 'symbol'.
any other ttf which would be helpful for reproducing this issue?

Comment 10 david 2005-05-31 22:28:59 UTC

Created attachment 26776
plain text file containing CGJ U+034F (in UTF-8)

Comment 11 david 2005-05-31 22:55:49 UTC

Thanks for looking at this!  I've attached a text file containing the CGJ.  You
have to load it in OOo with filetype "encoded text" and then choose the "UTF-8"
character set.  If you see a capital I with an acute accent, it has loaded with
the wrong character set.

The file contains the text "Ban\u034Fgor -> Ban<CGJ>gor" (i.e. the first
backslash is a real backslash, and the actual CGJ after "->" - sorry, I
should've made the example simpler).
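
For anyone checking the file by hand, here is a small Python sketch of what the bytes should look like. It assumes the attachment ends with a single newline (which would account for the 25-byte size) and that the "wrong character set" is ISO-8859-1; both are assumptions, not something verified against the attachment.

```python
CGJ = "\u034f"  # COMBINING GRAPHEME JOINER

# Reconstructed text: a literal backslash escape first, then a real CGJ.
# The trailing newline is an assumption made to match the 25-byte size.
content = "Ban\\u034Fgor -> Ban" + CGJ + "gor\n"
data = content.encode("utf-8")

print(len(data))                  # 25
print(CGJ.encode("utf-8").hex())  # cd8f

# Decoding the same bytes as ISO-8859-1 turns 0xCD into a capital I with
# an acute accent, which is the symptom of loading with the wrong charset.
print(repr(CGJ.encode("utf-8").decode("latin-1")))
```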

When loaded, "Ban<CGJ>gor" *should* render as "Bangor" (i.e. the CGJ should be
invisible).  Searching for "Bangor" should succeed.

I've just tried it with m103 on Linux, it renders as "Ban[]gor" (i.e. with a
square in place of the CGJ), and searching for "Bangor" fails.  In other words,
the CGJ is being treated like a "normal" printable character which is not in the
font, instead of as a control character.

Actually, I've just noticed that's not quite true, because if you move the
cursor over "Ban<CGJ>gor", it will not fall between the "n" and the CGJ.  So the
CGJ is presumably being recognised as modifying the "n", but without the correct
behaviour and semantics.
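
The cursor behaviour is consistent with the CGJ being classified as a nonspacing combining mark. A rough Python sketch of that idea follows; `simple_clusters` is a hypothetical, heavily simplified stand-in for real grapheme cluster segmentation (it only attaches combining marks to the preceding character) and says nothing about how OOo actually computes cursor positions.

```python
import unicodedata

CGJ = "\u034f"


def simple_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified illustration only; full segmentation has many more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch  # attach the mark to the preceding cluster
        else:
            clusters.append(ch)
    return clusters


print(unicodedata.name(CGJ))      # COMBINING GRAPHEME JOINER
print(unicodedata.category(CGJ))  # Mn (nonspacing mark)

# The CGJ ends up in the same cluster as the "n", so a cursor that moves
# cluster by cluster has no stop between the "n" and the CGJ.
print(len(simple_clusters("Ban\u034fgor")))  # 6 clusters for 7 code points
```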

By the way, when I tried this originally, using m65 on Windows, the CGJ did not
display but the text of the whole line became corrupted.  Should this be tried
on Windows with a more recent milestone?

Comment 12 ulf.stroehler 2005-06-01 10:11:26 UTC

@divec: thanks for the explanation and example document.
Your evaluation still holds true for e.g. m106, which makes me think that we
simply do not support this control character (at least not in Writer). Other
control chars such as the BOM (Byte Order Mark) work though.
Additionally, could you provide a typical use scenario for this control char, e.g.
in a wordprocessor app, or is it just to be compliant with the Unicode spec? Thx.

US->HDU/SSA: something we want to support in vcl or do we need a decision from
UserEx group first?

Comment 13 hdu@apache.org 2005-06-02 08:25:18 UTC

Since we try to support the latest Unicode standard, supporting U+034F (new
since Unicode4?) doesn't need special approval by UX, but they need to work on
it. This issue should get split up into three sub-issues:
- displaying the U+034F with "show non-printable characters" enabled needs to be
defined => UX
- sorting/searching of text containing U+034F => ER
- not showing U+034F as a "NotDef" box => HDU

Comment 14 ulf.stroehler 2006-04-04 17:19:00 UTC

have to reassign issue.

Comment 15 eric.savary 2006-08-29 14:54:14 UTC

Feature design overrides other issues.

ES->Requirements: Please consider splitting this enhancement into 3 parts as HDU
stated in the comment from Thu Jun 2 00:25:18 -0700 2005.

Reassigned