Issue 78055 - Khmer collation is not interpreted correctly
Summary: Khmer collation is not interpreted correctly
Status: CLOSED FIXED
Alias: None
Product: Internationalization
Classification: Code
Component: i18npool
Version: OOo 2.2.1 RC2
Hardware: All All
Importance: P3 Trivial
Target Milestone: ---
Assignee: stefan.baltzer
QA Contact: issues@l10n
URL:
Keywords:
Duplicates: 86516
Depends on:
Blocks:
 
Reported: 2007-06-03 08:12 UTC by lists
Modified: 2013-08-07 15:01 UTC (History)
CC List: 6 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
Patch for new Khmer collation (710 bytes, patch)
2007-06-03 08:13 UTC, lists
Test case in OpenDocument (11.29 KB, application/vnd.oasis.opendocument.spreadsheet)
2007-06-09 03:33 UTC, lists
Test case in PDF (30.24 KB, application/pdf)
2007-06-09 03:34 UTC, lists
Second test for incorrect collation (8.67 KB, application/vnd.oasis.opendocument.spreadsheet)
2007-06-11 02:40 UTC, lists
Second test for incorrect collation - PDF (20.39 KB, application/pdf)
2007-06-11 02:41 UTC, lists

Description lists 2007-06-03 08:12:05 UTC
OOo interprets the Khmer collation sequence almost correctly. The sequence
given is respected except for its first character: that character, which marks
where the sequence should be anchored, is not respected.

I am attaching a patch that covers all Khmer vowels and should solve the
immediate problem for Khmer (we really need it now). There is, however, a
deeper problem with how the sequence is interpreted, if it is confirmed that
the sequence is not anchored where it should be, after the first character of
the series.

The series is anchored at the last vowel (highest code point for vowels), and
then it includes a series of combinations of all vowels with another character
(basically each vowel followed by the character U+17C7). What is happening is
that all the combined two-character sequences are listed BEFORE the vowels, and
before the anchoring code point (first code point of the sequence).

Incidentally, the last code point of the sequence is the same as the anchoring
code point, but followed by U+17C7.

It would first be interesting to see if the new sequence attached solves the
problem or not, which might give us some more information.
Comment 1 lists 2007-06-03 08:13:36 UTC
Created attachment 45618 [details]
Patch for new Khmer collation
Comment 2 ooo 2007-06-05 11:27:22 UTC
Karl,

are you aware of a general "collation anchoring" problem we (or ICU)
might have, maybe related to combined characters?


Javier,

> all the combined two-character sequences are listed BEFORE the vowels,
> and before the anchoring code-point (first code-point of the sequence).

"listed" where? Do we have a test case document for that? Please attach.

> It would first be interesting to see if the new sequence attached solves
> the problem or not, which might give us some more information.

erm.. didn't you test the attached sequence? Anyway, a test case
document to be sure we test the right thing would be nice.

Thanks
  Eike
Comment 3 lists 2007-06-09 03:32:27 UTC
I think that I did test collation when it was implemented. I have been showing
it off to people. I will have to go back to old versions to see if they work.

I am attaching a test case, in both ODS and PDF format. If you look at the PDF,
you will see the tailoring sequence in column three. Column four is the same
sequence of characters (dependent vowels) sorted by Calc. You can see that the
first character in the sequence has been moved to be the LAST character; the
rest of the sequence is respected.

Basically, what the sequence is saying is that in Khmer, the combination of a
vowel and the sign that looks like a colon should not be sorted after each
vowel, but all single vowels should be sorted first, and then you repeat the
whole sequence of vowels, if they are followed by the colon character.
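
As a rough illustration of the order described above, here is a toy model in
Python. It is not ICU or OOo code. It assumes that "all vowels" means the
dependent vowels U+17B6 to U+17C5 and that each vowel followed by U+17C7 is
treated as a single sorting element; the names VOWELS, REAHMUK, ORDER and
sort_key are made up for this sketch.

```python
# Toy model of the intended Khmer vowel order; this is NOT ICU or OOo code.
# Assumption: "all vowels" means the dependent vowels U+17B6..U+17C5.
VOWELS = [chr(cp) for cp in range(0x17B6, 0x17C6)]
REAHMUK = "\u17C7"  # the sign that looks like a colon

# Primary order: every single vowel first, then every vowel + U+17C7 pair.
ORDER = VOWELS + [v + REAHMUK for v in VOWELS]
WEIGHT = {element: rank for rank, element in enumerate(ORDER)}


def sort_key(text):
    """Map a string of vowels (optionally followed by U+17C7) to primary weights."""
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in WEIGHT:  # vowel + U+17C7 is one element
            key.append(WEIGHT[pair])
            i += 2
        else:
            key.append(WEIGHT[text[i]])  # KeyError for anything but a vowel
            i += 1
    return tuple(key)


samples = [VOWELS[0] + REAHMUK, VOWELS[1], VOWELS[0], VOWELS[-1]]
tailored = sorted(samples, key=sort_key)   # single vowels, then the pair
by_code_point = sorted(samples)            # pair directly after its vowel
```

With plain code point sorting the pair lands directly after its own vowel;
with the tailored key it only appears once all single vowels are listed.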

These are dependent vowels. When you display them alone, without a consonant, a
circle is placed in the location where the consonant would be if it was there.

Looking at columns one and two, which hold a number of character combinations,
I start a sequence of words with a consonant. If the same consonant appears
twice (the first consonant of the language), that word should come right after
the single consonant when sorting correctly. Any words that have a vowel
attached to the first consonant should be listed afterwards.

But this does not happen. My conclusion is that all the composed vowels that are
in the sequence are being sorted SECONDARILY, as if they were accents or marks,
and therefore they are compared to the empty single consonant, and not to the
consonant with a simpler vowel (which should come first if these composed vowels
were primarily sorted).
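
The difference between the two behaviours can be sketched with a toy two-level
sort key in Python. This is only a model of the suspected symptom, not ICU's
algorithm; the weights, the WORDS table and the function names are invented
for the illustration, and it assumes consonants get lower primary weights than
vowels.

```python
# Toy two-level collation keys; this is NOT how ICU stores weights.
KA = "\u1780"                 # first consonant of the language
AA = "\u17B6"                 # a simple dependent vowel
AA_REAHMUK = "\u17B6\u17C7"   # composed vowel: AA followed by U+17C7

# Each word is given as a tuple of already-split sorting elements.
WORDS = {
    "ka": (KA,),
    "ka+ka": (KA, KA),
    "ka+aa": (KA, AA),
    "ka+aa+reahmuk": (KA, AA_REAHMUK),
}


def make_key(elements, primary, secondary):
    """Primary weights are compared first; secondary weights only break ties."""
    p = tuple(primary[e] for e in elements if e in primary)
    s = tuple(secondary.get(e, 0) for e in elements)
    return (p, s)


def sort_words(primary, secondary):
    return sorted(WORDS, key=lambda name: make_key(WORDS[name], primary, secondary))


# Composed vowel has its own primary weight: the expected behaviour.
expected = sort_words({KA: 1, AA: 2, AA_REAHMUK: 3}, {})
# Composed vowel has only a secondary weight: the suspected behaviour.
observed = sort_words({KA: 1, AA: 2}, {AA_REAHMUK: 1})
```

In the second case the word with the composed vowel is primary-equal to the
bare consonant, so it sorts directly after it and before the doubled
consonant.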

My impression is that the sequence is being sorted secondarily, and that the
first vowel in the series (which I called the anchoring point, because all
things should be listed after it) is not being taken into account.

I believe that the sequence is correct.

I am nevertheless testing a complete sequence that includes all the variables.
Again, it will not work if the vowels are sorted secondarily. The new sequence
will be included in Pavel's build for 2.2.1RC3 and I will be able to test it soon.

Comment 4 lists 2007-06-09 03:33:55 UTC
Created attachment 45769 [details]
Test case in OpenDocument
Comment 5 lists 2007-06-09 03:34:56 UTC
Created attachment 45770 [details]
Test case in PDF
Comment 6 lists 2007-06-11 02:37:57 UTC
I have tested the patch that I attached (in Pavel's 2.2.1RC3). 

The patch does not solve the problem, but gives further indication of what is
happening. 

The sorting is changed by the patch, but the problem persists, in a different
manner, but consistent with the prior problem.

All characters after the first one in the sequence (anchoring character) are
sorted as per the new sequence, but they are sorted secondarily (or tertiarily)
with very little weight (as diacritics). 

The first character in the series is the only one sorted correctly, and it is
sorted primarily where it should be. Now all vowels except the first one are
sorted secondarily as diacritics, and the first one is sorted after them,
primarily, as a second full character in the series.

The behaviour is consistent with what we had with the other tailoring sequence.
Instead of being sorted primarily after the first character in the collation
sequence, as they should be, all characters and combinations in the sequence
(excluding the first one) are sorted secondarily or tertiarily, starting at a
different point of much lower weight.

I am attaching an example of the behaviour with the new sequence.

We should test the collation sequences for other languages, and especially
sequences that sort characters or combinations primarily.
Comment 7 lists 2007-06-11 02:40:35 UTC
Created attachment 45790 [details]
Second test for incorrect collation
Comment 8 lists 2007-06-11 02:41:27 UTC
Created attachment 45791 [details]
Second test for incorrect collation - PDF
Comment 9 lists 2007-06-11 12:54:37 UTC
I have tested 2.1 and 2.2 in Windows and Linux.

In LINUX it works correctly, the issue only exists for collation on Windows.

On the computers with 2.1 and 2.2 that I have tested, the collation sequence is
not interpreted at all on Windows. On Windows, collation strictly follows UCA
order.

On my computer, 2.2.1 applies a sequence that is incorrect, but different from
UCA.
Comment 10 ooo 2007-06-11 16:57:02 UTC
Hi Javier,

So, adding these up:

> I believe that the sequence is correct.
> [...]
> The patch does not solve the problem, but gives further indication of what is
> happening.
> 
> The sorting is changed by the patch, but the problem persists, in a different
> manner, but consistent with the prior problem.
> [...]
> In LINUX it works correctly, the issue only exists for collation on Windows.

Is that "Linux works, Windows doesn't" for versions only without the
patch? Or also with patch? And should we integrate the patch, will it
change anything for the better? 

Puzzled
  Eike
Comment 11 lists 2007-06-12 03:15:12 UTC
The patch is useless, it just produces a different type of malfunction.

I believe that the problem is independent of Khmer. What I am seeing points to
the sequence either not being applied or being applied incorrectly on Windows,
independently of what the sequence is. It works correctly on Linux.

I don't know enough about the source to know how collation is done differently
on Windows and on Linux, or how the interpretation of the file can differ
between the two, or whether something like Linux line endings or maybe a BOM
would make the file be understood differently, but I assume that after
compilation these things should work identically on both platforms.

Or is one based on ICU, but not the other?
Comment 12 subirbp 2007-06-15 07:30:28 UTC
Add to CC.
Comment 13 lists 2007-06-19 13:28:31 UTC
I have done some more testing and I can confirm that it is a regression
introduced in 2.2. It works correctly in 2.1, but not in 2.2 and 2.2.1. It
happens only on Windows when the Khmer locale is selected (if it is not
selected, UCA is applied).

Tested with two different collation sequences (the present one and the one I
proposed in the patch here), the result is consistent: the sequence is created
with very little weight (as if its characters were diacritics). This is
different from what I thought before (that it was sorted secondarily or
tertiarily). The sequence (excepting the first character) is anchored at or
near the beginning of the UCA order, and not on the first (reset) character of
the sequence, as expected and as it happened in the past.
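
The two anchoring behaviours can be contrasted with a small Python sketch. It
is only a toy model with placeholder letters, not ICU code and not the real
Khmer data; anchor_after and anchor_at_start are hypothetical helper names.

```python
# Toy model of where a tailored sequence ends up; this is NOT ICU code.
def anchor_after(base_order, anchor, sequence):
    """Expected: the sequence is inserted right after the reset character."""
    remaining = [e for e in base_order if e not in sequence]
    cut = remaining.index(anchor) + 1
    return remaining[:cut] + sequence + remaining[cut:]


def anchor_at_start(base_order, sequence):
    """Observed on Windows: the sequence lands at the start of the base order."""
    remaining = [e for e in base_order if e not in sequence]
    return sequence + remaining


base = ["a", "b", "c", "d", "e"]   # stands in for the UCA order
sequence = ["x", "y"]              # stands in for the tailored elements

expected = anchor_after(base, "c", sequence)
observed = anchor_at_start(base, sequence)
```

In both cases the order inside the sequence is kept, which matches what the
test documents show; only the place where the sequence is attached differs.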
Comment 14 ooo 2007-10-17 14:56:21 UTC
As I won't find time soon to dive into this I'm reassigning the issue to Karl,
hopefully he'll have some insight.

@khong: if possible, please fix for OOo2.4
Comment 15 karl.hong 2008-01-08 04:55:30 UTC
ICU has a problem handling image/binary rules for certain code ranges, while
it has no problem with text rules.

For a small collator tailoring string, compiling takes no time. So I moved this
type of tailoring string to localedata, and the rule is passed to the ICU
collator as a text rule. This not only bypasses the ICU bug, but also
simplifies adding small tailorings. For large tailorings, like the ones for
Chinese collators, we will still use compiled binary rules to improve
performance. ICU has no problem with binary rules in the Chinese code range.
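
The selection described above could look roughly like the following Python
sketch. It is not the i18npool code: CollationSource, pick_collation_source,
the locale keys and the abbreviated rule string are all hypothetical, and the
sketch only models the preference for a text rule over a compiled image.

```python
# Hypothetical sketch of "text rule first, compiled binary otherwise".
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class CollationSource:
    kind: str                     # "text" or "binary"
    payload: Union[str, bytes]    # rule string or compiled image


def pick_collation_source(locale, text_rules, binary_rules) -> Optional[CollationSource]:
    """Prefer a small text rule from locale data, else a compiled binary rule."""
    if locale in text_rules:
        return CollationSource("text", text_rules[locale])
    if locale in binary_rules:
        return CollationSource("binary", binary_rules[locale])
    return None  # no tailoring: the caller would fall back to plain UCA


# Placeholder data: an abbreviated rule for Khmer, a dummy image for Chinese.
text_rules = {"km": "& \u17C5 < \u17B6\u17C7"}
binary_rules = {"zh": b"\x00"}

khmer = pick_collation_source("km", text_rules, binary_rules)
chinese = pick_collation_source("zh", text_rules, binary_rules)
english = pick_collation_source("en", text_rules, binary_rules)
```

Keeping both paths means a small tailoring can be added by editing locale data
only, while the large tailorings keep the startup benefit of a precompiled
image.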
Comment 16 ooo 2008-01-08 11:22:13 UTC
Karl,
The "problem to handle image/binary rule for certain code ranges" sounds like an
ICU bug, doesn't it? If so, could you please file a bug against ICU and tell me
the bug ID?
Thanks
  Eike
Comment 17 lists 2008-01-08 14:29:08 UTC
I confirm that the test build works as expected: UCA sorting with en-US locale,
and correct tailored sequence with Khmer locale.
Comment 18 karl.hong 2008-01-08 16:41:03 UTC
Eike,
I have filed bug 6131:
http://bugs.icu-project.org/trac/ticket/6131#preview
Karl
Comment 19 karl.hong 2008-01-08 19:20:20 UTC
Ready for QA.
Comment 20 ooo 2008-01-09 12:23:12 UTC
Karl, thanks for filing the bug. I just set up a wiki page to collect information
about ICU bugs and patches and added that, see
http://wiki.services.openoffice.org/wiki/ICU/bugs_and_patches
Eike
Comment 21 karl.hong 2008-01-11 07:38:54 UTC
An ICU engineer helped to identify the problem in our code.

The collator constructor from an image rule needs a UCA-based collator as
fallback; we used to pass a language-specific collator.

I fixed it. But I will keep the previous fix, so that we have the option to use
a text rule from the locale data file, since that is simpler when the
tailoring rule is small.
Comment 22 stefan.baltzer 2008-01-16 16:38:24 UTC
Verified in CWS i18n39.
Comment 23 karl.hong 2008-02-29 05:55:56 UTC
*** Issue 86516 has been marked as a duplicate of this issue. ***
Comment 24 thorsten.ziehm 2009-07-20 14:55:41 UTC
This issue was closed automatically and was not rechecked in a current version
of OOo. The fix should have been integrated in OOo for more than half a year.
If you think this issue isn't fixed in a current version (OOo 3.1), please
reopen it and change the field 'Target Milestone' accordingly.

If you want to download a current version of OOo =>
http://download.openoffice.org/index.html
If you want to know more about the handling of fixed/verified issues =>
http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues