Apache OpenOffice (AOO) Bugzilla – Issue 78055
Khmer collation is not interpreted correctly
Last modified: 2013-08-07 15:01:53 UTC
The Khmer collation sequence is interpreted by OOo almost correctly. The given sequence is respected except for its first character: this first character, which marks where the sequence should be anchored, is not respected.

I am attaching a patch that covers all Khmer vowels, and which should solve the immediate problem for Khmer (and we really need it now). But there is a deeper problem with the interpretation of the sequence, if it is confirmed that the sequence is not anchored where it should be, after the first character of the series.

The series is anchored at the last vowel (highest code-point for vowels), and then it includes a series of combinations of all vowels with another character (basically each vowel followed by the character 17C7). What is happening is that all the combined two-character sequences are listed BEFORE the vowels, and before the anchoring code-point (first code-point of the sequence). Incidentally, the last code point of the sequence is the same as the anchoring code-point, but followed by 17C7.

It would first be interesting to see if the new sequence attached solves the problem or not, which might give us some more information.
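To make the shape of the sequence concrete, here is a minimal sketch in Python that builds a text rule of the kind described above: a reset at the last vowel, followed by each vowel combined with 17C7 as a primary difference. It assumes that the vowels are the dependent vowel signs U+17B6 to U+17C5 in code-point order; the helper name `build_tailoring` is made up for illustration, and the real sequence in the OOo locale data may differ.

```python
# Assumption: the Khmer dependent vowel signs are U+17B6..U+17C5, taken here
# in code-point order. The real OOo tailoring sequence may differ.
VOWELS = [chr(cp) for cp in range(0x17B6, 0x17C6)]
SIGN_17C7 = "\u17c7"  # the character referred to as 17C7 in this issue


def build_tailoring(vowels, sign):
    """Hypothetical helper: return an ICU-style text rule.

    '&' resets at the anchoring character (the last vowel); every
    following ' < ' entry is a primary difference after that anchor.
    """
    anchor = vowels[-1]
    return "&" + anchor + "".join(" < " + v + sign for v in vowels)


rule = build_tailoring(VOWELS, SIGN_17C7)

# Print in escaped form so the output does not depend on a Khmer font
# or on the console encoding.
print(rule.encode("unicode_escape").decode("ascii"))
```

With such a rule, every vowel + 17C7 combination should sort after the anchor, and the last entry is the anchor itself followed by 17C7, as described above.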
Created attachment 45618 [details] Patch for new Khmer collation
Karl, are you aware of a general "collation anchoring" problem we (or ICU) might have, maybe related to combined characters?

Javier,

> all the combined two-character sequences are listed BEFORE the vowels,
> and before the anchoring code-point (first code-point of the sequence).

"Listed" where? Do we have a test case document for that? Please attach.

> It would first be interesting to see if the new sequence attached solves
> the problem or not, which might give us some more information.

Erm... didn't you test the attached sequence? Anyway, a test case document to be sure we test the right thing would be nice.

Thanks
  Eike
I think that I did test collation when it was implemented. I have been showing it off to people. I will have to go back to old versions to see if they work.

I am attaching a test case, in both ODS and PDF format. If you look at the PDF, you will see the tailoring sequence in column three. Column four is the same sequence of characters (dependent vowels) sorted by Calc. You can see that the first character in the sequence has been moved to become the LAST character, while the rest of the sequence is respected.

Basically, what the sequence says is that in Khmer, the combination of a vowel and the sign that looks like a colon should not be sorted right after each vowel. Instead, all single vowels should be sorted first, and then the whole sequence of vowels is repeated, each followed by the colon character. These are dependent vowels. When you display them alone, without a consonant, a circle is placed in the location where the consonant would be if it were there.

Columns one and two look at a number of character combinations. I start a sequence of words with a consonant. If I have the same consonant twice (the first consonant of the language), it should come right after the single consonant when sorting correctly. Any words that have a vowel attached to the first consonant should be listed afterwards. But this does not happen.

My conclusion is that all the composed vowels in the sequence are being sorted SECONDARILY, as if they were accents or marks, and therefore they are compared to the bare single consonant, and not to the consonant with a simpler vowel (which should come first if these composed vowels were sorted primarily). My impression is that the sequence is being sorted secondarily, and that the first vowel in the series (which I called the anchoring point, because everything should be listed after it) is not being taken into account.

I believe that the sequence is correct. I am nevertheless testing a complete sequence that includes all the variables.

Again, it will not work if the vowels are sorted secondarily. The new sequence will be included in Pavel's build for 2.2.1RC3 and I will be able to test it soon.
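The difference between sorting a composed vowel primarily and sorting it secondarily can be illustrated with a toy two-level sort key in Python. This is only a sketch under stated assumptions: the weights are invented, the tables `CORRECT` and `BUGGY` and the helper names are hypothetical, and nothing here reproduces ICU's real weights or code. It only models the symptom: once the vowel + 17C7 combination loses its primary weight, a word containing it is compared with the bare consonant and lands before the doubled consonant.

```python
KA = "\u1780"         # first consonant of the language
AA = "\u17b6"         # a simple dependent vowel
SIGN_17C7 = "\u17c7"  # the sign that looks like a colon

# Toy weights (primary, secondary); all numbers are invented.
CORRECT = {KA: (1, 0), AA: (2, 0), AA + SIGN_17C7: (3, 0)}
# Suspected behaviour: the combination has no primary weight at all.
BUGGY = {KA: (1, 0), AA: (2, 0), AA + SIGN_17C7: (0, 1)}


def elements(word, table):
    """Split a word into collation elements, preferring 2-char contractions."""
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in table:
            out.append(table[pair])
            i += 2
        else:
            out.append(table[word[i]])
            i += 1
    return out


def sort_key(word, table):
    """Compare all primary weights first, then all secondary weights."""
    ces = elements(word, table)
    primaries = tuple(p for p, _ in ces if p != 0)
    secondaries = tuple(s for _, s in ces)
    return (primaries, secondaries)


words = [KA + AA + SIGN_17C7, KA + AA, KA + KA, KA]
correct_order = sorted(words, key=lambda w: sort_key(w, CORRECT))
buggy_order = sorted(words, key=lambda w: sort_key(w, BUGGY))


def show(ws):
    """Render words as code points so no Khmer font is needed."""
    return [" ".join("U+%04X" % ord(c) for c in w) for w in ws]


print(show(correct_order))
print(show(buggy_order))
```

In this model the correct table gives consonant, doubled consonant, consonant + vowel, consonant + vowel + 17C7, while the buggy table moves consonant + vowel + 17C7 directly after the bare consonant.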
Created attachment 45769 [details] Test case in OpenDocument
Created attachment 45770 [details] Test case in PDF
I have tested the patch that I attached (in Pavel's 2.2.1RC3). The patch does not solve the problem, but gives further indication of what is happening.

The sorting is changed by the patch, but the problem persists, in a different manner, but consistent with the prior problem. All characters after the first one in the sequence (the anchoring character) are sorted as per the new sequence, but they are sorted secondarily (or tertiarily) with very little weight (as diacritics). The first character in the series is the only one sorted correctly, and it is sorted primarily where it should be.

Now all vowels except the first one are sorted secondarily as diacritics, and the first one is sorted after them, primarily, as a second full character in the series.

The behaviour is consistent with what we had with the other tailoring sequence. Instead of being sorted primarily after the first character in the collation sequence, as they should be, all characters and combinations in the sequence (excluding the first one) are sorted secondarily or tertiarily, starting at a different point of much lower weight.

I am attaching an example of the behaviour with the new sequence.

We should test the collation sequences for other languages, and especially sequences that sort characters or combinations primarily.
Created attachment 45790 [details] Second test for incorrect collation
Created attachment 45791 [details] Second test for incorrect collation - PDF
I have tested 2.1 and 2.2 on Windows and Linux. In LINUX it works correctly, the issue only exists for collation on Windows. On the computers with 2.1 and 2.2 that I have tested, the collation sequence is not interpreted at all on Windows; collation there strictly follows UCA order. On my computer, 2.2.1 applies a sequence that is incorrect, but different from UCA.
Hi Javier,

So, adding these up:

> I believe that the sequence is correct.
> [...]
> The patch does not solve the problem, but gives further indication of what is
> happening.
>
> The sorting is changed by the patch, but the problem persists, in a different
> manner, but consistent with the prior problem.
> [...]
> In LINUX it works correctly, the issue only exists for collation on Windows.

Is that "Linux works, Windows doesn't" only for versions without the patch? Or also with the patch? And if we integrate the patch, will it change anything for the better?

Puzzled
  Eike
The patch is useless, it just produces a different type of malfunction. I believe that the problem is independent of Khmer. What I am seeing points to the sequence either not being applied or being applied incorrectly on Windows, independently of what the sequence is. It works correctly on Linux.

I don't know enough about the source to know how collation is done differently on Windows and on Linux, or how the interpretation of the file can differ between them, or whether something like Linux line-endings or maybe a BOM would make the file be understood differently, but I assume that after compilation these things should work identically on both platforms. Or is one based on ICU, but not the other?
Add to CC.
I have done some more testing and I can confirm that it is a regression introduced in 2.2. It works correctly in 2.1, but not in 2.2 and 2.2.1. It happens only on Windows when the Khmer locale is selected (if it is not selected, UCA is applied).

Consistently, tested with two different collation sequences (the present one and the one I proposed in the patch here), it creates a sequence with very little weight (as if the characters were diacritics), which is different from what I thought before (that it was sorted secondarily or tertiarily). The sequence (except the first character) is anchored at or near the beginning of the UCA, and not on the first (reset) character of the sequence, as expected and as happened in the past.
As I won't find time soon to dive into this I'm reassigning the issue to Karl, hopefully he'll have some insight. @khong: if possible, please fix for OOo2.4
ICU has a problem handling image/binary rules for certain code ranges, while it has no problem with text rules. For a small collator tailoring string, the compiling takes no time. So I moved this type of tailoring string to localedata, and the rule will be passed to the ICU collator as a text rule. This not only bypasses the ICU bug, but also simplifies adding small tailorings. For large tailorings, like the ones for Chinese collators, we will still use compiled binary rules to improve performance. ICU has no problem with binary rules in the Chinese code range.
Karl,

The "problem to handle image/binary rule for certain code ranges" sounds like an ICU bug, doesn't it? If so, could you please file a bug against ICU and tell me the bug ID?

Thanks
  Eike
I confirm that the test build works as expected: UCA sorting with en-US locale, and correct tailored sequence with Khmer locale.
Eike,

I have filed bug 6131: http://bugs.icu-project.org/trac/ticket/6131#preview

Karl
Ready for QA.
Karl,

thanks for filing the bug. I just set up a wiki page to collect information about ICU bugs and patches and added it there, see http://wiki.services.openoffice.org/wiki/ICU/bugs_and_patches

  Eike
An ICU engineer helped to identify the problem in our code. The Collator constructor from an image rule needs a UCA-based collator as fallback; we used to use a language-specific collator. I fixed it. But I would keep the previous fix, so that we have the option to use a text rule from the locale data file, since that is simpler when the tailoring rule is small.
Verified in CWS i18n39.
*** Issue 86516 has been marked as a duplicate of this issue. ***
This issue is closed automatically and wasn't rechecked in a current version of OOo. The fix should have been integrated in OOo for more than half a year. If you think this issue isn't fixed in a current version (OOo 3.1), please reopen it and change the field 'Target Milestone' accordingly.

If you want to download a current version of OOo => http://download.openoffice.org/index.html

If you want to know more about the handling of fixed/verified issues => http://wiki.services.openoffice.org/wiki/Handle_fixed_verified_issues