Apache OpenOffice (AOO) Bugzilla – Issue 78729
Collation tailored to Khoe
Last modified: 2017-05-20 11:13:23 UTC
When doing a sort using upper ASCII characters (128-255), the presence of trailing numbers affects the sorting results. This is very strange and probably quite easy to correct. You can try the experiment yourself. It's quite surprising. From the OpenOffice.org sort order for the Latin 1 characters you see that the character Ô appears before the character Õ. In an empty spreadsheet put aÕ in cell A1 and aÔ in A2. Do a sort based on column A and the contents of cells A1 and A2 change places. This is exactly what you expect, since Õ is ordered AFTER Ô in the collating sequence. OK, now put aÔ3 in cell A3 and aÕ2 in A4 and do the sort again. You would expect aÔ3 to remain above aÕ2 in the sorted version, right? But it doesn't! Now enter aO in A5 and sort again. aO appears at the top of the list. Edit cell A1 (now containing aO), changing it to aO9, and do another sort. Boom! Now aO9 is at the bottom of the list. The problem seems clear: the addition of numbers changes the sensitivity of the sort. If there are no numbers, then the characters O, Ô and Õ are distinct and sorted in the order given by the table. If you add a following number, they all merge and the sort is then based on the following number. To summarise: aO5, aÕ2, aÔ4 will give a bad sort. If the numbers are the same, as in aO5, aÕ5, aÔ5, then the sort is good, BUT if anything follows the "5" in the previous example then the sort is bad, so aO5z, aÕ5a, aÔ5g is sorted based on the FINAL character. The sort comes out as aÕ5a, aÔ5g, aO5z. I don't think this can be a feature, so I guess it must be a bug.
I've done some further checking and in fact *any* character, number or letter, will affect the higher ASCII characters. aáb aâc aàd aäe These sequences get sorted as shown above, according to the 3rd character and not the 2nd.
Hi, please attach a document showing your problem. Frank
Created attachment 47339: sample spreadsheet
Sorry, Frank. A power cut interrupted me. I have sent a small (20 row) spreadsheet to illustrate the problem. The sort is being done on Column A. Column A is a mapping of Columns B & C in order to implement a complicated collating sequence quite far removed from the actual form of the data entered. This is a dictionary of Khoe (a Namibian language) with grammatical information and translations into Afrikaans. If you look at the spreadsheet starting at row 9 you see the problem. The first 7 characters are identical and the problem appears in column 8 (the 8th character). The sort was done with case sensitivity switched on. Column 8 sometimes has ó (rows 10, 13, etc.) and sometimes ò (rows 11, 12, 15 ...). I am not interested in the intrinsic value of the characters in Column A, only their ASCII codes. I am using them to do a sort which I need done strictly according to their ASCII codes. I understand that OOo uses Unicode conventions to do this sort, so I guess it really is a feature. I am asking to be able to turn this feature off and do a sort strictly by ASCII code and nothing else (much like turning off case insensitivity). I think this should be easy to do and would make OO much more useful as a tool for working with "exotic" languages. I hope this is clear. Thanks, Jonathan
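For illustration only, the strict code-based ordering requested in this comment can be sketched in Python, whose built-in string comparison goes code point by code point and applies no locale collation. This is a sketch of the requested behaviour, not an existing OpenOffice.org option; the sample strings are taken from the earlier comments, and the variable names are made up for the example.

```python
# Python compares str values code point by code point (no locale collation),
# so a plain sorted() yields the "strictly by code" order asked for here.
samples = ["aÕ2", "aÔ3", "aO9"]

by_code = sorted(samples)

for s in by_code:
    # Show each string next to the hex code point of every character.
    print(s, [f"{ord(c):02x}" for c in s])
```

With this ordering the second character always decides before any trailing digit is looked at, which is the result the reporter expected from the spreadsheet sort.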
Hi Jonathan, The sort results you encounter depend on the collation algorithm used with a specific language/locale. For en_US, as for many other Western languages, this is the default Unicode Collation Algorithm, with the tertiary weight ignored, even if case sensitivity is switched on. In practice this means that accented characters are sorted at the position of the unaccented character, unless defined differently for that language. For example, if you change the sort language to Danish you'll see that in your example strings containing 'Ø' are treated differently. If accented characters have identical "light" weights and an additional character is added, that character of course adds to the weight and moves a string to another position. Btw, there are no "higher ASCII" characters. What you're referring to is some text encoding, for example Latin1, that you are used to, which represents some of the Unicode characters in values between decimal 128 and 255. For Latin1 and Unicode those values happen to be identical by design. So, what you are actually requesting is not some "ignore collation and sort Unicode code points" feature (which probably wouldn't be correct), but a collation tailored to Khoe instead. As you seem to be somewhat familiar with code points and maybe programming, you may want to take a look at the i18npool/source/collator/data/ directory that contains some tailored collations. Anyway, I'm setting the 'needhelp' keyword, since I'm not familiar with the Khoe language. Thanks Eike
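The effect described in this comment can be approximated with a two-level sort key: compare the unaccented base letters first, and consult the accents only when the base letters are identical. The Python sketch below is a deliberate simplification and an assumption for illustration. It is not the Unicode Collation Algorithm and not the OpenOffice.org collator (for instance, it does not handle case the way a real collator would), and the helper names `strip_accents` and `two_level_key` are hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Return text with combining marks removed (e.g. 'Ô' -> 'O')."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def two_level_key(text):
    """Hypothetical key: base letters first, accents only as a tie-breaker."""
    return (strip_accents(text), unicodedata.normalize("NFC", text))


# Identical base letters ("aO"), so the accent decides: Ô before Õ.
print(sorted(["aÕ", "aÔ"], key=two_level_key))
# -> ['aÔ', 'aÕ']

# Base letters differ only in the last character, so that character decides
# and the accents are never consulted.
print(sorted(["aO5z", "aÕ5a", "aÔ5g"], key=two_level_key))
# -> ['aÕ5a', 'aÔ5g', 'aO5z']
```

The second result matches the order reported in the first comment, which is consistent with the explanation that trailing characters outweigh accent differences.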
Thanks a lot for your comments, Eike. I guess my age is showing. Back in the days before Unicode (when this project was started) all we had were 255 codes, some of which, of course, were reserved. I guess I never came to grips with Unicode and still use the old-fashioned terminology. In my own defense, when I feed these weird symbols into a hex editor, it does give me a numeric value for the character in question. As I said, the actual shape of the symbol that appears is of no interest to me; I only wish to use the code (the one that comes up in a hex editor) for sorting purposes. OK, let's say I want to sort using the ISO-8859-15 encoding. I have a string that looks like this (I don't know if I can reproduce the encoding): ÔÔ24!ÔÔ24!ie& and it would have the following codes in hex: d4 d4 32 34 21 d4 d4 32 34 21 69 65 26. If I used a UTF8 encoding then things would look a bit different: ��24!��24!ie& and the codes would come out as: 00 00 32 34 21 00 00 32 34 21 69 65 26, that is, all the codes above 7f (hex) come out as 00. Fine, if I can sort relative to an encoding, then there is no ambiguity about what actual code I want. I hope this is clearer. Regarding the summary, the collation is not only for Khoe but for the thousands of languages with no written tradition which may be worked on at some point. A sort based purely on codes with respect to a given character encoding would have many other applications as well. I would suggest changing the summary to something like "General code-based collations"
Reset assignee to the default "issues@openoffice.apache.org".
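As a rough check of the byte values discussed in this comment, the Python sketch below encodes the sample string under ISO-8859-15 and UTF-8 and then sorts a few strings by their encoded bytes. It is only an illustration of sorting "relative to an encoding"; the helper name `encoded_key` and the three-item word list are assumptions made for the example. Note that a standard UTF-8 encoder produces the two bytes c3 94 for each Ô rather than 00, so the 00 values quoted above presumably come from how the hex editor read the text.

```python
text = "ÔÔ24!ÔÔ24!ie&"

# Byte values of the same text under two encodings
# (bytes.hex with a separator needs Python 3.8 or later).
latin9_hex = text.encode("iso-8859-15").hex(" ")
utf8_hex = text.encode("utf-8").hex(" ")
print(latin9_hex)  # d4 d4 32 34 21 d4 d4 32 34 21 69 65 26
print(utf8_hex)    # c3 94 c3 94 32 34 21 c3 94 c3 94 32 34 21 69 65 26


def encoded_key(encoding):
    """Hypothetical helper: sort key that orders strings by their bytes in `encoding`."""
    return lambda s: s.encode(encoding)


# The euro sign is byte a4 in ISO-8859-15 but code point U+20AC in Unicode,
# so the two orderings disagree about where it belongs relative to é.
words = ["é", "€", "z"]
print(sorted(words, key=encoded_key("iso-8859-15")))  # ['z', '€', 'é']
print(sorted(words))                                  # ['z', 'é', '€']
```

The differing position of the euro sign shows why a code-based sort has to name its encoding: the same characters can carry different numeric codes, and therefore a different order, from one encoding to the next.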