Issue 78729 - Collation tailored to Khoe
Summary: Collation tailored to Khoe
Status: CONFIRMED
Alias: None
Product: Internationalization
Classification: Code
Component: i18npool
Version: OOo 2.2.1
Hardware: All All
Importance: P3 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords: needhelp
Depends on:
Blocks:
 
Reported: 2007-06-21 06:38 UTC by jdkaye
Modified: 2017-05-20 11:13 UTC

See Also:
Issue Type: ENHANCEMENT
Latest Confirmation in: ---
Developer Difficulty: ---


Attachments
sample spreadsheet (11.24 KB, text/plain)
2007-08-06 16:24 UTC, jdkaye

Description jdkaye 2007-06-21 06:38:45 UTC
When doing a sort using upper ascii characters (128-255) the presence of
trailing numbers affects the sorting results. This is very strange and probably
quite easy to correct.

You can try the experiment yourself. It's quite surprising. From the
OpenOffice.org sort order for the Latin 1 characters you see that the character
Ô appears before the character Õ. In an empty spreadsheet put aÕ in cell A1 and
aÔ in A2. Do a sort based on column A and the contents of cells A1 and A2 change
places. This is exactly what you expect since Õ is ordered AFTER Ô in the
collating sequence. Ok now put aÔ3 in cell A3 and aÕ2 in A4 and do the sort
again. You would expect that aÔ3 remains above in the sorted version right? But
it doesn't! Now enter aO in A5 and sort again. aO appears at the top of the
list. Edit cell A1 (now containing aO) changing it to aO9 and do another sort.
Boom! Now aO is at the bottom of the list.

The problem seems clear: the addition of numbers changes the sensitivity of
the sort. If there are no numbers then the characters O, Ô and Õ are
distinct and sorted in the order given by the table. If you add a following
number they all merge and the sort is then based on sorting the following number.
To summarise: aO5 aÕ2 aÔ4 will give a bad sort. If the numbers are the same
as in aO5 aÕ5 aÔ5 then the sort is good BUT if anything follows the "5" in
the previous example then the sort is bad so aO5z aÕ5a aÔ5g is sorted based
on the FINAL character. The sort comes out as aÕ5a, aÔ5g, aO5z.

I don't think this can be a feature so I guess it must be a bug.
Comment 1 jdkaye 2007-06-21 14:47:54 UTC
I've done some further checking and in fact *any* character, number or letter,
will affect the higher ascii characters.
aáb aâc aàd aäe
These sequences get sorted as shown above, according to the 3rd character and
not the 2nd.

Comment 2 frank 2007-08-06 12:43:43 UTC
Hi,

please attach a document showing your problem.

Frank
Comment 3 jdkaye 2007-08-06 16:24:17 UTC
Created attachment 47339
sample spreadsheet
Comment 4 jdkaye 2007-08-06 16:45:38 UTC
Sorry, Frank. A power cut interrupted me. I have sent a small (20 row)
spreadsheet to illustrate the problem. The sort is being done on Column A. Column
A is a mapping of Column B & C in order to implement a complicated collating
sequence quite far removed from the actual form of the data entered. This is a
dictionary of Khoe (a Namibian language) with grammatical information and
translations into Afrikaans.

If you look at the spreadsheet starting at row 9 you see the problem. The first 7
characters are identical and the problem appears in column 8. The sort was done
with case sensitivity switched on. Column 8 sometimes has ó (rows 10, 13, etc)
and sometimes ò (rows 11, 12, 15 ...). I am not interested in the intrinsic
value of the characters in Column A, only their ascii codes. I am using them to
do a sort which I need done strictly according to their ascii codes. I
understand that OOo uses unicode conventions to do this sort so I guess it
really is a feature. I am asking to be able to turn this feature off and do a
sort strictly by ascii code and nothing else (much like turning off case
insensitivity). I think this should be easy to do and would make OO much more
useful as a tool for working with "exotic" languages.
I hope this is clear.
Thanks,
Jonathan
Comment 5 ooo 2007-08-07 16:32:28 UTC
Hi Jonathan,

The sort results you encounter depend on the collation algorithm used
with a specific language/locale. For en_US, as for many other Western
languages, this is the default Unicode Collation Algorithm, with the
tertiary weight ignored, even if case sensitivity is switched on. In
practice this means that accented characters are sorted at the position
the unaccented character has, unless defined differently for that
language. For example, if you change the sort language to Danish you'll
see that in your example strings containing 'Ø' are treated differently.
If accented characters have identical "light" weights and an additional
character is added, that character of course adds to the weight and
moves a string to another position.
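
The behaviour described above can be reproduced with a small Python sketch.
It is a simplified two-level model, not the Unicode Collation Algorithm
itself and not the OpenOffice.org collator. The names two_level_key and
collate are made up for illustration, and the sketch assumes that accents
are only consulted when the base letters of the whole string tie.

```python
import unicodedata


def two_level_key(text):
    """Return (primary, secondary) for a simplified two-level comparison.

    primary   = the string with all combining accents removed
    secondary = the accents alone, used only to break primary ties
    This is an illustrative model, not the real UCA weights.
    """
    decomposed = unicodedata.normalize("NFD", text)
    primary = "".join(c for c in decomposed if not unicodedata.combining(c))
    secondary = "".join(c for c in decomposed if unicodedata.combining(c))
    return (primary, secondary)


def collate(words):
    """Sort words with the simplified two-level key."""
    return sorted(words, key=two_level_key)


# Without trailing characters the accents decide the order.
print(collate(["a\u00d5", "a\u00d4"]))
# With trailing digits the digits decide, because they are primary weights.
print(collate(["a\u00d43", "a\u00d52"]))
```

In this model aÔ3 and aÕ2 both have the base letters "aO" followed by a
digit, so the digit is compared before the accent is ever looked at.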

Btw, there are no "higher ASCII" characters. What you're referring to is some
text encoding, for example Latin1, that you are used to, which
represents some of the Unicode characters in values between decimal 128
and 255. For Latin1 and Unicode those values happen to be identical by
design.
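
A quick way to check the Latin1/Unicode correspondence is the Python sketch
below. The function name is hypothetical and the check relies only on
Python's built-in "latin-1" codec.

```python
def latin1_matches_unicode():
    """Return True if every Latin1 byte value equals its Unicode code point."""
    return all(ord(bytes([b]).decode("latin-1")) == b for b in range(256))


# Byte 0xD4 decodes to U+00D4 (O with circumflex), byte 0xD5 to U+00D5.
print(latin1_matches_unicode())
print(hex(ord(bytes([0xD4]).decode("latin-1"))))
```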

So, what you are actually requesting is not some "ignore collation and
sort Unicode code points" feature (which probably wouldn't be correct),
but you want a collation tailored to Khoe instead. As you seem to be
familiar with code points and maybe programming, you may
want to take a look at the i18npool/source/collator/data/ directory that
contains some tailored collations. Anyway, I'm setting the 'needhelp'
keyword, since I'm not familiar with the Khoe language.
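
To illustrate what "tailored" means here, the Python sketch below sorts by
an explicit alphabet instead of by code point. The order string is purely
hypothetical (it is not the Khoe order and not taken from the i18npool data
files), and tailored_key is an invented helper name.

```python
import unicodedata

# HYPOTHETICAL alphabet for illustration only: digits, then a, then the
# three O variants in an order that differs from their code point order.
HYPOTHETICAL_ORDER = "0123456789a\u00d5\u00d4O"


def tailored_key(word, order=HYPOTHETICAL_ORDER):
    """Map each character to its rank in the given alphabet.

    Characters missing from the alphabet sort after all listed ones,
    ordered among themselves by code point.
    """
    order = unicodedata.normalize("NFC", order)
    rank = {ch: i for i, ch in enumerate(order)}
    word = unicodedata.normalize("NFC", word)
    return [rank.get(ch, len(order) + ord(ch)) for ch in word]


words = ["aO5", "a\u00d52", "a\u00d44"]
print(sorted(words, key=tailored_key))
```

With this made-up alphabet the second character always decides before the
trailing digit does, which is the behaviour the reporter is asking for.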

Thanks
  Eike
Comment 6 jdkaye 2007-08-07 17:39:22 UTC
Thanks a lot for your comments, Eike. I guess my age is showing. Back in the days
before unicode (when this project was started) all we had were 255 codes some of
which, of course, were reserved. I guess I never came to grips with unicode and
still use the old-fashioned terminology. In my own defense when I feed these
weird symbols into a hex editor, it does give me a numeric value for the
character in question. As I said, the actual shape of the symbol that appears is
of no interest to me, I only wish to use the code (the one that comes up in a
hex editor) for sorting purposes. Ok, let's say I want to sort using the
ISO-8859-15 encoding. I have a string that looks like this (I don't know if I
can reproduce the encoding) ÔÔ24!ÔÔ24!ie& and would have the following codes in
hex: d4 d4 32 34 21 d4 d4 32 34 21 69 65 26
If I used a UTF8 encoding then things would look a bit different ��24!��24!ie&
and the codes would come out as: 00 00 32 34 21 00 00 32 34 21 69 65 26, that is
all the codes above 7f (hex) come out as 00. Fine, if I can sort, relative to an
encoding, then there is no ambiguity about what actual code I want. I hope this
is clearer.
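
The byte values for the sample string can be checked with the Python sketch
below. The helper names are hypothetical, and the result reflects Python's
codecs rather than any particular hex editor.

```python
def hex_codes(text, encoding):
    """Return the bytes of text in the given encoding as spaced hex."""
    return text.encode(encoding).hex(" ")


def misread_as_utf8(text, encoding="iso-8859-15"):
    """Encode text in a legacy encoding, then wrongly read it as UTF-8.

    Undecodable bytes become U+FFFD, the replacement character.
    """
    return text.encode(encoding).decode("utf-8", errors="replace")


sample = "\u00d4\u00d424!\u00d4\u00d424!ie&"
print(hex_codes(sample, "iso-8859-15"))
print(hex_codes(sample, "utf-8"))
print(misread_as_utf8(sample))
```

In this sketch the UTF-8 form of Ô is the two bytes c3 94, while
ISO-8859-15 bytes above 7f that are read as UTF-8 without conversion show
up as the replacement character.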

Regarding the summary, the collation is not only for Khoe but for the thousands
of languages with no written tradition which may be worked on at some point. A
code based sort based on pure code wrt a given character encoding would have
many other applications as well. I would suggest changing the summary to
something like "General code-based collations"
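
A sort "relative to an encoding" as requested here could look like the
Python sketch below. It only illustrates the idea outside of OOo; the
function name and the default encoding are assumptions, and nothing like
this is claimed to exist in the i18npool collator.

```python
def sort_by_encoded_bytes(words, encoding="iso-8859-15"):
    """Sort strings by their raw byte values in the given encoding.

    Raises UnicodeEncodeError if a word contains a character that the
    chosen encoding cannot represent.
    """
    return sorted(words, key=lambda word: word.encode(encoding))


# O is 0x4f, O-circumflex is 0xd4, O-tilde is 0xd5 in ISO-8859-15,
# so the second character decides and the trailing digits are ignored.
print(sort_by_encoded_bytes(["aO5", "a\u00d52", "a\u00d44"]))
```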
Comment 7 Marcus 2017-05-20 11:13:23 UTC
Reset assignee to the default "issues@openoffice.apache.org".