Apache OpenOffice (AOO) Bugzilla – Full Text Issue Listing

| Summary: | meta-issue for tracking non-baseplane unicode problems | | |
|---|---|---|---|
| Product: | Internationalization | Reporter: | hdu <hdu> |
| Component: | code | Assignee: | AOO issues mailing list <issues> |
| Status: | ACCEPTED | QA Contact: | |
| Severity: | Trivial | | |
| Priority: | P3 | CC: | fonts-bugs, hanya.runo, issues, kamataki, khirano, maho.nakata, mst.ooo, ooo, orw, stephan.bergmann.secondary |
| Version: | OOo 3.0 | Keywords: | CJK, performance |
| Target Milestone: | --- | | |
| Hardware: | All | | |
| OS: | All | | |
| Issue Type: | DEFECT | Latest Confirmation in: | --- |
| Developer Difficulty: | --- | | |
| Issue Depends on: | 41792, 49432, 78162, 102200, 103123, 103308, 124312, 125232, 125257, 74049, 75412, 102920, 105571, 105901, 107468, 113757, 120442 | | |
| Issue Blocks: | | | |
Description
hdu@apache.org
2009-06-19 11:47:18 UTC
added first batch of dependencies

more blocking issues

hi hdu, these kinds of problems don't surprise me at all. and no matter how many of these you solve, it's all too easy to add new ones. have you thought about attacking the root of the problem, and removing ::rtl::OUString::operator sal_Unicode* ? if someone really has a need to access UTF-16 code units as opposed to Unicode characters (say for serialization), make them use a getBuffer method (i think it already exists). ob. quote: "UTF-16 is the devil's work." -- Robert O'Callahan

IMHO an approach based on string iterators would work better, as it would isolate implementation details (such as an internal UTF-16 representation) from their use. E.g. for working on Unicode codepoints one would get a UTF-32 iterator; for encoding conversions (e.g. to Big5) one would get other suitable iterators. By splitting the Unicode string's implementation details (UTF-16) from its interface (specialized string iterators), this could also speed up such performance-critical tasks as XML parsing. XML text is usually encoded as UTF-8, and AFAIK it currently has to be converted to UTF-16 for further processing. By keeping the input's native encoding as an implementation detail, the conversion step, which is costly (from a processing, a memory, and a spinlock-contention perspective), could be avoided altogether.
@ hdu: hmm, your iterator suggestion sounds pretty ideal; but can you maintain binary compatibility with the existing ::rtl::OUString if you change its representation? i haven't dared think about this :) or do you suggest a new string class, with (mostly) the same interface, but without the problematic methods, and with some kind of efficient conversion to/from OUString?

Re "With some automation it should be possible to find code that uses individual sal_Unicodes" see <http://www.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=18462> (from 2006). Re "your iterator suggestion sounds pretty ideal" see the existing rtl::OUString::iterateCodePoints.

@ sb: yes, there is rtl::OUString::iterateCodePoints. but there is also rtl::OUString::operator sal_Unicode*, and that is unfortunately a _lot_ more popular with users of OUString. and hdu actually suggested to have a Unicode string that can internally store any encoding, but externally present only iterator-based interfaces for various encodings; hence my concern about the binary compatibility of such a contraption.

Yes, msgNo=18462 is a good start, as it identified problems in the UNO API. Finding the remaining problems (individual sal_Unicodes) is the other important task. OUString::iterateCodePoints() was a good start too, as it was the first method in the string area which didn't require its users to handle surrogate pairs themselves. The iterator approach I outlined above is IMHO better though, because it could allow zero-conversion and zero-copy access to raw data, such as the performance-critical XML files. The current approach of converting them first to UTF-16 and then using iterateCodePoints() to convert them to UTF-32 does not have that benefit.

I forgot to mention the benefit that specialized iterators could also do such nice things as providing transliteration, Unicode decomposition, pre-composition, digit conversion, etc.
in an orthogonal way.

CC myself

Added CC myself

Reset the assignee to the default "issues@openoffice.apache.org".