Apache OpenOffice (AOO) Bugzilla – Issue 103308
HTML import mangles non-BMP unicodes
Last modified: 2023-01-03 09:37:09 UTC
The HTML filter uses the 16-bit type sal_Unicode for all its text processing, so it strips off the most significant bits of Unicode code points beyond the Basic Multilingual Plane (BMP). This results in a mangled import.
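To illustrate the truncation described above, here is a minimal sketch. It assumes the high bits are simply dropped when a code point is stored in a 16-bit unit. `Unicode16` and `truncateToUnit` are hypothetical names standing in for sal_Unicode and the implicit narrowing; they are not taken from the codebase.

```cpp
#include <cstdint>

// Stand-in for sal_Unicode, which this report describes as a 16-bit type.
using Unicode16 = std::uint16_t;

// Hypothetical helper: models what happens when a full Unicode code point
// is stored in a single 16-bit unit. Everything above bit 15 is lost.
constexpr Unicode16 truncateToUnit(std::uint32_t codePoint)
{
    return static_cast<Unicode16>(codePoint & 0xFFFF);
}

// A BMP character survives unchanged.
static_assert(truncateToUnit(0x0041) == 0x0041, "BMP code point is preserved");

// U+1F600 loses its plane bits and turns into U+F600, an unrelated
// Private Use Area code point.
static_assert(truncateToUnit(0x1F600) == 0xF600, "non-BMP code point is mangled");
```

Under this assumption, any code point at or above U+10000 collapses onto some BMP value, which would match the mangled output seen in the attached bugdoc.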
Created attachment 63344 [details] bugdoc
Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in sw/source/filter/html/parcss1.cxx is probably a good starting point.
set target
Reset assignee to the default "issues@openoffice.apache.org".
(In reply to hdu@apache.org from comment #2)
> Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in
> sw/source/filter/html/parcss1.cxx is
> probably a good starting point.

Yes but that's just CSS parsing, the remainder of the HTML parsing is in main/svtools/source/svhtml/parhtml.cxx, which, sadly like most of our codebase, also operates one Unicode code unit at a time, retrieved from SvParser::GetNextChar().

The function

    inline sal_uInt16 GetCharSize() const;

got my hopes up, does it tell us the code point size?

    inline sal_uInt16 SvParser::GetCharSize() const
    {
        return (RTL_TEXTENCODING_UCS2 == eSrcEnc) ? 2 : 1;
    }

No, just the bytes per BMP character for the current encoding, a useless statistic.

SvParser does not have any functions for code points. We'd have to add them and change a lot of code - not just HTML parsing - to use them.
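For reference, a code-point-aware read over UTF-16 code units could look roughly like the sketch below. This is not existing SvParser code: `nextCodePoint` is a hypothetical free function, and it works on a `std::u16string` rather than on the parser's input stream, purely to show the surrogate-pair arithmetic such a function would need.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of SvParser: reads one Unicode code point
// from UTF-16 text starting at pos and advances pos by one or two code units.
inline std::uint32_t nextCodePoint(const std::u16string& text, std::size_t& pos)
{
    assert(pos < text.size()); // caller must not read past the end

    const std::uint32_t first = text[pos++];
    const bool isHighSurrogate = first >= 0xD800 && first <= 0xDBFF;

    if (isHighSurrogate && pos < text.size())
    {
        const std::uint32_t second = text[pos];
        if (second >= 0xDC00 && second <= 0xDFFF)
        {
            // Valid surrogate pair: combine both halves into one code point.
            ++pos;
            return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00);
        }
    }

    // BMP character, or an unpaired surrogate passed through unchanged.
    return first;
}
```

Callers would then have to carry the result in a 32-bit type all the way through, which is where most of the required code changes would come from.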