Issue 103308 - HTML import mangles non-BMP unicodes
Summary: HTML import mangles non-BMP unicodes
Status: CONFIRMED
Alias: None
Product: Writer
Classification: Application
Component: open-import
Version: OOO310m14
Hardware: All All
Importance: P3 Trivial
Target Milestone: ---
Assignee: AOO issues mailing list
QA Contact:
URL:
Keywords:
Depends on:
Blocks: 102943
Reported: 2009-07-03 08:30 UTC by hdu@apache.org
Modified: 2023-01-03 09:37 UTC
CC List: 2 users

See Also:
Issue Type: DEFECT
Latest Confirmation in: 4.2.0-dev
Developer Difficulty: ---


Attachments
bugdoc (1.43 KB, text/html)
2009-07-03 08:33 UTC, hdu@apache.org

Description hdu@apache.org 2009-07-03 08:30:23 UTC
The HTML filter uses the 16-bit type sal_Unicode for all its text processing, so it strips off the
most significant bits of Unicode code points beyond the Basic Multilingual Plane (BMP). This results in a mangled import.
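
To illustrate the truncation described above, here is a minimal sketch. It is not code from the filter: Unit16 is a hypothetical stand-in for sal_Unicode, and the helper names are invented for this example. It contrasts cutting a code point down to 16 bits with encoding it as a UTF-16 surrogate pair.

```cpp
#include <cstdint>

// Hypothetical stand-in for sal_Unicode (assumed here to be an unsigned 16-bit type).
using Unit16 = std::uint16_t;

// What effectively happens when a code point above U+FFFF is stored in 16 bits:
// the upper bits are dropped and an unrelated BMP value remains.
constexpr Unit16 truncateTo16(std::uint32_t codePoint)
{
    return static_cast<Unit16>(codePoint & 0xFFFFu);
}

// True if the code point lies outside the BMP and needs two UTF-16 code units.
constexpr bool needsSurrogates(std::uint32_t codePoint)
{
    return codePoint >= 0x10000u;
}

// First (high) code unit of the surrogate pair; only valid if needsSurrogates() is true.
constexpr Unit16 highSurrogate(std::uint32_t codePoint)
{
    return static_cast<Unit16>(0xD800u + ((codePoint - 0x10000u) >> 10));
}

// Second (low) code unit of the surrogate pair; only valid if needsSurrogates() is true.
constexpr Unit16 lowSurrogate(std::uint32_t codePoint)
{
    return static_cast<Unit16>(0xDC00u + ((codePoint - 0x10000u) & 0x3FFu));
}
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) truncates to 0xD11E, whereas a lossless import would keep it as the pair 0xD834 0xDD1E.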
Comment 1 hdu@apache.org 2009-07-03 08:33:06 UTC
Created attachment 63344
bugdoc
Comment 2 hdu@apache.org 2009-07-03 08:38:13 UTC
Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in sw/source/filter/html/parcss1.cxx is 
probably a good starting point.
Comment 3 openoffice 2009-07-03 08:54:58 UTC
set target
Comment 4 Marcus 2017-05-20 11:18:15 UTC
Reset assignee to the default "issues@openoffice.apache.org".
Comment 5 damjan 2023-01-03 09:37:09 UTC
(In reply to hdu@apache.org from comment #2)
> Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in
> sw/source/filter/html/parcss1.cxx is 
> probably a good starting point.

Yes, but that's just CSS parsing. The remainder of the HTML parsing is in main/svtools/source/svhtml/parhtml.cxx, which, sadly like most of our codebase, also operates one 16-bit code unit at a time, retrieved from SvParser::GetNextChar().

The function
inline sal_uInt16 GetCharSize() const;
got my hopes up: does it tell us the code point size?

inline sal_uInt16 SvParser::GetCharSize() const
{
    return (RTL_TEXTENCODING_UCS2 == eSrcEnc) ? 2 : 1;
}

No, it just returns the number of bytes per BMP character for the current source encoding (2 for UCS-2, otherwise 1), which is a useless statistic here.

SvParser does not have any functions for code points. We'd have to add them and change a lot of code - not just HTML parsing - to use them.
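
As a rough idea of what such a code point function could look like, here is a minimal sketch. Everything in it is hypothetical: the Decoded struct and the decodeAt() helper are made up for illustration and are not part of SvParser, and char16_t stands in for sal_Unicode. It combines a surrogate pair from a buffer of 16-bit code units into one code point and reports how many units were consumed. Passing a lone surrogate through unchanged is an assumption of this sketch, not a decision made in this issue.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type: the decoded code point and the number of
// 16-bit code units it occupied in the buffer (1 or 2).
struct Decoded
{
    std::uint32_t codePoint;
    std::size_t   units;
};

constexpr bool isHighSurrogate(char16_t unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
constexpr bool isLowSurrogate(char16_t unit)  { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Decode the code point starting at text[pos]. Precondition: pos < length.
// A high surrogate followed by a low surrogate is combined into one code point;
// anything else, including a lone surrogate, is returned as a single unit.
constexpr Decoded decodeAt(const char16_t* text, std::size_t length, std::size_t pos)
{
    const char16_t first = text[pos];
    if (isHighSurrogate(first) && pos + 1 < length && isLowSurrogate(text[pos + 1]))
    {
        const std::uint32_t high = static_cast<std::uint32_t>(first) - 0xD800u;
        const std::uint32_t low  = static_cast<std::uint32_t>(text[pos + 1]) - 0xDC00u;
        return Decoded{0x10000u + (high << 10) + low, 2};
    }
    return Decoded{static_cast<std::uint32_t>(first), 1};
}
```

A caller would advance its position by the returned units instead of always stepping by one, which is the kind of change that would have to ripple through every user of the parser.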