103308 – HTML import mangles non-BMP unicodes

Summary: HTML import mangles non-BMP unicodes

Status:	CONFIRMED

Alias:	None

Product:	Writer
Classification:	Application
Component:	open-import
Version:	OOO310m14
Hardware:	All All

Importance:	P3 Trivial
Target Milestone:	---
Assignee:	AOO issues mailing list
QA Contact:

URL:
Keywords:

Depends on:
Blocks:	102943

Reported:	2009-07-03 08:30 UTC by hdu@apache.org
Modified:	2023-01-03 09:37 UTC
CC List:	2 users

See Also:
Issue Type:	DEFECT
Latest Confirmation in:	4.2.0-dev
Developer Difficulty:	---

Attachments
bugdoc (1.43 KB, text/html) 2009-07-03 08:33 UTC, hdu@apache.org	no flags

Description hdu@apache.org 2009-07-03 08:30:23 UTC

The HTML filter uses the 16-bit type sal_Unicode for all its text processing, so it strips off the 
most significant bits of Unicode code points beyond the Basic Multilingual Plane (BMP). This results in a mangled import.
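
A minimal sketch of the truncation described above, not taken from the filter code. It assumes sal_Unicode behaves like a plain 16-bit unsigned integer (modelled here as the stand-in type Char16), and the helper names truncateTo16, highSurrogate and lowSurrogate are hypothetical. It contrasts the lossy cast with the lossless UTF-16 surrogate pair encoding:

```cpp
#include <cstdint>

// Stand-in for the 16-bit sal_Unicode type (assumption for this sketch).
using Char16 = std::uint16_t;

// Lossy: keeps only the low 16 bits of a code point, as happens when a
// non-BMP code point is stored in a single 16-bit character.
constexpr Char16 truncateTo16(std::uint32_t cp)
{
    return static_cast<Char16>(cp & 0xFFFFu);
}

// Lossless: the two UTF-16 code units that together represent a
// code point in the range U+10000..U+10FFFF.
constexpr Char16 highSurrogate(std::uint32_t cp)
{
    return static_cast<Char16>(0xD800u + ((cp - 0x10000u) >> 10));
}

constexpr Char16 lowSurrogate(std::uint32_t cp)
{
    return static_cast<Char16>(0xDC00u + ((cp - 0x10000u) & 0x3FFu));
}
```

For U+1D11E (MUSICAL SYMBOL G CLEF), truncation yields U+D11E, an unrelated Hangul syllable, while the surrogate pair is D834 DD1E.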

Comment 1 hdu@apache.org 2009-07-03 08:33:06 UTC

Created attachment 63344
bugdoc

Comment 2 hdu@apache.org 2009-07-03 08:38:13 UTC

Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in sw/source/filter/html/parcss1.cxx is 
probably a good starting point.

Comment 3 openoffice 2009-07-03 08:54:58 UTC

set target

Comment 4 Marcus 2017-05-20 11:18:15 UTC

Reset assignee to the default "issues@openoffice.apache.org".

Comment 5 damjan 2023-01-03 09:37:09 UTC

(In reply to hdu@apache.org from comment #2)
> Fixing the method "sal_Unicode CSS1Parser::GetNextChar()" in
> sw/source/filter/html/parcss1.cxx is 
> probably a good starting point.

Yes, but that's just CSS parsing. The remainder of the HTML parsing is in main/svtools/source/svhtml/parhtml.cxx, which, sadly like most of our codebase, also operates one Unicode code unit at a time, retrieved from SvParser::GetNextChar().

The function
inline sal_uInt16 GetCharSize() const;
got my hopes up: does it tell us the code point size?

inline sal_uInt16 SvParser::GetCharSize() const
{
    return (RTL_TEXTENCODING_UCS2 == eSrcEnc) ? 2 : 1;
}

No, it only returns the number of bytes per BMP character for the current encoding, a useless statistic here.

SvParser does not have any functions for code points. We'd have to add them and change a lot of code - not just HTML parsing - to use them.
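
As a rough illustration of what a code point function could look like, here is a sketch that reads one code point from a buffer of UTF-16 code units. None of these names exist in SvParser: isHighSurrogate, isLowSurrogate, combineSurrogates and nextCodePoint are hypothetical, and char16_t stands in for sal_Unicode.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers, not part of SvParser.
constexpr bool isHighSurrogate(char16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

constexpr bool isLowSurrogate(char16_t u)
{
    return u >= 0xDC00 && u <= 0xDFFF;
}

// Combines a valid surrogate pair into one code point (U+10000..U+10FFFF).
constexpr std::uint32_t combineSurrogates(char16_t hi, char16_t lo)
{
    return 0x10000u
         + ((static_cast<std::uint32_t>(hi) - 0xD800u) << 10)
         + (static_cast<std::uint32_t>(lo) - 0xDC00u);
}

// Reads the code point starting at units[pos] and advances pos by one or
// two code units. Precondition: pos < len.
inline std::uint32_t nextCodePoint(const char16_t* units, std::size_t len,
                                   std::size_t& pos)
{
    const char16_t first = units[pos++];
    if (isHighSurrogate(first) && pos < len && isLowSurrogate(units[pos]))
        return combineSurrogates(first, units[pos++]);
    // BMP character, or an unpaired surrogate passed through unchanged.
    return first;
}
```

A caller would loop with nextCodePoint until pos reaches len. The hard part noted above remains: every consumer that currently stores the result in a single sal_Unicode would need to handle 32-bit values or surrogate pairs.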