Apache OpenOffice (AOO) Bugzilla – Issue 68639
use the system locale if document does not provide a character encoding
Last modified: 2013-08-07 14:42:49 UTC
Writer incorrectly displays Cyrillic characters in an .rtf file. In the attached "cyrillic-in-rtf.gif", see how Writer displays them (circled in red) compared with another word processor (circled in green). The problematic "bud.rtf" is attached too. This issue greatly affects our company. Please consider fixing it in 2.0.4. Thanks a lot. WBR, K. Palagin.
Created attachment 38525 [details] screenshot
Created attachment 38526 [details] problematic rtf file
invalid. The document is broken. Even Word Viewer cannot display the document correctly. It is missing the font charset declaration. Change it so that it lists the charset, and both OOo and Word Viewer can display it.
$ diff ~/Desktop/Downloads/bud.rtf charset.rtf
2c2
< {\fonttbl{\f1 Arial}{\f2 Code 128}}
---
> {\fonttbl{\f1 \fcharset204 Arial}{\f2 Code 128}}
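For readers unfamiliar with RTF font tables, the check described above can be sketched roughly as follows. This is an illustration in Python, not OOo code: the function name font_charsets and the table FCHARSET_TO_CODEC are assumptions made for this example, and only a handful of \fcharset values are listed.

```python
import re

# Partial map of RTF \fcharsetN values to Python codec names
# (Windows code pages). Hypothetical helper table, not exhaustive.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    204: "cp1251",  # Russian (Cyrillic)
    238: "cp1250",  # Eastern European
}

# One font entry such as {\f1 \fcharset204 Arial}
FONT_ENTRY = re.compile(r"\{\\f(\d+)([^{}]*)\}")
FCHARSET = re.compile(r"\\fcharset(\d+)")


def font_charsets(font_table: str) -> dict:
    """Return {font number: codec name, or None if no known charset is declared}."""
    result = {}
    for number, body in FONT_ENTRY.findall(font_table):
        match = FCHARSET.search(body)
        codec = FCHARSET_TO_CODEC.get(int(match.group(1))) if match else None
        result[int(number)] = codec
    return result


broken = r"{\fonttbl{\f1 Arial}{\f2 Code 128}}"
fixed = r"{\fonttbl{\f1 \fcharset204 Arial}{\f2 Code 128}}"

print(font_charsets(broken))
print(font_charsets(fixed))
```

In the broken table neither font declares a charset, so an importer has nothing to go on; in the fixed table font 1 maps to code page 1251.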
closing issue. Bug in external application that creates these RTF docs, not in OOo.
cloph, thanks for your analysis. I suspected that the external app is broken, but can't you compensate for other people's deficiencies and display Cyrillic in readable form? Thanks a lot. WBR, K. Palagin.
It turns out that the latest Word Viewer displays them just fine. Please see attached screenshot.
Created attachment 38534 [details] screenshot of worviewer
Word Viewer probably tries the system's locale setting. Try this on a German version of Windows and you get wrong results again.
Created attachment 38599 [details] file opened in wordviewer on german version of windows
Any chance OO can behave in the same way as wordviewer? Thanks. WBR, K. Palagin.
Well, there is the possibility that you file an "Enhancement" issue which states something like "use the system locale if document does not provide a character encoding".
(kpalagin - I am reopening this issue as RFE). Please make OO applications use the system locale if the document does not provide a character encoding. Thanks a lot. WBR, K. Palagin.
reassigned to requirements.
Any chance of implementing this in 2.1?
Maybe in 2.2?
Dear developers, please consider this issue for 2.3. Thanks a lot for your attention.
added myself as cc
It seems that initializing the encoding with the system value before parsing is enough; in case an encoding is available it will be set in SvxRTFParser::NextToken() in the "case RTF_DEFLANG:" part. It needs some testing/checking, but if this is all, I see no problem with fixing it in 2.4. Thanks for bringing this to my attention on the mailing list. We just have too many issues like this one to remember them all. :-)
Mathias, shall we do the same for other text docs without an encoding specified? For example, Word6 docs and HTML come to mind. I am attaching a sample .html, which does not display correctly because of a missing charset specification.
Created attachment 49277 [details] zip of html
Mathias, IMO you should set an encoding in SvxRTFParser::ReadFontTable() because RTF_FCHARSET is not invoked if you open bud.rtf. If you can suggest a "right way" of getting the default charset (not gsl_getSystemTextEncoding() ;) ), I can provide a patch and some QA. Set CC.
Thanks, that was my plan also. And this indirectly answers Kyrill's last question: the fix will be specific to the RTF filter. The "character set" information is relevant only at the time of import; after that, in memory, all strings are Unicode strings anyway. So if other filters also fail to import documents with a broken charset declaration in the expected way, it has to be fixed in those filters as well.
Mathias, what do you think about using the OOo locale (Application::GetSettings().GetLocale()) to choose the charset for such documents? This requires some additional effort to prepare a suitable conversion table for each supported locale but gives additional flexibility: you don't need to change the system locale and restart OOo (just re-open the document) to apply changes. The proposed conversion table can be used at least in the ww8 and excel filters. WBR, Rail Aliev
As I don't have a Russian Windows version: Kyrill, can you tell me what would be the right encoding for the "bud.rtf" document? Then I can test where I have to set the encoding.
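The idea of initialising the encoding before parsing and letting a declaration in the document override it can be sketched as below. This is a minimal Python illustration, not the svx implementation. Using \ansicpg as the declaration is an assumption of this sketch (the comments above mention RTF_DEFLANG and RTF_FCHARSET as the tokens involved in the real filter), and the names import_encoding and decode_escapes are hypothetical.

```python
import re

ANSICPG = re.compile(r"\\ansicpg(\d+)")           # e.g. \ansicpg1251
HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")   # e.g. \'cf


def import_encoding(rtf: str, fallback: str) -> str:
    """Pick the codec used at import time."""
    encoding = fallback                 # set before parsing starts
    declared = ANSICPG.search(rtf)
    if declared:                        # a declaration in the document wins
        encoding = "cp" + declared.group(1)
    return encoding


def decode_escapes(rtf: str, encoding: str) -> str:
    """Decode the \\'xx byte escapes; the result is a plain Unicode string."""
    raw = bytes(int(h, 16) for h in HEX_ESCAPE.findall(rtf))
    return raw.decode(encoding)


# No charset information at all, as in the broken documents.
undeclared = r"{\rtf1 \'cf\'f0\'e8\'e2\'e5\'f2}"
declared = r"{\rtf1\ansi\ansicpg1252 \'e4}"

text = decode_escapes(undeclared, import_encoding(undeclared, "cp1251"))
print(text)
print(import_encoding(declared, "cp1251"))
```

Once decode_escapes has returned, the codec no longer matters, which matches the remark that the character set is only relevant at import time.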
Mathias, it should be RTL_TEXTENCODING_MS_1251
Attached is a prototype which uses OOo locale to "guess" needed encoding. It works for "bud.rtf" if you switch OOo's locale to Russian (no langpack needed) and reopen the document.
Thanks, that looks fine. At least to me - the characters are looking somewhat Cyrillic now. :-) I will let one of my Russian colleagues check the result.
Created attachment 49311 [details] Prototype
Mathias, in which CWS? I volunteer for QA. :)
The CWS is called mba24issues01. The only file I changed for the fix is svx/source/svrtf/svxrtf.cxx.
forgot to change owner :-)
Thank you very much!!! I have filed http://www.openoffice.org/issues/show_bug.cgi?id=83194 for HTML filter. Please consider that for 2.4 too.
Mathias, it should work on Windows but not on Unix. Try to run OOo with en_US.UTF-8, ru_RU.UTF-8 (or another UTF-8 locale) and you get a "wrong format" error and an empty file. If you run with a non-UTF-8 locale, like ISO-8859-1, KOI8-R or CP1251, the file opens, but it displays correctly only with CP1251, which is very rarely used these days. I suggest not using osl_getThreadTextEncoding() (which depends on the generated libc locales on GNU/Linux, for example) and using the proposed AllSettings::GetDefaultTextEncoding() (with changes, of course) for MS-dependent formats.
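The general problem with using a Unix system encoding as the fallback can be illustrated with a few bytes. This is a Python sketch under the assumption that the document bytes are CP1251, as stated for bud.rtf; it does not reproduce OOo's actual "wrong format" error path, and try_decode is a hypothetical helper.

```python
# "Привет" encoded in CP1251, the encoding named for bud.rtf above.
data = bytes([0xCF, 0xF0, 0xE8, 0xE2, 0xE5, 0xF2])
expected = data.decode("cp1251")


def try_decode(raw: bytes, codec: str):
    """Return the decoded text, or None if the bytes are invalid in that codec."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


for codec in ("utf-8", "koi8_r", "latin-1", "cp1251"):
    result = try_decode(data, codec)
    status = "ok" if result == expected else "wrong or failed"
    print(codec, repr(result), status)
```

With UTF-8 the bytes are not even valid input; with KOI8-R and ISO-8859-1 they decode, but to the wrong characters. Only CP1251 gives the intended text.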
Your code looked like a Windows-only solution also (as it uses _MS_...). The hack we are talking about only makes sense if Office and tool are running on the same machine, in all other cases it is just luck. The obviously broken tool that created the rtf documents seems to be a Windows app. If it was a Linux app using osl_getThreadTextEncoding would be fine if OOo on the same machine was used.
> Your code looked like a Windows-only solution also (as it uses _MS_...).
But it works on Linux and Windows. ;) I tested it on both platforms. It is proposed for use with MS formats only.
> The hack we are talking about only makes sense if Office and tool are running on the same machine, in all other cases it is just luck.
You mean the osl_getThreadTextEncoding() based solution? ;) Yes, it's luck. My suggestion is not to forget about users "without luck" (Unix users) and not to use the system locale for MS based formats (RTF, DOC, XLS).
So your assumption is that the broken tool is one that runs on Windows and so we should only consider Windows encodings? You don't want to have a fix for the opposite case, where the tool runs on Linux and uses the system encoding on that machine? Well, that indeed is supported by your proposal. In that case we can perhaps build upon the Windows implementation of osl_getTextEncodingFromLocale.
Thinking a bit about your patch: GetDefaultEncoding() in the way you proposed it should be a private method of the RTFParser. This way of detecting an encoding is a special hack for Windows tools based on MS file formats, so we shouldn't move that code to VCL. The proper default encoding should be what you get from osl_getThreadTextEncoding or osl_getTextEncodingFromLocale. Our hack is only for this special case, and it fails for the same broken tools located on Linux. Besides that, I would be more confident if we had a clear explanation of which encoding should be chosen for which locale. On Windows osl_getTextEncodingFromLocale works fine at least for the "uk", "tr" and "ru" locales, and I think it should be fine to use the same method for other locales. But I currently don't know what implementation is used on other platforms.
Mathias, I agree with you that osl_getThreadTextEncoding is more generic than using cp125x locales. I have a collection of RTF, XLS and DOC files with a wrong or unknown encoding, and all of them are cp125x based; so far there are no ISO-8859-x or other files. IMO, the best and most flexible solution would be to create a new config item with a fallback encoding for such documents and set it to cp125x (depending on locale) by default. The user may change it via Tools-Options. I used vcl because that function depends on the OOo config, but it can be moved anywhere where it can be shared by sc, sc and maybe binfilter.
Thank you very much for this enhancement! When would this CWS be integrated into master?
I hope that the CWS can be handed over to QA as soon as they have finished their work on all UI/help relevant CWS that need to be integrated until UI freeze. So perhaps in a week or so.
@rail: I forgot to add that I accepted your way to detect the locale. For rtf it makes sense to do it "Windows like". I didn't add this to the Settings class though, it should stay an RTF specific function. Thanks for the patch.
As I don't have a Russian OS I would like to ask for testing. Some remarks about doc and xls: these are different filters and should be treated in different issues. If the filter developers accept the same treatment, we can move the currently local function in the rtf filter somewhere else.
Tried m239 on English WinXP with the Russian locale (see attached screenshots) - the attached "bud.rtf" still does not open in Russian. I guess this issue should be reopened then?
Created attachment 50331 [details] Regional options
Created attachment 50332 [details] Regional options
@Kirill: mba24issues01 is not integrated yet (http://tools.services.openoffice.org/EIS2/cws.ShowCWS?logon=true&Id=5912&Path=SRC680%2Fmba24issues01). @Mathias: I think we should add other languages to the lang->ms_encoding table too. What do you think about adding such a table to i18npool?
It's still not in the master. Sorry, I should have mentioned that. The Child Workspace should become "Ready for QA" on Monday. @rail: extending the table surely is a good idea. For the moment we can leave it in svx. This is Windows specific, so I hesitate to put it into i18n. Maybe Eike can give his opinion about this. Eike: if we wanted to have a table that maps ISO locales to ISO charsets the way Windows does, would that be something we can put into i18n, or should we start thinking about a collection of MS specific things (I remember the date coding, the storage format tricks etc.)? The idea of such a table would be to treat documents in MS specific formats better in case the application creating them "forgot" to specify the locale. The documents work well with MS Office because it falls back to the Windows system locale. The idea is to have this fallback to the Windows locale for OOo on all(!) platforms as well.
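A lang->ms_encoding table of the kind discussed here could look roughly like the following Python sketch. It is not the table from the patch: the name LANG_TO_MS_CODEC, the function ms_fallback_codec, the choice of cp1252 as the default and the selection of languages are all assumptions made for this illustration. The individual language/code page pairs follow the usual Windows ANSI code page assignments.

```python
# Hypothetical, partial table: language code -> Windows "ANSI" code page
# (given as a Python codec name).
LANG_TO_MS_CODEC = {
    "ru": "cp1251", "uk": "cp1251", "bg": "cp1251",  # Cyrillic
    "pl": "cp1250", "cs": "cp1250", "hu": "cp1250",  # Central European
    "el": "cp1253",                                   # Greek
    "tr": "cp1254",                                   # Turkish
}
DEFAULT_MS_CODEC = "cp1252"  # Western European; assumed default for this sketch


def ms_fallback_codec(locale_name: str) -> str:
    """Map a locale such as 'ru_RU.UTF-8' or 'tr-TR' to a Windows code page.

    Only the language part is used, so the result does not depend on the
    charset part of a Unix locale name.
    """
    lang = locale_name.replace("-", "_").split("_")[0].split(".")[0].lower()
    return LANG_TO_MS_CODEC.get(lang, DEFAULT_MS_CODEC)


for name in ("ru_RU.UTF-8", "uk_UA", "tr-TR", "de_DE"):
    print(name, "->", ms_fallback_codec(name))
```

Because the lookup ignores the charset suffix, a ru_RU.UTF-8 system and a Russian Windows system would both fall back to code page 1251, which is the "same behaviour on all platforms" goal described above.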
If someone writes me an issue about the XLS filters, please remember to add a link to this issue ;-)
@mba: I think adding such locale -> text encoding information would be best suited somewhere near rtl_TextEncodingInfo and some rtl_getTextEncodingFrom...() function, see sal/inc/rtl/tencinfo.h. 'sb' might have some opinion/suggestion/objection.
Stefan, for testing you will need an operating system with a Russian locale. The attached document should show Cyrillic letters then.
Verified in CWS mba24issues01 on Windows and Linux. My test scenario:
(1) The intuitive case: With the system locale set to Russian (on Linux, I tried ru_RU), the file opens fine. Note: no native Russian Windows is needed for this; I tried with English WinXP.
(2) The workaround for non-Russian locales: With another locale set, the file opens "not good", same as before. But then: Tools - Options - Language Settings - Languages -> set "Locale setting" to "Russian", OK; File - Reload -> voila, readable.
OK in OOH680_m7. OK in SRC680_m247. Closed.