Apache OpenOffice (AOO) Bugzilla – Issue 78926
Might apply some magic to CSV import with broken non-escaped quotes in fields
Last modified: 2017-05-20 11:13:13 UTC
I import a CSV file (input) with some columns, make some changes in calc and then export the files as csv again. In the second file (output) I have a lot of empty columns at the end of the real columns. it seems calc exports also the emty columns of my table. Also if I made no changes on the file, when I export the file to csv again it has a great number of empty column.
Created attachment 46282 [details] Input csv file
Created attachment 46284 [details] Output file
1. Your input file is broken. If a quoted field contains a quote the software creating the file escaped the quote using a backslash (\"), which is wrong and may result in data fields being misaligned. Contained quotes must be escaped by doubling them (""). 2. In case an odd number of wrongly escaped quotes is contained within a field it depends on the context how the line is read. For example see row 774 where data from line 774 (starting with "773";"1";"BIGODINI RISCALDANTI \"BA";...) is followed by data from line 774 and following lines until a pairing quote is matched, resulting in a long row up to cell IQ774, which upon export results in that many columns. 3. However, this special case should not result in a line concatenation because the preceding fields up to AI774 were recognized and aligned correctly the same as in other lines that contain an even number of broken quotes. [hey, good piece of software, isn't it? Your data is broken and still it is read almost correctly ;-) except that non-doubled quotes are lost in all fields] 4. I take this issue only to investigate what the special case here exactly is and maybe find some magic to apply. 5. Remember, your input file is broken.
*** Issue 76874 has been marked as a duplicate of this issue. ***
as the reporter to the issue 76874 - which is closed for being a dublicate of this issue (see the number:cool stuff!) - I want to find out, whether it is really the same defect or not 1. The feature that the textdelimiter is constructed from quote/doublequote+colon/semicolon as a must, is a new invention to OO.o That way is ISO- or anythingelse standard? Okay, but it is not very professional to invent something against the data, that can be found in the real world, one has to deal with that. 2. however, that feature is misbehaving anyway. In case of broken import, OO.o should detect the error for the very record/import that is incomplete, and continue after that. But it seem OO.o has not any error-routine for CSV-import. hallo er: right now the main problem is the import-function, and that is defect - I hope you will deal it with not as an academic but as a real problem. Martin
> 1. The feature that the textdelimiter is constructed from > quote/doublequote+colon/semicolon as a must, is a new invention to OO.o > That way is ISO- or anythingelse standard? Okay, but it is not very professional > to invent something against the data, that can be found in the real world, one > has to deal with that. This is simply not true. The behavior regarding quotes is not an invention of OOo, nor is a quote a must if the delimiter or a quote is not present in the field content. However, an embedded quote _must_ be escaped by doubling it. Again, this is not an invention of OOo, this is common CSV handling, and no it's not ISO or any standard, just as CSV is not standardized at all. But there is an RFC 4180 now. You may want to read more about the CSV format at http://www.creativyst.com/Doc/Articles/CSV/CSV01.htm http://tools.ietf.org/html/rfc4180
well er, this is a sort of discussion .. and maybe it is all sort of a misunderstanding. I just learnd, that the CSV-import for my "wrong-build" file is working. The trick/the solution: delete the default text-delimiter from the list-box. In that case the field-delimiter is the ;-semicolon only and all records will be imported. The "-doublequote will not be detected (as expected !) and can be searched and deleted later. So the misunderstanding is, that the behavior changed and that former existing "no Text-delimiter" from the listbox has to be removed manually. Regarding the "start-stop"-function of the text-delimiter ' or " you are right. But that was not the problem. The problem I see is, that the error-handling is not functioning in a proper manner. OO.o should not stop import csv-lines when it sees a logical error (colum-length, delimiter-error) until the file is finished - and than return a error-message. This would help with this issue. to solve my issue 76874 a selection "no delimiter" on the import-dialog would had helped the most. Martin
*** Issue 80385 has been marked as a duplicate of this issue. ***
Another test case from issue 80385: ,"abc" d "ef", currently results in 'abc d "ef"' To not lose data it should result in 'abc" d "ef' Doing so would also lead to ,"a"b, "a", resulting in _one_ field 'a"b, "a' and not two, 'ab' and ' "a"' like it is currently the case. This would then differ from how Excel treats it, but would be more consistent. See also test case documents attached to issue 80385: http://www.openoffice.org/nonav/issues/showattachment.cgi/47340/test1.csv http://www.openoffice.org/nonav/issues/showattachment.cgi/47341/test2.csv http://www.openoffice.org/nonav/issues/showattachment.cgi/47358/test3.csv http://www.openoffice.org/nonav/issues/showattachment.cgi/47359/test4.csv
How should this be handled: ...,"abc" d "ef",... IS the FIELD actually: 1. 'abc" d "ef' AND the application forgot to double the QUOTES (valid assumption) I believe, this is indeed the best solution. Only quotes surrounding the full string (i.e. preceded or followed by the FIELD-Delimiter) should be viewed as ESCAPING-Quotes, everything else should be considered as an ordinary string. [Matching all quotes, as discussed in issue 80385 is another consistent way to recover these broken csv-files.]
*** Issue 80683 has been marked as a duplicate of this issue. ***
@ er I treat dataloss as not acceptable in no way even if "," is rather outdated, you should trust, programs which formed that did know well, how to deal with ' and " If you find new files, it must not be that the new programs work with or even ment to use that behaviour. You need to understand that. one of the ' or " can be the starting/ending delimiter, the other one can be in those braketts in an uncounted number. that is correct. in the example , colon is th only delimiter: ,"abc" d "ef", so for that the result needs to be: ,'"abc" d "ef"', or the delimiter should be substituted be something more reliable like ;"abc" d "ef"; Martin
*** Issue 91825 has been marked as a duplicate of this issue. ***
*** Issue 103939 has been marked as a duplicate of this issue. ***
Continuing the discussion from issue 103939: Actually implementing something like the description of the bEmbeddedLineBreak parameter to ReadCsvLine might already be enough. The current implementation is simpler than described.
*** Issue 84048 has been marked as a duplicate of this issue. ***
*** Issue 107867 has been marked as a duplicate of this issue. ***
I'm observing a bunch of things that I was able to isolate into some sample files, which I'll attach. I'm running OOo 3.2.1 (OOO320m18 Build:9502) on Microsoft Windows 7. I'm using comma (,) as my field separator and double-quote (") as my text delimiter. Here is a summary of the things I discovered: 1. Using an unescaped text delimiter within a field causes the CSV parser to a) include the " in the field b) start searching for second unpaired " later in the file c) include the closing " in the field 2. #1 only occurs if a field before it on the same line was quoted 3. After #1 occurs, it's possible OOo will attempt to read /lots/ of data into a single field which results in the error "the maximum number of rows has been exceeded"
Created attachment 75189 [details] Truncates data without warning (comma field separator, double-quote text delimiter)
Created attachment 75191 [details] Inconsistent text delimiter behavior (comma field separator, double-quote text delimiter)
Created attachment 75192 [details] Confusing "maximum rows exceeded" error (comma field separator, double-quote text delimiter)
*** Issue 117612 has been marked as a duplicate of this issue. ***
Comment on attachment 75192 [details] Confusing "maximum rows exceeded" error (comma field separator, double-quote text delimiter) Attachment 75192 [details] is exactly the bogus error message I just expended lots of energy tracking down. An error message pointing out "unbalanced single/double quote on line 858" would have saved me hours of troubleshooting. Since my file was tab-delimited, unbalanced quote somewhere in the file had not occurred to me. Thank you for your time.
Resolved in LibreOffice 3.6: https://bugs.freedesktop.org/show_bug.cgi?id=48621
(In reply to comment #25) > Resolved in LibreOffice 3.6: > https://bugs.freedesktop.org/show_bug.cgi?id=48621 Confirmed fixed in 3.6.4.2 on Mac OS X: - Attachment #75189 [details] - Attachment #75191 [details] - Attachment #75192 [details] I still get an error loading the original "input" (attachment #46282 [details]): "The data could not be loaded completely because the maximum number of characters per cell was exceeded".
(In reply to comment #26) > Confirmed fixed in 3.6.4.2 on Mac OS X: Sorry, I meant to say "fixed in LibreOffice.org 3.6.4.2 on Mac OS X". I have not yet tested OOo.
Reset assigne to the default "issues@openoffice.apache.org".