LFS Forum - When does Comma Seperated Value not mean...

#1 - Pabs(Sco)

When does Comma Seperated Value not mean...

Mon 31 Oct 2005, 15:49

When does Comma Seperated Value not mean Comma Seperated Value?

When your working on Euro style documents.

I am wrighting a program to read in values from a CSV file and add these values to a structure.

Works fine in English as I am using a ',' as a token to find the data, but I just checked a csv saved on a German OS and the "Comma Separated Value" is a ';'

So when does "Comma Seperated Value" not mean what it says, everywhere but English!!! :hbomb:

#2 - SabersKunk

Mon 31 Oct 2005, 16:52

Don't some euros use commas as decimal point?

#3 - Tweaker

Mon 31 Oct 2005, 16:57

Have no idea what you're talking about Pabs

#4 - the_angry_angel

Mon 31 Oct 2005, 17:13

Hm, thats a bit of a bitch Pabs.....Having never done anything cross language before I've not had to battle with this (yet).

Assuming you're programming on windows, can you not pickup the comma value from the language settings (assuming its reliant on that)?

#5 - Pabs(Sco)

Mon 31 Oct 2005, 17:16

Good idea the_angry_angel picking it up in Widows, never thought of that

Tweak I was just ranting anyway, but at least I have an idea on how to solve it :P

#6 - inCogNito

Mon 31 Oct 2005, 17:37

SabersKunk is right.
in germany for example we use "," as decimal separator (like 3,14159265) and "." for sorting "thoundands" (like 1.000.000).

So it would be pretty inconvienent if you separate decimal numbers by commas

iirc german excel exports to ";" while englisch excel exports to ","

#7 - TagForce

Mon 31 Oct 2005, 17:41

Since when does germany use ; for a ,?

CSV should mean character seperated value... I've seen the weirdest characters used as delimiters.

#8 - hrtburnout

Mon 31 Oct 2005, 17:58

Quote from SabersKunk :Don't some euros use commas as decimal point?

yep, we do that here in The Netherlands

#9 - Anarchi-H

Mon 31 Oct 2005, 18:16



<?php 
php
$file = 'test.csv';
$file = file($file);
$encapsulator = '';
$delimiter = '';

if(is_array($file))
{
    $lastLine = trim($file[count($file)-1]);
    $encapsulator = substr($lastLine, -1, 1);
    echo "Encapsulator is {$encapsulator}<br />";
    $delimPos = strrpos(substr($lastLine, 0, strlen($lastLine)-1), $encapsulator);
    $delimiter = substr($lastLine, $delimPos-1, 1);
    echo "Delimiter is {$delimiter}";
    
}
else
{
    echo 'Error<br />';
    print_r($file);
}


?>

To summarize, it reads the last line of the csv and strips whitespace from the beginning and end. Then assumes the last character to be an encapsulator.
Using the encapsulator, it scans backwards for the next encapsulator (the start of the last field), and uses the character before that as the delimiter.
Obviously, if there is only one column on the last line you wont get teh delimiter because you'll be at the start of the line, so you'd need to check for that and scan the next line up.

The reason you have to start from the end of the file is because you can't guaruntee the encapsulators will be present at the start of the file, but they are always present at the end.

Anyway, I'm pretty sure that should work for what you need, as long as you can translate it in to whatever language you use and add appropriate checks.

#10 - inCogNito

Mon 31 Oct 2005, 23:11

Quote from TagForce :Since when does germany use ; for a ,?

CSV should mean character seperated value... I've seen the weirdest characters used as delimiters.

ask microsoft! i said excel uses ; in CSV files as the delimiter.
Because you can't use commas.

#11 - al heeley

Mon 31 Oct 2005, 23:55

As far as I know, every european country except UK uses commas where we would use decimal points, and vice versa. And Germany tends to group quantities by the hundreds of kilos rather than the rest of the world who uses tonnes. (in my industry anyway

)

#12 - the_angry_angel

Tue 1 Nov 2005, 00:31

Clever bastard Anarchi

It kinda assumes that the encapsulator is 1 character though

(what about CRLF's? or am I missing something obvious there)

Pabs, to be exact, I think the setting you need to pickup from the regional settings would be "list separator" (assuming you dont do it Anarchi's way).

#13 - Anarchi-H

Tue 1 Nov 2005, 05:22

Quote from the_angry_angel :Clever bastard Anarchi

It kinda assumes that the encapsulator is 1 character though

(what about CRLF's? or am I missing something obvious there)

Pabs, to be exact, I think the setting you need to pickup from the regional settings would be "list separator" (assuming you dont do it Anarchi's way).

Trim will ditch CRLFs since they are counted as whitespace at the beginning or end of a string.
Hadn't thought about multiple char encapsulators, but OpenOffice falls over on them, so I'm assuming they are either never, or rarely used.

The main problem I see with using regional settings is , it will work ok if the CSV is created in the same encoding as the OS is running, but what happens when your german pal sends you (in the uk) a csv file to try?
It'll likely fall over since it would assume a comma based on regional settings, but it would of course be a semi colon.

#14 - Anarchi-H

Tue 1 Nov 2005, 07:39

Hmm, appears I was wrong about th encapsulator being a sure fire thing at the end of the last line.

If any rows beforehand have more columns than the last row, the last row will have a separator as the last char, hence giving the wrong results.

It's actually quite a complex problem, which I don't think can be 100% reliably solved by a non-interactive algorithm.

It's definitely given me something to ponder on though

#15 - dUmAsS

Tue 1 Nov 2005, 08:20

Quote from Anarchi-H :It's definitely given me something to ponder on though

help us all

#16 - the_angry_angel

Tue 1 Nov 2005, 09:25

The best solution I've come across is by the guys at planetarion - they dump some stats each game tick and what they used to do was put the delimiter and the seperator in a header (along with version, etc.). This made writing little [web]apps to parse the logs almost infinitely dump version independant.

Unfortunately its certainly not a standard, and I dont think we'd have any luck convincing microsoft to do it in excel, somehow

Another problem is that not all CSV records use an encapsulator for each item. "Bugger". I think you're right, and that the best solution would be to write a program to guess and then present the user with a choice and a small copy of the CSV file. Of course this all goes tits up if you need to run things automagically.

Definately an arse of a problem.

#17 - Anarchi-H

Tue 1 Nov 2005, 12:55

Ok, I think I have (fairly reliably) come up with a solution.

Read the file as one big string.
Copy in to buffer and strip all alphanumeric characters. (/[\w*][\d*]/)
Now process the string line by line and count the occurence of each character (line by line basis)
Now reprocess the occurence data we just collected and count the amount of times the same character appears the same amount of times on each line. (i.e. how many lines have 10 semi-colons on etc)
Then sort the characters by the line count; ignoring it if it is a " or ' and going to the next most frequent
The character at the top would be your delimiter.

Limitations are that malformed csvs wont work if they have a lot of mismatched column counts, and it assumes that either " or ' is an encapsulator (or there isn't one).

My proofing code works on most data, although it still can be fooled, but unless you are specifically trying to trick it, it is fairly reliable.
My code is very PHP specific though. If you want to see it i'll post it, but I presume you are coding in something other than PHP.