What is the best way to open a German csv file with pandas?
I have a German csv file with the columns Datum, Umlaute and Zahlen. My expected output is:
Umlaute Zahlen
Datum
2020-01-01 Rüdiger 1000000.11
2020-01-02 Günther 12.34
2020-01-03 Jürgen 567.89
Sample data is provided below (see File).
df = pd.read_csv('german_csv_test.csv')
This throws a UnicodeDecodeError:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 12: invalid start byte
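To double-check where the error comes from, I decoded the offending byte by hand; 0xfc is 'ü' in Latin-1/cp1252, which explains the message:

```python
# Byte 0xfc (the one from the error message) is 'ü' in Latin-1 and cp1252,
# but it is not a valid start byte in UTF-8:
bad_byte = b'\xfc'
print(bad_byte.decode('latin1'))   # -> ü
print(bad_byte.decode('cp1252'))   # -> ü
try:
    bad_byte.decode('utf-8')
except UnicodeDecodeError as e:
    print(e.reason)                # -> invalid start byte
```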
df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')
This throws no error, but it is far from my desired output:
Datum Umlaute Zahlen
0 01.01.2020 Rüdiger 1.000.000,11
1 02.01.2020 Günther 12,34
2 03.01.2020 Jürgen 567,89
df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')
df['Datum'] = pd.to_datetime(df['Datum'])
df = df.set_index('Datum')
df['Zahlen'] = pd.to_numeric(df['Zahlen'])
Now I have four lines of code, and it still does not work. The last line throws ValueError: Unable to parse string " 1.000.000,11 " at position 0. If I comment out the last line, it runs, but the dates are still wrong, because day and month are switched:
Umlaute Zahlen
Datum
2020-01-01 Rüdiger 1.000.000,11
2020-02-01 Günther 12,34
2020-03-01 Jürgen 567,89
My file german_csv_test.csv looks like this:
Datum;Umlaute;Zahlen
01.01.2020;Rüdiger; 1.000.000,11
02.01.2020;Günther; 12,34
03.01.2020;Jürgen; 567,89
It is encoded as 'cp1252'. I saved it on Windows with the option "CSV (MS-DOS)".
converters = {'Datum': lambda x: pd.to_datetime(x, format='%d.%m.%Y')}
df1 = pd.read_csv('german_csv_test.csv', sep=';', thousands='.', decimal=',', encoding='latin1',
converters=converters, index_col='Datum')
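For reference, here is an end-to-end sketch that recreates the sample file from the question (saved as cp1252, as described there) and reads it back with the parameters above:

```python
import pandas as pd

# Recreate the sample file from the question, encoded as cp1252:
content = (
    "Datum;Umlaute;Zahlen\n"
    "01.01.2020;Rüdiger; 1.000.000,11\n"
    "02.01.2020;Günther; 12,34\n"
    "03.01.2020;Jürgen; 567,89\n"
)
with open('german_csv_test.csv', 'w', encoding='cp1252') as f:
    f.write(content)

converters = {'Datum': lambda x: pd.to_datetime(x, format='%d.%m.%Y')}
df1 = pd.read_csv('german_csv_test.csv', sep=';', thousands='.', decimal=',',
                  encoding='latin1', converters=converters, index_col='Datum')
print(df1['Zahlen'].dtype)   # -> float64
```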
German csv files are tricky: they look fine at first glance, but the data types come out all wrong, and the switch between month and day can be frustrating. The parameters above work for a wide range of European csv files. In the following, I'll explain every parameter.
sep=';'
Almost all German csv files use the semicolon ';' as the separator, and the same holds for most European countries. You could argue that this is wrong, because csv means "comma-separated values". But this is not about right or wrong, it is about convention. Besides, you could say that csv stands for "character-separated values".
thousands='.' and decimal=','
Likewise, most European countries use the dot to group thousands and the comma to separate decimals. This great article explains why.
encoding='latin1'
If you look up German encodings in the Python documentation, you will find the codec 'cp273' for the German language, but it is rarely used. You should be fine with 'latin1' for Western Europe. Using this codec also benefits from an internal optimization in CPython:
CPython implementation detail : Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.
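A quick illustration of why 'latin1' never raises a decode error: it maps every possible byte value to a code point, so any file decodes (though not necessarily to the characters the author intended):

```python
# Latin-1 assigns a character to all 256 byte values, so decoding cannot fail:
every_byte = bytes(range(256))
text = every_byte.decode('latin1')
assert len(text) == 256
print(text[0xfc])   # -> ü
```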
For further reading, look up this SO post and Joel Spolsky's blog.
converters=converters
Converters are underappreciated by most pandas users. They look like a complicated solution to a simple problem: why not just use pd.to_datetime() after reading the file? Because you want to separate reading your input from processing the data (see the IPO model).
I have seen (and written) something like this so many times:
df = pd.read_csv('test.csv')
df['Revenue'] = df['Price'] * df['Quantity'] # I don't have to clean up all columns. I just need the revenue.
(...) # Some other code
# Plotting revenue
df['Revenue'] = df['Revenue'] / 1000
df['Date'] = pd.to_datetime(df['Date']) # Oh, the dates are still strings. I can fix this easily before plotting.
In the next iteration you may move pd.to_datetime() up. But maybe not. And that will probably result in some unexpected behavior. Two months after you wrote this kind of code, you just see a long sequence of unstructured pandas operations and think: "This is a mess."
There are several ways to clean up your dataframe. But why not use the built-in converters? If you define dtypes and converters for every single column of your dataframe, you don't have to look back (in anger). You stand on firm ground after calling pd.read_csv().
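As a sketch of this approach (the column names here are illustrative, not from the original file): every column gets its dtype or converter at read time, so no cleanup is needed afterwards:

```python
import io
import pandas as pd

# Hypothetical sales file in German format; every column is typed on read:
csv = io.StringIO("Date;Price;Quantity\n01.01.2020;9,99;3\n02.01.2020;1.000,00;1\n")
df = pd.read_csv(
    csv, sep=';', thousands='.', decimal=',',
    dtype={'Price': float, 'Quantity': int},
    converters={'Date': lambda x: pd.to_datetime(x, format='%d.%m.%Y')},
)
df['Revenue'] = df['Price'] * df['Quantity']   # numeric dtypes are already right
```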
Be aware that converters only accept functions. That is why I used a lambda function in the converter; otherwise I could not have specified the format parameter.
Learn more about converters in the documentation and in this SO post.
index_col='Datum'
This just defines the index column. It is handy because the alternative df = df.set_index('Datum') is not as pretty. It also helps, like the converters, to separate the input block from the data processing.
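The equivalence is easy to verify (a sketch with inline data):

```python
import io
import pandas as pd

# index_col at read time vs. set_index afterwards - same frame either way:
csv_text = "Datum;Wert\n01.01.2020;1\n02.01.2020;2\n"
a = pd.read_csv(io.StringIO(csv_text), sep=';', index_col='Datum')
b = pd.read_csv(io.StringIO(csv_text), sep=';').set_index('Datum')
assert a.equals(b)   # same result, one line shorter
```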