How to open a German csv file with pandas?

Question

What is the best way to open a German csv file with pandas?

I have a German csv file with the following columns:

  • Datum: Date in the format 'DD.MM.YYYY'
  • Umlaute: German names with special characters specific to the German language
  • Zahlen: Numbers in the format '000.000,00'

My expected output is:

            Umlaute      Zahlen
Datum                          
2020-01-01  Rüdiger  1000000.11
2020-01-02  Günther       12.34
2020-01-03   Jürgen      567.89

Sample data is provided below (see File).


1st attempt: Use pd.read_csv() without parameters

    df = pd.read_csv('german_csv_test.csv')

This throws a UnicodeDecodeError:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 12: invalid start byte

2nd attempt: Use pd.read_csv(), specifying the encoding and the separator

  df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')

This throws no error, but it is far from my desired output:

  • The dates are strings, not datetimes.
  • The numbers aren't floats, but objects.
  • The column 'Datum' is not the index.
        Datum  Umlaute          Zahlen
0  01.01.2020  Rüdiger   1.000.000,11 
1  02.01.2020  Günther          12,34 
2  03.01.2020   Jürgen         567,89 

3rd attempt: Cleaning up

df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')
df['Datum'] = pd.to_datetime(df['Datum'])
df = df.set_index('Datum')
df['Zahlen'] = pd.to_numeric(df['Zahlen'])

Now I have four lines of code, and it still does not work. The last line throws ValueError: Unable to parse string " 1.000.000,11 " at position 0. If I comment out the last line, the code runs, but the dates are still wrong, because day and month are switched.

            Umlaute          Zahlen
Datum                              
2020-01-01  Rüdiger   1.000.000,11 
2020-02-01  Günther          12,34 
2020-03-01   Jürgen         567,89 
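For reference, the two remaining problems of this attempt can also be fixed by hand. The sketch below inlines the sample data from the File section (so it is self-contained), parses the dates with an explicit format, and strips the German number formatting with string methods before converting:

```python
import io

import pandas as pd

# Sample data matching the file described below (cp1252/latin1 in the
# original file; plain text here for a self-contained sketch).
data = (
    "Datum;Umlaute;Zahlen\n"
    "01.01.2020;Rüdiger; 1.000.000,11 \n"
    "02.01.2020;Günther; 12,34 \n"
    "03.01.2020;Jürgen; 567,89 \n"
)

df = pd.read_csv(io.StringIO(data), sep=';')

# An explicit format avoids the day/month swap.
df['Datum'] = pd.to_datetime(df['Datum'], format='%d.%m.%Y')
df = df.set_index('Datum')

# Strip padding, drop the thousands dots, turn the decimal comma
# into a dot, then convert to float.
df['Zahlen'] = pd.to_numeric(
    df['Zahlen']
    .str.strip()
    .str.replace('.', '', regex=False)
    .str.replace(',', '.', regex=False)
)
print(df)
```

This works, but it is more code than necessary; the solution below pushes all of this cleanup into pd.read_csv() itself.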

File

My file german_csv_test.csv looks like this:

Datum;Umlaute;Zahlen
01.01.2020;Rüdiger; 1.000.000,11 
02.01.2020;Günther; 12,34 
03.01.2020;Jürgen; 567,89 

It is encoded as 'cp1252'. I saved it on Windows with the option "CSV (MS-DOS)".

Solution

    import pandas as pd

    converters = {'Datum': lambda x: pd.to_datetime(x, format='%d.%m.%Y')}
    df1 = pd.read_csv('german_csv_test.csv', sep=';', thousands='.', decimal=',', encoding='latin1',
                      converters=converters, index_col='Datum')

German csv files are tricky: they look fine at first glance, but the data types are all wrong, and the silent switch between day and month can be frustrating. The parameters above work for a wide range of European csv files. In the following, I'll explain every parameter.

Parameter sep=';'

Almost all German csv files use the semicolon ';' as the separator, and the same holds for most European countries. You could argue that this is wrong, because csv means "comma-separated values". But this is not about right or wrong, it is about convention. Besides, you could say that csv stands for "character-separated values".
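If you are not sure which separator a file uses, the standard library's csv.Sniffer can often detect it. A small sketch (not part of the original answer), using the first lines of the sample file:

```python
import csv

# The semicolon occurs a consistent number of times per line,
# while the comma (inside the numbers) does not, so the Sniffer
# picks ';' as the delimiter.
sample = "Datum;Umlaute;Zahlen\n01.01.2020;Rüdiger; 1.000.000,11 \n"
dialect = csv.Sniffer().sniff(sample, delimiters=';,')
print(dialect.delimiter)  # ;
```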

Parameters thousands='.' and decimal=','

Most European countries also use the dot to group thousands and the comma as the decimal separator. This great article explains why.
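A minimal sketch of just these two parameters, using an inline sample instead of the file:

```python
import io

import pandas as pd

# With thousands='.' and decimal=',', pandas parses '1.000.000,11'
# directly as the float 1000000.11.
data = "Datum;Zahlen\n01.01.2020; 1.000.000,11 \n02.01.2020; 12,34 \n"
df = pd.read_csv(io.StringIO(data), sep=';', thousands='.', decimal=',')
print(df['Zahlen'].dtype)  # float64
```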

Parameter encoding='latin1'

If you look up German encodings in the Python documentation, you will find the codec 'cp273' for the German language, but it is rarely used. You should be fine with 'latin1' for Western Europe. Using this codec also benefits from an internal optimization in CPython:

CPython implementation detail : Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.
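The UnicodeDecodeError from the 1st attempt illustrates why the encoding matters: byte 0xfc is 'ü' in latin1 (and in cp1252, the encoding the file was actually saved with), but it is not a valid start byte in UTF-8. A quick check:

```python
# 0xfc decodes to 'ü' in latin1 and cp1252 ...
print(b'\xfc'.decode('latin1'))   # ü
print(b'\xfc'.decode('cp1252'))   # ü

# ... but UTF-8 rejects it, which is exactly the error from the 1st attempt.
try:
    b'\xfc'.decode('utf-8')
except UnicodeDecodeError as err:
    print(err.reason)  # invalid start byte
```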

For further reading, see this SO post and Joel Spolsky's blog.

Parameter converters=converters

Converters are underappreciated by most pandas users. They look like a complicated solution to a simple problem: why not just call pd.to_datetime() after reading the file? Because you want to separate reading your input from processing the data (see the IPO model).

I have seen (and written) something like this so many times:

  df = pd.read_csv('test.csv')
  df['Revenue'] = df['Price'] * df['Quantity']  # I don't have to clean up all columns. I just need the revenue.
  (...)  # Some other code

  # Plotting revenue
  df['Revenue'] = df['Revenue'] / 1000
  df['Date'] = pd.to_datetime(df['Date'])  # Oh, the dates are still strings. I can fix this easily before plotting.

In the next iteration you may move pd.to_datetime() up. Or maybe not, and that probably results in some unexpected behavior. Two months after you wrote this kind of code, you see only a long sequence of unstructured pandas operations and you think, "This is a mess."

There are several ways to clean up your dataframe afterwards. But why not use the built-in converters? If you define dtypes and converters for every single column of your dataframe, you don't have to look back (in anger). You stand on firm ground after calling pd.read_csv().

Be aware that converters only accept callables. That is why I used a lambda function in the converter: otherwise I could not pass the format parameter to pd.to_datetime().
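Converters map column names to callables, so a named function works just as well as the lambda; the lambda is merely the shortest way to bind the format argument. A sketch:

```python
import pandas as pd

# A named function equivalent to the lambda used in the solution.
def parse_german_date(value):
    return pd.to_datetime(value, format='%d.%m.%Y')

converters = {'Datum': parse_german_date}

# pandas calls the converter once per cell of the 'Datum' column:
print(converters['Datum']('24.12.2020'))  # 2020-12-24 00:00:00
```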

Learn more about converters in the documentation and in this SO post.

Parameter index_col='Datum'

This just defines the index column. It is handy because the alternative, df = df.set_index('Datum'), is not as pretty. Like the converters, it also helps separate the input block from the data processing.
