
How to open a German csv file with pandas?

Question

What is the best way to open a German csv file with pandas?

I have a German csv file with the following columns:

  • Datum: Date in the format 'DD.MM.YYYY'
  • Umlaute: German names with special characters specific to the German language
  • Zahlen: Numbers in the format '000.000,00'

My expected output is:

            Umlaute      Zahlen
Datum                          
2020-01-01  Rüdiger  1000000.11
2020-01-02  Günther       12.34
2020-01-03   Jürgen      567.89

Sample data is provided below (see File).


1st attempt: Use pd.read_csv() without parameters

    df = pd.read_csv('german_csv_test.csv')

This throws a UnicodeDecodeError:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 12: invalid start byte

2nd attempt: Use pd.read_csv() with encoding and separator specified

    df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')

This throws no error, but it is far from my desired output:

  • The dates are strings, not datetimes.
  • The numbers aren't floats, but objects.
  • The column 'Datum' is not the index.
        Datum  Umlaute          Zahlen
0  01.01.2020  Rüdiger   1.000.000,11 
1  02.01.2020  Günther          12,34 
2  03.01.2020   Jürgen         567,89 

3rd attempt: Cleaning up

    df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')
    df['Datum'] = pd.to_datetime(df['Datum'])
    df = df.set_index('Datum')
    df['Zahlen'] = pd.to_numeric(df['Zahlen'])

Now I have four lines of code, and it still does not work. The last line throws ValueError: Unable to parse string " 1.000.000,11 " at position 0. If I comment the last line out, it works. But the dates are still wrong, because day and month are switched.

            Umlaute          Zahlen
Datum                              
2020-01-01  Rüdiger   1.000.000,11 
2020-02-01  Günther          12,34 
2020-03-01   Jürgen         567,89 

File

My file german_csv_test.csv looks like this:

Datum;Umlaute;Zahlen
01.01.2020;Rüdiger; 1.000.000,11 
02.01.2020;Günther; 12,34 
03.01.2020;Jürgen; 567,89 

It is encoded as 'cp1252'. I saved it on Windows with the option "CSV (MS-DOS)".

Solution

    import pandas as pd

    converters = {'Datum': lambda x: pd.to_datetime(x, format='%d.%m.%Y')}
    df1 = pd.read_csv('german_csv_test.csv', sep=';', thousands='.', decimal=',', encoding='latin1',
                      converters=converters, index_col='Datum')

German csv files are tricky: they look fine at first glance, but the data types are all wrong, and the switch between month and day can be frustrating. The above parameters work for a wide range of European csv files. In the following, I'll explain every parameter.
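But first, a quick sanity check of the result — a sketch, assuming the sample file from the question:

    print(df1.dtypes)   # expected: Umlaute object, Zahlen float64
    print(df1.index)    # expected: DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03'], ...)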

Parameter sep=';'

Almost all German csv files use the semicolon ';' as the separator character. This holds for most European countries. You could argue that this is wrong, because csv means "comma separated values". But this is not about right or wrong, it is about convention. And you could say that csv stands for "character separated values".
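If you are not sure which separator a file uses, you can let pandas sniff it — a small sketch, assuming the sample file from the question (sep=None is only supported by the slower Python engine; df_sniffed is just an illustrative name):

    # Let read_csv detect the delimiter via csv.Sniffer (Python engine only)
    df_sniffed = pd.read_csv('german_csv_test.csv', sep=None, engine='python', encoding='latin1')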

Parameters thousands='.' and decimal=','

Also, most European countries use the dot to group thousands and the comma to separate the decimals. This great article explains why.
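To see what these two parameters do, here is a rough manual equivalent for a single value — approximately the conversion pandas performs for you when thousands='.' and decimal=',' are set:

    value = ' 1.000.000,11 '
    as_float = float(value.strip().replace('.', '').replace(',', '.'))  # 1000000.11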

Parameter encoding='latin1'

If you look up the German encoding in the Python documentation, you will see the codec 'cp273' for the German language. It is rarely used. You should be fine with 'latin1' for Western Europe. Using this codec also benefits from an internal optimization in CPython:

CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.

For further reading, look up this SO post and Joel Spolsky's blog.
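The UnicodeDecodeError from the first attempt comes from the byte 0xfc, which encodes 'ü' in latin1 and cp1252 but is not a valid start byte in UTF-8 — a quick demonstration:

    print(b'\xfc'.decode('latin1'))    # ü
    print(b'\xfc'.decode('cp1252'))    # ü
    try:
        b'\xfc'.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)                       # 'utf-8' codec can't decode byte 0xfc ... invalid start byte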

Parameter converters=converters

Converters are underappreciated by most pandas users. They look like a complicated solution to a simple problem. Why not use pd.to_datetime() after reading the file? Because you want to separate your input from processing the data (see IPO model).

I have seen (and written) something like this so many times:

    df = pd.read_csv('test.csv')
    df['Revenue'] = df['Price'] * df['Quantity']  # I don't have to clean up all columns. I just need the revenue.
    (...)  # Some other code

    # Plotting revenue
    df['Revenue'] = df['Revenue'] / 1000
    df['Date'] = pd.to_datetime(df['Date'])  # Oh, the dates are still strings. I can fix this easily before plotting.

In the next iteration you may move pd.to_datetime() up. But maybe not. And probably this results in some unexpected behavior. Two months after you wrote this kind of code, you just see a long sequence of unstructured pandas operations and think: "This is a mess."

There are several ways to clean your dataframe. But why not use the built-in converters? If you define dtypes and converters for every single column of your dataframe, you don't have to look back (in anger). You stand on firm ground after calling pd.read_csv().

Be aware that converters only accept functions. This is why I have used a lambda function in the converter; otherwise I could not specify the format parameter.

Learn more about converters in the documentation and in this SO post.
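If you prefer not to write a converter, a sketch of an alternative that lets read_csv parse the dates itself (df2 is just an illustrative name; dayfirst=True tells pandas to read 'DD.MM.YYYY' with the day first):

    df2 = pd.read_csv('german_csv_test.csv', sep=';', thousands='.', decimal=',', encoding='latin1',
                      parse_dates=['Datum'], dayfirst=True, index_col='Datum')

The converter with an explicit format is still the stricter choice, because dayfirst is only a hint to the date parser.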

Parameter index_col='Datum'

This just defines the index column. It is handy because the alternative df = df.set_index('Datum') is not that pretty. Also, like the converters, it helps to separate the input block from the data processing.
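Because the converter turns the 'Datum' values into datetimes, the resulting DatetimeIndex also supports partial-date selection — a small sketch, assuming df1 from the solution above:

    january = df1.loc['2020-01']       # all rows from January 2020
    first_day = df1.loc['2020-01-01']  # the row(s) for a single day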
