How to open a German csv file with pandas?
What is the best way to open a German csv file with pandas?

I have a German csv file with the columns Datum, Umlaute and Zahlen. My expected output is:
Umlaute Zahlen
Datum
2020-01-01 Rüdiger 1000000.11
2020-01-02 Günther 12.34
2020-01-03 Jürgen 567.89
Sample data is provided below (see File).
df = pd.read_csv('german_csv_test.csv')
This throws a UnicodeDecodeError:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 12: invalid start byte
df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')
This throws no error, but it is far from my desired output:
Datum Umlaute Zahlen
0 01.01.2020 Rüdiger 1.000.000,11
1 02.01.2020 Günther 12,34
2 03.01.2020 Jürgen 567,89
df = pd.read_csv('german_csv_test.csv', sep=';', encoding='latin1')
df['Datum'] = pd.to_datetime(df['Datum'])
df = df.set_index('Datum')
df['Zahlen'] = pd.to_numeric(df['Zahlen'])
Now I have four lines of code, and it still does not work. The last line throws an error:

ValueError: Unable to parse string " 1.000.000,11 " at position 0

If I comment the last line out, it works. But the dates are still wrong, because day and month are switched:
Umlaute Zahlen
Datum
2020-01-01 Rüdiger 1.000.000,11
2020-02-01 Günther 12,34
2020-03-01 Jürgen 567,89
My file german_csv_test.csv looks like this:
Datum;Umlaute;Zahlen
01.01.2020;Rüdiger; 1.000.000,11
02.01.2020;Günther; 12,34
03.01.2020;Jürgen; 567,89
It is encoded as 'cp1252'. I saved it on Windows with the option "CSV (MS-DOS)".
converters = {'Datum': lambda x: pd.to_datetime(x, format='%d.%m.%Y')}
df1 = pd.read_csv('german_csv_test.csv', sep=';', thousands='.', decimal=',', encoding='latin1',
                  converters=converters, index_col='Datum')
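To try the effect of these parameters without the file, here is a self-contained sketch that feeds the sample data from an in-memory string instead of german_csv_test.csv (reading from a string skips the byte-decoding step, so encoding='latin1' is not needed there):

```python
import pandas as pd
from io import StringIO

# The sample data from the question, as an in-memory string.
data = """Datum;Umlaute;Zahlen
01.01.2020;Rüdiger; 1.000.000,11
02.01.2020;Günther; 12,34
03.01.2020;Jürgen; 567,89
"""

converters = {'Datum': lambda x: pd.to_datetime(x, format='%d.%m.%Y')}
df1 = pd.read_csv(StringIO(data), sep=';', thousands='.', decimal=',',
                  converters=converters, index_col='Datum')

print(df1)
print(df1.dtypes)  # Zahlen should come out as float64
```

The index now holds real timestamps and Zahlen is numeric, so plotting and arithmetic work right away.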
German csv files are tricky because they look fine at first glance, but the data types are all wrong, and the switch between month and day can be frustrating. The above parameters work for a wide range of European csv files. In the following I'll explain every parameter.
sep=';'
Almost all German csv files use the semicolon ';' as separation character. This holds for most European countries. You could argue that this is wrong, because csv means "comma separated values". But this is not about right or wrong, it is about convention. And you could say that csv stands for "character separated values".
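If you receive files from several countries and don't know the separator in advance, the standard library's csv.Sniffer can often guess it. A small sketch (not part of the original answer), using the first lines of the sample file:

```python
import csv

# First lines of the file, already decoded to str.
sample = ("Datum;Umlaute;Zahlen\n"
          "01.01.2020;Rüdiger; 1.000.000,11\n"
          "02.01.2020;Günther; 12,34\n")

# Restrict the candidates to the separators you expect.
dialect = csv.Sniffer().sniff(sample, delimiters=';,')
print(dialect.delimiter)  # ';'
```

The detected delimiter can then be passed straight to pd.read_csv via sep=.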
thousands='.' and decimal=','
Also, most European countries use the dot to group thousands and the comma to separate the decimals. This great article explains why.
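For comparison, this is roughly what thousands='.' and decimal=',' save you from doing by hand afterwards, shown here with the sample values from the question:

```python
import pandas as pd

# German-formatted numbers that were read in as plain strings.
s = pd.Series([' 1.000.000,11', ' 12,34', ' 567,89'])

# Drop the thousands dots, turn the decimal comma into a dot, then parse.
cleaned = pd.to_numeric(s.str.strip()
                         .str.replace('.', '', regex=False)
                         .str.replace(',', '.', regex=False))
print(cleaned.tolist())
```

Two chained string replacements per column, for something read_csv does in one pass.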
encoding='latin1'
If you look up the German encoding in the Python documentation, you will see the codec 'cp273' for the German language. It is rarely used. You should be fine with 'latin1' for Western Europe. Using this codec benefits from an internal optimization in CPython:

CPython implementation detail: Some common encodings can bypass the codecs lookup machinery to improve performance. These optimization opportunities are only recognized by CPython for a limited set of (case insensitive) aliases: utf-8, utf8, latin-1, latin1, iso-8859-1, iso8859-1, mbcs (Windows only), ascii, us-ascii, utf-16, utf16, utf-32, utf32, and the same using underscores instead of dashes. Using alternative aliases for these encodings may result in slower execution.
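The byte from the traceback in the question shows why the encoding matters: 0xfc is 'ü' in latin1 (and in cp1252), but it is not a valid start byte in UTF-8. A quick check:

```python
# 'Rüdiger' as it sits on disk in a latin1/cp1252 file.
raw = b'R\xfcdiger'

print(raw.decode('latin1'))   # works
print(raw.decode('cp1252'))   # works too; both map 0xfc to 'ü'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)  # the same kind of error as in the question
```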
For further reading, look up this SO post and Joel Spolsky's blog.
converters=converters
Converters are underappreciated by most pandas users. It looks like a complicated solution to a simple problem. Why not use pd.to_datetime() after reading the file? Because you want to separate your input from processing the data (see IPO model).
I have seen (and written) something like this so many times:
df = pd.read_csv('test.csv')
df['Revenue'] = df['Price'] * df['Quantity'] # I don't have to clean up all columns. I just need the revenue.
(...) # Some other code
# Plotting revenue
df['Revenue'] = df['Revenue'] / 1000
df['Date'] = pd.to_datetime(df['Date']) # Oh, the dates are still strings. I can fix this easily before plotting.
In the next iteration you may move pd.to_datetime() up. But maybe not. And probably this results in some unexpected behavior. Two months after you wrote this kind of code, you just see a long sequence of unstructured pandas operations and you think: "This is a mess."
There are several methods to clean your dataframe. But why not use the built-in converters? If you define dtypes and converters for every single column of your dataframe, you don't have to look back (in anger). You stand on firm ground after calling pd.read_csv().
Be aware that converters only accept functions. This is why I used a lambda function in the converter. Otherwise I could not specify the format parameter.
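To see why the lambda is needed: a converter must be a callable that receives each cell as a single string, while pd.to_datetime needs the extra format argument bound to it. A minimal sketch (the column names here are made up for illustration):

```python
import pandas as pd
from io import StringIO

data = "Datum;Wert\n01.01.2020;a\n02.01.2020;b\n"

# pd.to_datetime alone won't do: format='%d.%m.%Y' must be bound first.
parse_german_date = lambda cell: pd.to_datetime(cell, format='%d.%m.%Y')

df = pd.read_csv(StringIO(data), sep=';', converters={'Datum': parse_german_date})
print(df['Datum'].tolist())
```

functools.partial(pd.to_datetime, format='%d.%m.%Y') would work just as well as the lambda.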
Learn more about converters in the documentation and in this SO post.
index_col='Datum'
This just defines the index column. It is handy because the alternative df = df.set_index('Datum') is not that pretty. Also, it helps - like the converters - to separate the input block from the data processing.
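Both routes end up with the same frame; a quick equivalence check on a tiny made-up sample:

```python
import pandas as pd
from io import StringIO

data = "Datum;Zahlen\n01.01.2020;1\n02.01.2020;2\n"

via_index_col = pd.read_csv(StringIO(data), sep=';', index_col='Datum')
via_set_index = pd.read_csv(StringIO(data), sep=';').set_index('Datum')

print(via_index_col.equals(via_set_index))  # True
```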