Problem reading a German CSV file in Python

Question

I have a German CSV file that I want to read with pd.read_csv.

Data:

The original file looks like this:

So it has two columns (A, B), and the separator should be ';'.

Problem: When I ran the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep=';')

I get the error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

Half-Solution: I understand this could have several causes, but when I ran the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter')

I get the following dataset back:

    0
0   Etat;Die ARD-Tochter Degeto hat sich verpflich...
1   Etat;App sei nicht so angenommen worden wie ge...
2   Etat;'Zum Welttag der Suizidprävention ist es ...
3   Etat;Mitarbeiter überreichten Eigentümervertre...
4   Etat;Service: Jobwechsel in der Kommunikations...

So I only get one column instead of the two desired columns.

Target: Any idea how to load the dataset correctly, so that I get:

    0       1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...

Hints/Tries:

When I search my data in Excel, I also do not find any ; in it.

It seems that some lines have more than two columns (as you can see, for example, in lines 3 and 13 of my example).

Answer 1

Skim through your texts carefully. If you find no leads, follow the solution below.

Note: This is not a perfect solution but a hack. It has worked for me multiple times with German text, since I found no other solution.

I read each line as a single string and then split it into the two desired columns at the first occurrence of the delimiter.

 # n=1 splits only at the first ';' (n must be passed by keyword in pandas 2.x)
 df['col1'] = df[0].str.split(';', n=1).str[0]
 df['col2'] = df[0].str.split(';', n=1).str[1]

Output:

    0                             col1  col2
0   Etat;Die ARD-Tochter..        Etat  Die ARD-Tochter
1   Etat;App sei nicht...         Etat  App sei nicht
2   Etat;Mitarbeiter überreich..  Etat  Mitarbeiter überreich

I trimmed the texts to keep the example short.
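The same idea can also be sketched without the pandas parser, by reading the file line by line and using str.partition, which splits at the first separator only. This is a swapped-in variant of the hack above, not the answerer's exact code; the sample string is made up and stands in for articles.csv.

```python
import io
import pandas as pd

# Hypothetical sample standing in for the real file; the third line
# contains an extra ';' inside the article text.
sample = (
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Zum Welttag; der Suizidprävention\n"
)

rows = []
for line in io.StringIO(sample):
    # partition() splits at the FIRST ';' only; later ones stay in the text.
    category, _, text = line.rstrip("\n").partition(";")
    rows.append((category, text))

df = pd.DataFrame(rows, columns=["col1", "col2"])
print(df)
```

For a real file, io.StringIO(sample) would be replaced by open(path, encoding=...) with whatever encoding the file actually uses.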

Answer 2

One possible solution is to create a one-column DataFrame by using a separator that does not occur in the data, such as delimiter, and then use Series.str.split with the n parameter and expand=True to build a new DataFrame:

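import pandas as pd

# 'delimiter' is longer than one character, so pandas treats it as a regex,
# which requires the Python parsing engine.
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter',
                      engine='python')

# A more general solution is to use a character that does NOT occur in the data,
# such as the yen sign ¥:
# dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
#                       encoding='utf-8', header=None, sep='¥', engine='python')

# Split each line at the first ';' only and expand into two columns.
df = dataset[0].str.split(';', n=1, expand=True)
df.columns = ['A', 'B']
print(df)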
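To try this approach without the original file, here is a minimal self-contained sketch. The sample text is invented for illustration, and the choice of '¥' as the unused separator assumes that character never appears in the data.

```python
import io
import pandas as pd

# Hypothetical sample; the second line has an extra ';' inside the text.
sample = (
    "Etat;Die ARD-Tochter Degeto\n"
    "Etat;App sei nicht; so angenommen\n"
)

# Read every line into a single column by using a separator that is
# assumed not to occur in the data.
raw = pd.read_csv(io.StringIO(sample), header=None, sep="¥", engine="python")

# n=1 limits the split to the first ';', expand=True returns a DataFrame.
df = raw[0].str.split(";", n=1, expand=True)
df.columns = ["A", "B"]
print(df)
```

Because the split is limited to one occurrence, any further ';' characters remain part of column B.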

Answer 3

This works for me:

import pandas as pd
df = pd.read_csv('german.txt', sep=';', header=None, encoding='iso-8859-1')
df

Output:

       0    1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...
2   Etat    'Zum Welttag der Suizidprävention ist es ...
3   Etat    Mitarbeiter überreichten Eigentümervertre...
4   Etat    Service: Jobwechsel in der Kommunikations...
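Whether a plain sep=';' read succeeds depends on the file: a row with more fields than the first row still raises the ParserError from the question. A small sketch for locating such rows with the standard csv module follows; the function name and the sample text are hypothetical.

```python
import csv
import io

# Hypothetical sample; line 3 has one ';' too many.
sample = (
    "Etat;Text one\n"
    "Etat;Text two\n"
    "Etat;Text; with an extra separator\n"
)

def rows_with_unexpected_fields(handle, expected=2, delimiter=";"):
    """Return (line_number, field_count) for rows that do not have
    the expected number of fields."""
    reader = csv.reader(handle, delimiter=delimiter)
    return [(reader.line_num, len(row)) for row in reader
            if len(row) != expected]

bad = rows_with_unexpected_fields(io.StringIO(sample))
print(bad)
```

If the list comes back empty for the real file, the column count is consistent and the separator is probably not the cause of the error.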


3 answers

Answer 1
3 2019-08-16 09:17:29

Answer 2
2 accepted 2019-08-16 09:19:35

Answer 3
1 2019-08-16 09:28:44
