Problems reading a German CSV file in Python

I have a German CSV file that I want to read with pd.read_csv.

Data:

The original file looks like this:

[image: screenshot of the original file]

So it has two columns (A, B), and the separator should be ';'.

Problem: When I ran the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep=';')

I get the error: ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3

Half-Solution: I understand this could have several causes, but when I ran the command:

dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter')

I get the following dataset back:

    0
0   Etat;Die ARD-Tochter Degeto hat sich verpflich...
1   Etat;App sei nicht so angenommen worden wie ge...
2   Etat;'Zum Welttag der Suizidprävention ist es ...
3   Etat;Mitarbeiter überreichten Eigentümervertre...
4   Etat;Service: Jobwechsel in der Kommunikations...

So I only get one column instead of the two desired columns.

Target: Any idea how to load the dataset correctly, so that I have:

    0       1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...

Hints/Tries:

When I run the search function over my data in Excel, I do not find any ; in it either.

It seems that some lines have more than two columns (as you can see, for example, in lines 3 and 13 of my example).

Skim through your texts carefully. If you find no leads, follow the solution below.


Note: This is not a perfect solution but a hack. It has worked for me multiple times when working with German text, since I found no other solution.

I just read each line as a single string and split it into the two desired columns at the first occurrence of the delimiter.

 # n=1 splits on the first ';' only; n must be passed by keyword in pandas 2.0 and later
 df['col1'] = df[0].str.split(';', n=1).str[0]
 df['col2'] = df[0].str.split(';', n=1).str[1]

Output:

    0                               col1    col2
0   Etat;Die ARD-Tochter..          Etat    Die ARD-Tochter
1   Etat;App sei nicht...           Etat    App sei nicht
2   Etat;Mitarbeiter überreich..    Etat    Mitarbeiter überreich

I trimmed the texts to keep the example short.
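The "skim through your texts" step can also be done programmatically by counting separators per line before handing the file to pandas. The following is only a minimal sketch: the helper name find_suspect_lines and the sample text are made up for illustration, and it assumes that every well-formed row contains exactly one ';' and that no field spans several physical lines.

```python
import io


def find_suspect_lines(lines, sep=";", expected_seps=1):
    """Return (line_number, separator_count, text) for rows with an unexpected count.

    `lines` is any iterable of strings, for example an open text file.
    Line numbers start at 1. This helper is hypothetical, not part of pandas.
    """
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        count = text.count(sep)
        if count != expected_seps:
            suspects.append((number, count, text))
    return suspects


# Made-up sample standing in for the real file; the third row has an
# extra ';' inside the article text.
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Zum Welttag; der Suizidprävention\n"
)

suspects = find_suspect_lines(sample)
for number, count, text in suspects:
    print(f"line {number}: {count} separators: {text}")
```

With a real file, an open text-mode file object (opened with the file's actual encoding) can be passed in place of the StringIO sample. The raw count ignores quoting, so rows with a quoted ';' would also be reported.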

One possible solution is to create a one-column DataFrame by using a separator that does not occur in the data, such as delimiter, and then use Series.str.split with the n parameter and expand=True to build a new DataFrame:

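import pandas as pd

# 'delimiter' does not occur in the data, so each line is read as a single column.
# A multi-character sep is interpreted as a regex, which requires the Python engine.
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter',
                      engine='python')

# A more general approach is to use a character that does NOT occur in the data, such as the yen sign ¥
#dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
#                      encoding='utf-8', header=None, sep='¥')

# Split on the first ';' only (n=1) and expand the result into two columns
df = dataset[0].str.split(';', n=1, expand=True)
df.columns = ['A','B']
print(df)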
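The same split-on-the-first-separator idea can also be sketched without the dummy separator, by reading the raw lines and letting plain Python do the split. This is an illustration rather than the method used above: the function name read_two_columns and the sample text are hypothetical, and it assumes one record per physical line.

```python
import io


def read_two_columns(lines, sep=";"):
    """Split each line at the first `sep` only and return a list of (A, B) tuples.

    Hypothetical helper for illustration. Lines without any `sep` end up
    with an empty string in the second position.
    """
    rows = []
    for line in lines:
        text = line.rstrip("\r\n")
        if not text:
            continue  # skip blank lines
        # str.partition cuts at the first occurrence of sep, so any later
        # ';' stays inside the second field
        category, _, body = text.partition(sep)
        rows.append((category, body))
    return rows


# Made-up sample standing in for the real file
sample = io.StringIO(
    "Etat;Die ARD-Tochter Degeto hat sich verpflichtet\n"
    "Etat;App sei nicht so angenommen worden\n"
    "Etat;Zum Welttag; der Suizidprävention\n"
)

rows = read_two_columns(sample)
for category, body in rows:
    print(category, "|", body)
```

The resulting list can be turned into a DataFrame with pd.DataFrame(rows, columns=['A', 'B']).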

This works for me:

import pandas as pd
df = pd.read_csv('german.txt', sep=';', header = None, encoding='iso-8859-1')
df

Output:

       0    1
0   Etat    Die ARD-Tochter Degeto hat sich verpflich...
1   Etat    App sei nicht so angenommen worden wie ge...
2   Etat    'Zum Welttag der Suizidprävention ist es ...
3   Etat    Mitarbeiter überreichten Eigentümervertre...
4   Etat    Service: Jobwechsel in der Kommunikations...
