Problems reading a German CSV file in Python
I have a German CSV file that I want to read with pd.read_csv.
Data:
The original file looks like this:
So it has two columns (A, B), and the separator should be ';'.
Problem: When I ran the command:
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
encoding='utf-8', header=None, sep=';')
I get the error:
ParserError: Error tokenizing data. C error: Expected 2 fields in line 3, saw 3
Half-Solution: I understand this could have several causes, but when I ran the command:
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
encoding='utf-8', header=None, sep='delimiter')
I get the following dataset back:
0
0 Etat;Die ARD-Tochter Degeto hat sich verpflich...
1 Etat;App sei nicht so angenommen worden wie ge...
2 Etat;'Zum Welttag der Suizidprävention ist es ...
3 Etat;Mitarbeiter überreichten Eigentümervertre...
4 Etat;Service: Jobwechsel in der Kommunikations...
So I only get one column instead of the two desired columns.
Target: Any idea how to load the dataset correctly so that I get:
0 1
0 Etat Die ARD-Tochter Degeto hat sich verpflich...
1 Etat App sei nicht so angenommen worden wie ge...
Hints/Tries:
When I run the search function over my data in Excel, I do not find any ; in it.
It seems that some lines have more than two columns (as you can see, for example, in lines 3 and 13 of my example).
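To see which lines carry extra separators, one could count the semicolons per line before handing the file to pandas. The following is only a minimal sketch on made-up sample text: the helper name find_wide_lines and the sample rows are hypothetical and not taken from the original file, and the count ignores CSV quoting.

```python
import io


def find_wide_lines(handle, sep=';', expected_fields=2):
    """Return (line_number, field_count) for lines with too many fields."""
    wide = []
    for line_number, line in enumerate(handle, start=1):
        # A line with k separators has k + 1 fields (quoting is ignored here)
        field_count = line.rstrip('\n').count(sep) + 1
        if field_count > expected_fields:
            wide.append((line_number, field_count))
    return wide


# Made-up sample: the third line has an extra ';' inside the text
sample = io.StringIO(
    "Etat;Erster Text\n"
    "Etat;Zweiter Text\n"
    "Etat;Dritter Text; mit Semikolon\n"
)
wide_lines = find_wide_lines(sample)
print(wide_lines)
```

For a real file, the StringIO object would be replaced by an open file handle with the appropriate encoding.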
Skim through your texts carefully. If you find no leads, follow the solution below.
I just read the whole thing as is and split each string into the two desired columns by splitting on the first occurrence of the delimiter.
# n=1 splits only on the first ';' (n must be passed by keyword in pandas 2.x)
df['col1'] = df[0].str.split(';', n=1).str[0]
df['col2'] = df[0].str.split(';', n=1).str[1]
Output:
                              0  col1                   col2
0        Etat;Die ARD-Tochter..  Etat        Die ARD-Tochter
1         Etat;App sei nicht...  Etat          App sei nicht
2  Etat;Mitarbeiter überreich..  Etat  Mitarbeiter überreich
I trimmed the texts to keep the example short.
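The split on the first delimiter can be tried without the original file. This is a small sketch on made-up rows (the sample strings are assumptions, not data from the question); it computes the split once and reuses it for both columns.

```python
import pandas as pd

# Made-up one-column frame, as produced by reading each line as a single string
df = pd.DataFrame({0: [
    "Etat;Erster Text",
    "Etat;Zweiter Text; mit Semikolon",
]})

# Split each string at the first ';' only, so later semicolons stay in the text
parts = df[0].str.split(';', n=1)
df['col1'] = parts.str[0]
df['col2'] = parts.str[1]

print(df[['col1', 'col2']])
```

Because n=1 limits the split to one cut, the second row keeps its inner semicolon in col2.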
One possible solution is to create a one-column DataFrame by using a separator that does not occur in the data, such as delimiter, and then use Series.str.split with the n parameter and expand=True to build a new DataFrame:
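import pandas as pd

# 'delimiter' never occurs in the data, so every line ends up in one column
dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
                      encoding='utf-8', header=None, sep='delimiter')
# A more general approach is to use a character that does NOT exist in the data, such as yen ¥
#dataset = pd.read_csv('C:/Users/.../GermanNews/articles.csv',
#                      encoding='utf-8', header=None, sep='¥')
# Split on the first ';' only and expand the result into two columns
df = dataset[0].str.split(';', n=1, expand=True)
df.columns = ['A', 'B']
print(df)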
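If no safe placeholder separator comes to mind, the same idea can be sketched without read_csv at all: read the lines in plain Python and cut each one at the first ';'. This is an alternative sketch, not the method from the answer above; read_two_columns and the sample text are hypothetical.

```python
import io

import pandas as pd


def read_two_columns(handle, sep=';'):
    """Build a two-column DataFrame by cutting each line at the first sep."""
    rows = []
    for line in handle:
        line = line.rstrip('\r\n')
        if not line:
            continue  # skip empty lines
        # partition cuts at the first separator and leaves the rest intact
        first, _, rest = line.partition(sep)
        rows.append((first, rest))
    return pd.DataFrame(rows, columns=['A', 'B'])


# Made-up sample standing in for an opened file
sample = io.StringIO("Etat;Erster Text\nEtat;Zweiter Text; mit Semikolon\n")
df = read_two_columns(sample)
print(df)
```

With a real file, the handle would come from open(path, encoding=...) using whatever encoding the file actually has.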
This works for me:
import pandas as pd
df = pd.read_csv('german.txt', sep=';', header = None, encoding='iso-8859-1')
df
Output:
0 1
0 Etat Die ARD-Tochter Degeto hat sich verpflich...
1 Etat App sei nicht so angenommen worden wie ge...
2 Etat 'Zum Welttag der Suizidprävention ist es ...
3 Etat Mitarbeiter überreichten Eigentümervertre...
4 Etat Service: Jobwechsel in der Kommunikations...
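The encoding argument matters when the file was saved as ISO-8859-1 rather than UTF-8, because umlauts are then stored as single bytes that are not valid UTF-8. The sketch below uses made-up bytes to illustrate this; it is an assumption that the original file is encoded this way, and the encoding alone does not remove extra semicolons inside the text.

```python
import io

import pandas as pd

# Made-up line, encoded the way an ISO-8859-1 file would store it
raw = "Etat;Welttag der Suizidprävention\n".encode('iso-8859-1')

# The single byte for 'ä' is not valid UTF-8, so decoding as UTF-8 fails
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading the same bytes with the matching encoding gives two clean columns
df = pd.read_csv(io.BytesIO(raw), sep=';', header=None, encoding='iso-8859-1')

print(utf8_ok)
print(df)
```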