Using a utf-8 record separator in pandas read_csv
I have a delimited file where the separator character is the NOT character (¬), and I am unable to parse it with pandas. As shown below, the columns are not split properly.
test = pd.read_csv("file.csv", sep="¬", encoding="latin-1")
test.head(1)
0 1231�XXX7791�BBB9991�22999KKKK...
test.shape
Out[128]: (7001001, 1)
I am using IPython 3.2.0, pandas 0.16.2, and Python 2.7.10 on OS X Yosemite.
import pandas as pd
df = pd.read_csv('data.csv', sep='\u00AC', encoding='utf-8', header=None, engine='python')
print(df)
The code above gives me the following, which is what you wanted. You just needed to pass the separator as the correct Unicode character (\u00AC) and read the file as UTF-8.
0 1 2 3
0 1231 XXX7791 BBB9991 22999KKKK
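To see why the encoding argument matters, it can help to look at the bytes behind the NOT sign. The following is a minimal sketch using only the standard library and assuming Python 3. The sample row is made up to match the output above and is not taken from the asker's file.

```python
# The NOT sign is code point U+00AC.
sep = "\u00ac"

# The same character is stored differently depending on the file encoding.
utf8_bytes = sep.encode("utf-8")      # two bytes: 0xC2 0xAC
latin1_bytes = sep.encode("latin-1")  # one byte: 0xAC

# Hypothetical row, encoded as UTF-8 the way it might sit on disk.
row = "1231\u00acXXX7791\u00acBBB9991\u00ac22999KKKK".encode("utf-8")

# Decoding with the matching encoding recovers the separator cleanly.
fields_ok = row.decode("utf-8").split(sep)

# Decoding UTF-8 bytes as latin-1 turns each separator into two characters,
# so a stray U+00C2 is left at the end of every field but the last.
fields_bad = row.decode("latin-1").split(sep)

print(utf8_bytes, latin1_bytes)
print(fields_ok)
print(fields_bad)
```

If the file and the encoding argument disagree, the separator that pandas sees is not the one that was typed, so it is worth checking how the file was actually written before choosing between utf-8 and latin-1.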
You need engine=python because by default pandas uses engine=c, which does not support regex separators or separators that encode to more than one byte (¬ is two bytes in UTF-8).
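The whole approach can be tried end to end without the original file. This is a sketch under assumptions: it targets Python 3 with a recent pandas, writes a made-up one-row sample to a temporary file, and passes dtype=str only to keep the values as text for easy comparison.

```python
import os
import tempfile

import pandas as pd

# Write a small UTF-8 file whose fields are separated by the NOT sign (U+00AC).
# The row is a hypothetical sample, not the asker's data.
path = os.path.join(tempfile.mkdtemp(), "data.csv")
with open(path, "w", encoding="utf-8") as f:
    f.write("1231\u00acXXX7791\u00acBBB9991\u00ac22999KKKK\n")

# Read it back with the Python engine and the matching encoding.
df = pd.read_csv(
    path,
    sep="\u00ac",
    encoding="utf-8",
    header=None,
    engine="python",
    dtype=str,  # keep every column as text (an illustrative choice)
)

print(df.shape)
print(df.iloc[0].tolist())
```

A frame with four columns rather than one indicates that the separator was recognized.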