
Using a utf-8 record separator in pandas read_csv

I have a delimited file in which the separator is the NOT character (¬), and I am unable to parse it with pandas. As shown below, the columns are not split properly.

test = pd.read_csv("file.csv", sep="¬", encoding="latin-1")
test.head(1)
0       1231�XXX7791�BBB9991�22999KKKK... 
test.shape
Out[128]: (7001001, 1)

I am using IPython 3.2.0, pandas 0.16.2, and Python 2.7.10.final.0 on OS X Yosemite.

import pandas as pd

# '\u00AC' is the NOT sign (¬) used as the field separator in the file.
df = pd.read_csv('data.csv', sep='\u00AC', encoding='utf-8', header=None, engine='python')

print(df)

The code above gives me the following output, which is what you wanted. You just needed to pass the separator to sep as the correct Unicode escape (\u00AC).

      0        1        2          3
0  1231  XXX7791  BBB9991  22999KKKK
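
As a side note on why the encoding matters here: the NOT sign (U+00AC) is a single byte in latin-1 but two bytes in UTF-8, so a separator and a file that disagree on encoding will not match byte for byte. Whether that is what happened in the question is an assumption, not something the post confirms. The sample line below is made up for illustration.

```python
# U+00AC is the NOT sign used as the separator in the question.
sep = "\u00ac"

# The same character has different byte representations per encoding.
latin1_bytes = sep.encode("latin-1")
utf8_bytes = sep.encode("utf-8")
print(latin1_bytes)  # b'\xac'
print(utf8_bytes)    # b'\xc2\xac'

# Hypothetical latin-1 encoded line: decoding it with the matching
# encoding first lets the separator be matched as a character.
line = "1231\u00acXXX7791".encode("latin-1")
fields = line.decode("latin-1").split(sep)
print(fields)  # ['1231', 'XXX7791']
```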

You need engine='python' because pandas uses engine='c' by default, which does not support regex separators.
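
For anyone who wants to try this without the original file, here is a minimal sketch that reads an in-memory sample instead of data.csv. The sample record is invented to match the output shown above, and dtype=str is an extra choice made here so that every field stays a string; neither comes from the original answer.

```python
import io

import pandas as pd

# Hypothetical one-record sample whose fields are separated by U+00AC.
raw = "1231\u00acXXX7791\u00acBBB9991\u00ac22999KKKK\n"

df = pd.read_csv(
    io.StringIO(raw),   # in-memory stand-in for the real file
    sep="\u00ac",       # the NOT sign as a Unicode escape
    header=None,        # the sample has no header row
    engine="python",    # the engine recommended in the answer
    dtype=str,          # keep all fields as strings (choice made for this sketch)
)

print(df.shape)
print(df.iloc[0].tolist())
```

If the separator is matched correctly, the record is split into four columns rather than one.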

[Image: screenshot from IPython]
