Using a UTF-8 record separator in pandas read_csv

Question

I have a delimited file where the separator is the NOT character (¬), and I am unable to parse it using pandas. As shown below, the columns are not split properly.

test = pd.read_csv("file.csv", sep="¬", encoding="latin-1")
test.head(1)
0       1231�XXX7791�BBB9991�22999KKKK... 
test.shape
Out[128]: (7001001, 1)

I am using IPython 3.2.0, pandas 0.16.2, and Python 2.7.10.final.0 on OS X Yosemite.

Answer 1

import pandas as pd

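# '\u00AC' is the Unicode escape for the NOT sign (¬); header=None because the file has no header row
df = pd.read_csv('data.csv', sep='\u00AC', encoding='utf-8', header=None, engine='python')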

print(df)

The code above gives me the following output, which is what you wanted. You just needed to pass the separator as its Unicode escape and read the file with the UTF-8 encoding.

      0        1        2          3
0  1231  XXX7791  BBB9991  22999KKKK
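
To see why the encoding and the separator have to agree, here is a small sketch using only the standard library. It assumes Python 3 and a made-up sample line; it only illustrates how the NOT sign is stored in the two encodings and is not a diagnosis of the original file.

```python
sep = "\u00AC"  # the NOT sign (¬)

# The same character is one byte in Latin-1 but two bytes in UTF-8.
utf8_bytes = sep.encode("utf-8")      # b'\xc2\xac'
latin1_bytes = sep.encode("latin-1")  # b'\xac'

# A made-up line, stored as UTF-8 bytes.
line = "1231\u00ACXXX7791".encode("utf-8")

# Decoding with the wrong codec leaves a stray character in the first field.
misdecoded = line.decode("latin-1")
print(misdecoded.split(sep))

# Decoding with the matching codec splits cleanly.
correct = line.decode("utf-8")
print(correct.split(sep))
```

So if the file is UTF-8 but is read as Latin-1 (or the other way round), the separator bytes on disk no longer line up with the separator you pass in.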

You need engine='python' because pandas uses engine='c' by default, and the C engine does not support regex separators or separators that are longer than one byte once encoded (¬ is two bytes in UTF-8).
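
If you want to try this without the original file, the following is a minimal, self-contained sketch. It assumes Python 3 and a recent pandas; the two sample rows and the temporary file path are made up for illustration, modelled on the output shown above.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows, delimited with the NOT sign (U+00AC).
rows = [
    "1231\u00ACXXX7791\u00ACBBB9991\u00AC22999KKKK",
    "1232\u00ACXXX7792\u00ACBBB9992\u00AC22999LLLL",
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")

    # Write the sample file explicitly as UTF-8.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(rows) + "\n")

    # Read it back with the same encoding and the Python parsing engine.
    df = pd.read_csv(path, sep="\u00AC", encoding="utf-8",
                     header=None, engine="python")

print(df.shape)
print(df)
```

The point is only that the encoding used to write the file, the encoding passed to read_csv, and the separator all refer to the same character.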


Answer 1
0 2015-10-01 15:10:57
