Problems reading CSV file with commas and characters in pandas
I am trying to read a CSV file using pandas. The file has a column called Tags, which consists of user-provided tags such as -, "", '', 1950's, and 16th-century. Since these are user-provided, many special characters have been entered by mistake as well. The issue is that I cannot open the CSV file using pandas read_csv. It shows the error: CParserError, error tokenizing data. Can someone help me read the CSV file into pandas?
Okay. Starting from a badly formatted CSV that we can't read:
>>> !cat unquoted.csv
1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
>>> pd.read_csv("unquoted.csv", header=None)
Traceback (most recent call last):
  File "<ipython-input-40-7d9aadb2fad5>", line 1, in <module>
    pd.read_csv("unquoted.csv", header=None)
  [...]
  File "parser.pyx", line 1572, in pandas._parser.raise_parser_error (pandas/src/parser.c:17041)
CParserError: Error tokenizing data. C error: Expected 4 fields in line 2, saw 6
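The error says the second line has 6 fields where 4 were expected. To see which rows deviate in a larger file, one option is to count the fields per row with the standard csv module. This is only a rough sketch: the sample text is copied inline from unquoted.csv so it runs on its own, and `field_counts` is a helper name made up for illustration, not part of pandas or csv.

```python
import csv
import io
from collections import Counter

# Sample data copied from unquoted.csv, kept inline so the sketch is self-contained.
SAMPLE = '''1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
'''


def field_counts(lines):
    """Return (row_number, number_of_fields) pairs, numbering rows from 1.

    Hypothetical helper for illustration; `lines` is any iterable of text lines.
    """
    return [(i, len(row)) for i, row in enumerate(csv.reader(lines), start=1)]


counts = field_counts(io.StringIO(SAMPLE))

# Assumption: the most common field count is the "expected" one.
expected = Counter(n for _, n in counts).most_common(1)[0][0]
odd_rows = [i for i, n in counts if n != expected]

print(counts)
print(odd_rows)
```

For a real file, passing an open file object (opened with newline="") instead of the io.StringIO wrapper should work the same way.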
We can write a cleaner version of the file, taking advantage of the fact that the last three columns are well-behaved:
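import csv

# On Python 3, open CSV files in text mode with newline="" as the csv docs recommend.
with open("unquoted.csv", newline="") as infile, open("quoted.csv", "w", newline="") as outfile:
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    for line in reader:
        # Join everything except the last three fields back into a single Tags field.
        new_row = [','.join(line[:-3])] + line[-3:]
        # The writer quotes the joined field whenever it contains a comma.
        writer.writerow(new_row)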
which produces
>>> !cat quoted.csv
1950's,xyz.nl/user_003,bad, 123
"17th,red,flower",xyz.nl/user_001,good,203
,xyz.nl/user_239,not very,345
and then we can read it:
>>> pd.read_csv("quoted.csv", header=None)
                 0                1         2    3
0           1950's  xyz.nl/user_003       bad  123
1  17th,red,flower  xyz.nl/user_001      good  203
2              NaN  xyz.nl/user_239  not very  345
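If you would rather not write an intermediate file, the same repair can be done in memory. The following is a minimal sketch under the same assumption as before, namely that the last three fields of every row are well-behaved. The names `requote` and `n_trailing` are invented for this example, and the guard against short rows is an addition of mine rather than part of the file-based version.

```python
import csv
import io

# Sample data copied from unquoted.csv, kept inline so the sketch is self-contained.
SAMPLE = '''1950's,xyz.nl/user_003,bad, 123
17th,red,flower,xyz.nl/user_001,good,203
"",xyz.nl/user_239,not very,345
'''


def requote(text, n_trailing=3):
    """Merge all but the last `n_trailing` fields of each row into one field.

    Hypothetical helper: returns the repaired CSV as a string and raises
    ValueError when a row is too short for the trick to apply.
    """
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for number, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) <= n_trailing:
            raise ValueError(f"row {number} has only {len(row)} fields")
        # Rejoin the leading fields; the writer quotes the result if it contains a comma.
        writer.writerow([",".join(row[:-n_trailing])] + row[-n_trailing:])
    return out.getvalue()


repaired = requote(SAMPLE)
print(repaired)
```

The returned string can then be wrapped in io.StringIO and handed to pd.read_csv with header=None, since read_csv accepts file-like objects as well as paths.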
I'd look into fixing this problem at the source and getting the data in a tolerable format, though. Tricks like this shouldn't be necessary, and the file could easily have been broken in a way that was impossible to repair.