
Pandas: How to workaround "error tokenizing data"?

A lot of questions have already been asked about this topic on SO (and many others). Among the numerous answers, none of them was really helpful to me so far. If I missed a useful one, please let me know.

I simply would like to read a CSV file with pandas into a dataframe. Sounds like a simple task.

My file Test.csv:

1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4

My code:

import pandas as pd
df = pd.read_csv('Test.csv',header=None)

My error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6

My guess about the issue is that Pandas looks at the first line and expects the same number of tokens in the following rows. If this is not the case, it stops with an error.

In the numerous answers, the suggested options are, e.g.: error_bad_lines=False, or header=None, or skiprows=3, and more non-helpful suggestions.

However, I don't want to ignore or skip any lines. And I don't know in advance how many columns and rows the data file has.

So it basically boils down to how to find the maximum number of columns in the datafile. Is this the way to go? I hoped there was an easy way to simply read a CSV file which does not have the maximum column number in the first line. Thank you for any hints. I'm using Python 3.6.3, Pandas 0.24.1 on Win7.
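To illustrate the pre-scan idea from the question, here is a minimal sketch: scan the file once with the stdlib csv module to find the widest row, then pass that many explicit column names to read_csv so shorter rows are padded with NaN instead of raising ParserError. The sample file is recreated inline; the filename is illustrative.

```python
import csv
import pandas as pd

# recreate the sample data from the question
sample = "1,2,3,4,5\n1,2,3,4,5,6\n,,3,4,5\n1,2,3,4,5,6,7\n,2,,4\n"
with open('Test.csv', 'w') as f:
    f.write(sample)

# pre-scan: find the maximum number of fields in any row
with open('Test.csv', newline='') as f:
    max_cols = max(len(row) for row in csv.reader(f))

# giving read_csv that many column names makes ragged rows parse;
# missing trailing fields become NaN
df = pd.read_csv('Test.csv', header=None, names=range(max_cols))
print(df.shape)  # (5, 7)
```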

Thank you @ALollz for the "very fresh" link (lucky coincidence) and @Rich Andrews for pointing out that my example actually is not "strictly correct" CSV data.

So, the way it works for me for the time being is adapted from @ALollz' compact solution ( https://stackoverflow.com/a/55129746/7295599 ):

### reading an "incorrect" CSV to dataframe having a variable number of columns/tokens 
import pandas as pd

df = pd.read_csv('Test.csv', header=None, sep='\n')
df = df[0].str.split(',', expand=True)
# ... do some modifications with df
### end of code

df contains the empty string '' for missing entries at the beginning and in the middle, and None for missing tokens at the end.

   0  1  2  3     4     5     6
0  1  2  3  4     5  None  None
1  1  2  3  4     5     6  None
2        3  4     5  None  None
3  1  2  3  4     5     6     7
4     2     4  None  None  None

If you write this back to a file via:

df.to_csv("Test.tab", sep="\t", header=False, index=False)

1   2   3   4   5       
1   2   3   4   5   6   
        3   4   5       
1   2   3   4   5   6   7
    2       4           

None will be converted to the empty string '' and everything is fine.

The next level would be to account for quoted data strings which contain the separator, but that's another topic.

1,2,3,4,5
,,3,"Hello, World!",5,6
1,2,3,4,5,6,7
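For what it's worth, the stdlib csv module already handles quoted fields containing the separator, so a pre-scan built on csv.reader copes with this case. A minimal sketch, with the sample above inlined:

```python
import csv
import io

data = '1,2,3,4,5\n,,3,"Hello, World!",5,6\n1,2,3,4,5,6,7\n'

# csv.reader respects quoting rules, so the embedded comma
# inside "Hello, World!" does not split the field
rows = list(csv.reader(io.StringIO(data)))
print(rows[1])  # ['', '', '3', 'Hello, World!', '5', '6']
```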

In my case: 1. I opened the *.csv in Excel; 2. I saved the *.csv as CSV (comma-delimited); 3. I loaded the file in python via:

import pandas as pd
df = pd.read_csv('yourcsvfile.csv', sep=',')

Hope it helps!

Read the csv using the tolerant Python csv module, and fix the loaded data prior to handing it off to pandas, which will fail on the otherwise malformed csv data regardless of the csv engine pandas uses.

import pandas as pd
import csv

not_csv = """1,2,3,4,5
1,2,3,4,5,6
,,3,4,5
1,2,3,4,5,6,7
,2,,4
"""

with open('not_a.csv', 'w') as csvfile:
    csvfile.write(not_csv)

d = []
with open('not_a.csv') as csvfile:
    areader = csv.reader(csvfile)
    max_elems = 0
    for row in areader:
        if max_elems < len(row): max_elems = len(row)
    csvfile.seek(0)
    for i, row in enumerate(areader):
        # fix my csv by padding the rows
        d.append(row + ["" for x in range(max_elems-len(row))])

df = pd.DataFrame(d)
print(df)

# the default engine
# provides "pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 2, saw 6 "
#df = pd.read_csv('Test.csv',header=None, engine='c')

# the python csv engine
# provides "pandas.errors.ParserError: Expected 6 fields in line 4, saw 7 "
#df = pd.read_csv('Test.csv',header=None, engine='python')

Preprocess the file outside of Python if you are concerned that the extra fix-up code adds too much to your Python script.

Richs-MBP:tmp randrews$ cat test.csv
1,2,3
1,
2
1,2,
,,,
Richs-MBP:tmp randrews$ awk 'BEGIN {FS=","}; {print $1","$2","$3","$4","$5}' < test.csv
1,2,3,,
1,,,,
2,,,,
1,2,,,
,,,,

I have a different take on the solution. Let pandas take care of creating the table and deleting None values, and let us take care of writing a proper tokenizer.

Tokenizer

def tokenize(line):
    # positions of all double quotes in the line
    idx = [x for x, v in enumerate(line) if v == '"']
    if len(idx) % 2 != 0:
        # drop an unmatched trailing quote
        idx = idx[:-1]
    memory = {}
    for i in range(0, len(idx), 2):
        # replace each quoted substring with a comma-free placeholder
        # of the same length, so later indices stay valid
        val = line[idx[i]:idx[i+1]+1]
        key = "_" * (len(val) - 1) + "{0}".format(i)
        memory[key] = val
        line = line.replace(memory[key], key, 1)
    # split on commas, then restore the quoted substrings
    return [memory.get(token, token) for token in line.split(",")]

Test cases for the tokenizer:

print (tokenize("1,2,3,4,5"))
print (tokenize(",,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello,,,, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello, World!\",5,6"))
print (tokenize(",,3,\"Hello, World!\",5,6,,3,\"Hello,,5,6"))

Output:

['1', '2', '3', '4', '5']
['', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello,,,, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello, World!"', '5', '6']
['', '', '3', '"Hello, World!"', '5', '6', '', '3', '"Hello', '', '5', '6']

Putting the tokenizer into action:

import numpy as np
import pandas as pd

with open("test1.csv", "r") as fp:
    lines = fp.readlines()

lines = list(map(lambda x: tokenize(x.strip()), lines))
df = pd.DataFrame(lines).replace(np.nan, '')

Advantage:

Now we can tweak the tokenizer function as per our needs.

For me this was solved by adding usecols to the pd.read_csv() command:

usecols=['My_Column_1','My_Column_2',...]
