Pandas.read_csv() ignore bad lines/rows containing FEWER fields. Text file

I am trying to read this huge text file: https://www.dropbox.com/s/3ikikw8bxde6y1i/TCAD_SPECIAL%20EXPORT_2019_20200409.zip?dl=0 (if you download the zip, the file is Special_ARB.txt; not necessary for my question, imo).

I am running this code (adding error_bad_lines=False) to ignore lines with more-than-expected fields, which works well:

import pandas as pd
pd.read_csv(r'~/Special_ARB.txt', sep="|",
            header=None, encoding='cp1252', error_bad_lines=False)

The problem is that read_csv() crashes when a line has only 1 field, with the following error:

Too many columns specified: expected 77 and found 1

Is there a way to tell python/pandas to ignore this error? It does not tell me which line is causing it. There are more than a million rows, so I can't just find it on my own.

  • I tried a for loop to read line by line and figure it out from there, but the data is so large that python crashed.
  • The number of columns is 77, which is correctly identified by pandas when running the code, so I don't think that's the issue.

Thanks,

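import pandas as pd

# Note: on pandas >= 1.3, use on_bad_lines='skip' instead of error_bad_lines=False
try:
    df = pd.read_csv(r'~/Special_ARB.txt', sep="|", header=None,
                     encoding='cp1252', error_bad_lines=False)
except ValueError as err:  # pandas.errors.ParserError is a subclass of ValueError
    print(f"read_csv failed: {err}")  # replace with your own handling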

This should work for in-memory datasets; for large datasets, you can use chunking: https://stackoverflow.com/a/59331754/9379924
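
Since the error message does not say which line is short, one option is to scan the file as a stream before handing it to pandas. The following is only a rough sketch, not something from the original thread: the helper names find_short_lines and write_clean_copy are hypothetical, and it assumes that no field contains a quoted "|" or an embedded newline, so counting separators gives the field count. It holds one line in memory at a time, so it should not run out of memory on a file with a million rows.

```python
def count_fields(line, sep="|"):
    """Naive field count: number of separators plus one (ignores quoting)."""
    return line.rstrip("\r\n").count(sep) + 1


def find_short_lines(path, sep="|", expected=77, encoding="cp1252"):
    """Yield (line_number, field_count) for every line with fewer than
    `expected` fields. Line numbers start at 1."""
    with open(path, encoding=encoding, newline="") as fh:
        for lineno, line in enumerate(fh, start=1):
            n_fields = count_fields(line, sep)
            if n_fields < expected:
                yield lineno, n_fields


def write_clean_copy(src, dst, sep="|", expected=77, encoding="cp1252"):
    """Copy `src` to `dst`, dropping lines with fewer than `expected` fields.
    Returns the number of dropped lines. Lines with MORE fields are kept,
    so pandas can still skip those itself."""
    dropped = 0
    with open(src, encoding=encoding, newline="") as fin, \
         open(dst, "w", encoding=encoding, newline="") as fout:
        for line in fin:
            if count_fields(line, sep) < expected:
                dropped += 1
            else:
                fout.write(line)
    return dropped
```

Printing list(find_short_lines("Special_ARB.txt")) would show which line numbers are short (blank lines are reported too, since they count as one field). After write_clean_copy("Special_ARB.txt", "Special_ARB_clean.txt"), the cleaned file can be passed to pd.read_csv with the same sep, header and encoding arguments as before. The output file name here is just an example.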
