
Pandas read_csv not reading all rows in file

I am trying to read a csv file with pandas. The file has 14993 lines after the header.

data = pd.read_csv(filename, usecols=['tweet', 'Sentiment'])
print(len(data))

It prints 14900, and if I add one line to the end of the file it becomes 14901 rows, so it is not a memory limit or anything like that. I also tried `error_bad_lines`, but nothing changed.
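The mismatch described above is typical of free-text fields containing embedded newlines: pandas joins the physical lines of a quoted field into one record, so it parses fewer rows than the file has lines. A minimal sketch reproducing that (the sample data is invented for illustration):

```python
import io
import pandas as pd

# Hypothetical sample: the "tweet" field of the first record contains an
# embedded newline inside a quoted value, so it spans two physical lines.
sample = 'Sentiment,tweet\n1,"first\nsecond"\n0,plain\n'

# Physical lines after the header: 3.
raw_lines = sample.count('\n') - 1

# pandas joins the quoted field, so it parses only 2 records.
data = pd.read_csv(io.StringIO(sample), usecols=['tweet', 'Sentiment'])
print(raw_lines, len(data))  # prints: 3 2
```

Comparing the raw line count with `len(data)` like this is a quick way to tell whether quoted multi-line fields explain the missing rows.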

From the names of your headers one can suspect that you have free text, which can easily trip up any csv parser. In any case, here's a version that makes it easy to track down inconsistencies in the csv, or at least gives a hint of what to look for, and then puts it into a dataframe.

import csv
import pandas as pd

with open('file.csv') as fc:
    creader = csv.reader(fc)  # add settings (delimiter, quotechar, ...) as needed
    rows = [r for r in creader]

# check consistency of rows
print(len(rows))
print(set(len(r) for r in rows))  # more than one length means malformed rows
bogus_nbr = 0  # set this to whichever length from the set above looks wrong
print(tuple((i, r) for i, r in enumerate(rows) if len(r) == bogus_nbr))
# find bogus lines and modify in memory, or change the csv and re-read it.

# assuming there are headers...
columns = list(zip(*rows))
df = pd.DataFrame({k: v for k, *v in columns if k in ['tweet', 'Sentiment']})

If the dataset is really big, the code should be rewritten to use only generators (which is not that hard to do).

The only thing not to forget when using a technique like this is that if you have numbers, those columns should be recast to a suitable datatype when needed; but that becomes self-evident the moment one attempts to do math on a dataframe filled with strings.
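A minimal sketch of that recast, using the question's `Sentiment` column name on invented sample values:

```python
import pandas as pd

# After the manual csv parse, every column holds strings.
df = pd.DataFrame({'Sentiment': ['1', '0', '1'], 'tweet': ['a', 'b', 'c']})

# Recast the numeric column before doing any math on it.
df['Sentiment'] = pd.to_numeric(df['Sentiment'])
print(df['Sentiment'].sum())  # prints: 2
```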
