Pandas read_csv not reading all rows in file

Question

I am trying to read a CSV file with pandas. The file has 14993 lines after the header.

data = pd.read_csv(filename, usecols=['tweet', 'Sentiment'])
print(len(data))

It prints 14900. If I add one line to the end of the file, it prints 14901, so this is not caused by a memory limit or anything similar. I also tried "error_bad_lines", but nothing changed.

Answer 1

Judging by the names of your headers, one can suspect that you have free text, which can easily trip up any CSV parser. In any case, here's a version that lets you track down inconsistencies in the CSV, or at least gives a hint of what to look for, and then puts the data into a dataframe.

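import csv
import pandas as pd

with open('file.csv', newline='') as fc:
    creader = csv.reader(fc)  # add settings (delimiter, quotechar, ...) as needed
    rows = list(creader)

# check consistency of rows
print(len(rows))                  # number of rows the csv module parsed
print(set(len(r) for r in rows))  # distinct field counts; ideally a single value
# rows whose field count differs from the header's are the suspects
expected_nbr = len(rows[0])
print([(i, r) for i, r in enumerate(rows) if len(r) != expected_nbr])
# find bogus lines and modify in memory, or change csv and re-read it.

# assuming the first row holds the headers...
# note: zip() stops at the shortest row, so fix bogus rows first
columns = list(zip(*rows))
df = pd.DataFrame({k: v for k, *v in columns if k in ['tweet', 'Sentiment']})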

If the dataset is really big, the code should be rewritten to use only generators (which is not that hard to do).
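A minimal sketch of what such a streaming check could look like, so that the whole file never has to sit in memory. The function name `row_length_report` and the in-memory sample data are made up for illustration; with a real file you would pass the open file object instead of the `io.StringIO` sample.

```python
import csv
import io
from collections import Counter

def row_length_report(lines, expected=None):
    """Stream over CSV lines and report rows with an unexpected field count.

    `lines` is any iterable of text lines (for example an open file).
    Returns (total rows, Counter of field counts, list of (index, row) suspects).
    """
    reader = csv.reader(lines)  # add settings as needed
    lengths = Counter()
    suspects = []
    total = 0
    for i, row in enumerate(reader):
        if expected is None:
            expected = len(row)  # assumption: the first row is the header
        total += 1
        lengths[len(row)] += 1
        if len(row) != expected:
            suspects.append((i, row))
    return total, lengths, suspects

# hypothetical sample: the third parsed row has one field too many
sample = io.StringIO('tweet,Sentiment\n"hello, world",1\nbroken,row,extra\nok,0\n')
total, lengths, suspects = row_length_report(sample)
print(total, dict(lengths), suspects)
```

Only the suspect rows are kept, so memory use stays small as long as most of the file is well formed.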

One thing to remember with a technique like this: every value is read as a string, so numeric columns should be cast to a suitable datatype if needed. That becomes self-evident as soon as one attempts to do math on a dataframe filled with strings.
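As a rough illustration of that casting step, here is one way to do it with `pd.to_numeric`. The sample values are invented, and whether `errors='coerce'` (unparseable entries become NaN) is the right choice depends on your data.

```python
import pandas as pd

# hypothetical data: everything arrives as strings, as it would from csv.reader
df = pd.DataFrame({
    'tweet': ['good', 'bad', 'meh'],
    'Sentiment': ['1', '0', 'n/a'],
})

# convert the numeric column; values that cannot be parsed become NaN
df['Sentiment'] = pd.to_numeric(df['Sentiment'], errors='coerce')

print(df['Sentiment'].dtype)
print(df['Sentiment'].isna().sum())  # how many values failed to convert
```

Counting the NaN values afterwards is a quick way to see how many entries were not valid numbers.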
