Pandas: how to read csv with multiple lines on the same cell?

Question

I have a csv that I am not able to read using read_csv. Opening the csv with Sublime Text shows something like:

col1,col2,col3
text,2,3
more text,3,4
HELLO

THIS IS FUN
,3,4

As you can see, the text HELLO THIS IS FUN takes three lines, and pd.read_csv is confused as it thinks these are three new observations. How can I parse that correctly in Pandas?

Thanks!

Answer 1

It looks like you'll have to preprocess the data manually:

with open('data.csv', 'r') as f:
    lines = f.read().splitlines()

processed = []
buffer = ''
for line in lines:
    buffer += line  # Append the current line to the buffer
    c = buffer.count(',')  # Commas seen so far in the pending row
    if c == 2:
        processed.append(buffer)  # Row is complete (3 columns), keep the merged text
        buffer = ''
    elif c > 2:
        raise ValueError('Too many fields: ' + buffer)  # This should never happen

This assumes that the only problem in your data is unwanted newlines, so every row still has exactly two commas once its pieces are joined. For example, if one line holds a complete 3-element row and the next line holds only 2 elements, then the following lines must either be blank or supply only the 1 missing element. If a line supplies more than that, i.e. a necessary newline is missing, an error is raised. You can accommodate that case, if necessary, with a minor modification.
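The comma-counting idea above can be sketched as a reusable function. This is only a sketch under stated assumptions: merge_broken_rows and n_cols are illustrative names, no field may contain a literal comma, and the pieces of a broken cell are joined with a single space, which is a choice made here rather than something the answer specifies.

```python
def merge_broken_rows(lines, n_cols=3):
    """Merge physical lines until each row holds n_cols - 1 commas.

    Hypothetical helper: assumes no field contains a literal comma.
    """
    processed = []
    buffer = ''
    for line in lines:
        buffer += line + ' '  # keep a space where the newline used to be
        commas = buffer.count(',')
        if commas == n_cols - 1:
            # Collapse runs of whitespace inside each field, then rebuild the row.
            fields = [' '.join(field.split()) for field in buffer.split(',')]
            processed.append(','.join(fields))
            buffer = ''
        elif commas > n_cols - 1:
            # A necessary newline is missing, so the row has too many fields.
            raise ValueError('too many fields: ' + buffer.strip())
    if buffer.strip():
        # Leftover text that never reached the expected comma count.
        raise ValueError('incomplete final row: ' + buffer.strip())
    return processed


sample = [
    'col1,col2,col3',
    'text,2,3',
    'more text,3,4',
    'HELLO',
    '',
    'THIS IS FUN',
    ',3,4',
]
rows = merge_broken_rows(sample)
print(rows[-1])
```

The resulting rows can then be handed to pandas, for example by joining them with newlines and passing the text to pd.read_csv through io.StringIO.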

Actually, it might be more efficient to remove the unwanted newlines from the raw text directly instead of rebuilding the rows line by line, but it shouldn't matter unless you have a lot of data.
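One possible reading of the newline-removal remark is to work on the whole file text with a regular expression rather than line by line. The sketch below is an interpretation, not the answerer's code: strip_inner_newlines and n_cols are hypothetical names, and it assumes that no field contains a literal comma and that the last field of a row is never split across lines. No claim is made here about which variant is faster.

```python
import re


def strip_inner_newlines(text, n_cols=3):
    """Return one string per row with newlines inside a row removed.

    Hypothetical helper: assumes n_cols >= 2, no commas inside fields,
    and a last field that always ends on its own line.
    """
    # n_cols - 1 comma-terminated fields (which may span several lines),
    # followed by a last field that stops at the end of its line.
    record = re.compile(r'(?:[^,]*,){%d}[^,\n]*' % (n_cols - 1))
    rows = []
    for match in record.finditer(text):
        # Collapse newlines and repeated spaces inside each field.
        fields = [' '.join(field.split()) for field in match.group(0).split(',')]
        rows.append(','.join(fields))
    return rows


raw = 'col1,col2,col3\ntext,2,3\nmore text,3,4\nHELLO\n\nTHIS IS FUN\n,3,4\n'
cleaned = strip_inner_newlines(raw)
print('\n'.join(cleaned))
```

Unlike the line-by-line loop, this version silently skips text that never forms a complete row, so it fits best when the file is known to contain only unwanted newlines.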
