
Pandas: how to read csv with multiple lines on the same cell?

I have a CSV file that I am not able to read using read_csv. Opening the file in Sublime Text shows something like:

col1,col2,col3
text,2,3
more text,3,4
HELLO

THIS IS FUN
,3,4

As you can see, the text HELLO THIS IS FUN spans three lines, and pd.read_csv gets confused because it treats them as three new observations. How can I parse this correctly in Pandas?

Thanks!

It looks like you'll have to preprocess the data manually:

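with open('data.csv', 'r') as f:
    lines = f.read().splitlines()

processed = []
buffer = ''
for line in lines:
    buffer += line  # Append the current line to the buffer (no separator is inserted)
    c = buffer.count(',')  # Commas accumulated so far for the current record
    if c == 2:
        # Two commas means three fields, so the record is complete
        processed.append(buffer)
        buffer = ''
    elif c > 2:
        # This should never happen if the only defect is extra newlines
        raise ValueError('Too many fields in record: ' + buffer)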

This assumes that the only problem with your data is unwanted newlines. For example, if one row has 3 elements and the next has 2, then the row after that should either be blank or contain only 1 element. If it has 2 or more, i.e. a necessary newline is missing, an error is raised. You can accommodate this case with a minor modification if necessary.

Actually, it might be more efficient to remove the newlines instead, but it shouldn't matter unless you have a lot of data.
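The merging logic above can also be wrapped in a small reusable function. The following is only a sketch under some assumptions that are not in the original answer: the function name merge_broken_rows, the n_fields and joiner parameters, and the choice to join fragments with a space and skip blank lines are all illustrative. It also inherits the limitation of the comma-counting approach: no field may itself contain a comma, and a newline inside the last field of a row cannot be detected.

```python
def merge_broken_rows(lines, n_fields=3, joiner=' '):
    """Merge physical lines until each record has n_fields comma-separated fields.

    Hypothetical helper: assumes no field contains a comma and that
    stray newlines never occur inside the last field of a record.
    """
    expected_commas = n_fields - 1
    records = []
    buffer = ''
    for line in lines:
        piece = line.strip()
        if not piece:
            continue  # A blank line carries no data, so skip it
        # Insert the joiner only between two fragments of text,
        # never right before or right after a comma
        if buffer and not piece.startswith(',') and not buffer.endswith(','):
            buffer += joiner
        buffer += piece
        commas = buffer.count(',')
        if commas == expected_commas:
            records.append(buffer)  # The record is complete
            buffer = ''
        elif commas > expected_commas:
            raise ValueError(f'too many fields in record: {buffer!r}')
    if buffer:
        raise ValueError(f'incomplete record at end of input: {buffer!r}')
    return records


sample = [
    'col1,col2,col3',
    'text,2,3',
    'more text,3,4',
    'HELLO',
    '',
    'THIS IS FUN',
    ',3,4',
    '',
]
rows = merge_broken_rows(sample)
```

The repaired rows can then be handed to pandas, for example with pd.read_csv(io.StringIO('\n'.join(rows))), since read_csv accepts any file-like object.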

