Make pandas.read_csv() ignore junk at the start of the csv files?

I've got some junk at the start of my csv file that prevents me from selecting the first column of my dataframe by name.

Example:

In[1]: df = pd.read_csv('file:inputdata.csv', usecols=[0], nrows=1)

In[2]: df
Out[2]:
        TAB
0  10-LV_Non

In[3]: df['TAB']
Out[3]: <snip> KeyError: 'TAB'

I found the junk by reading the file with open():

In[4]: with open('inputdata.csv', 'rb') as f:
           print(f.read(7))
Out[4]: b'\xef\xbb\xbfTAB,'

EDIT: '\xef\xbb\xbf' is three bytes of junk. 'TAB' is the name of the first column.

Is there a way to make pandas.read_csv() ignore junk like this (if present) at the start of the csv file?

NB: The csv files are exported from a proprietary system, so I can't control their format.

UPDATE: Here's my solution, based on Mike Müller's answer:

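import re
import pandas as pd

# Binary mode, so that tell() and seek() operate on plain byte offsets
with open('inputdata.csv', 'rb') as f:
    # Skip past any bytes that aren't text (and stop at end of file)
    byte = f.read(1)
    while byte and re.match(rb'[a-zA-Z0-9_]', byte) is None:
        byte = f.read(1)
    # Seek back one byte so the first header character is not lost
    if byte:
        f.seek(f.tell() - 1)
    # Read the file
    df = pd.read_csv(f, usecols=['TAB'])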

It's unclear to me exactly what format the "junk" has, but there are several options.


pandas.read_csv takes a filepath_or_buffer:

filepath_or_buffer : string or file handle / StringIO

It follows that if you open a file object, read past the junk, and then pass the file object to read_csv, it should be OK.
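
A minimal sketch of reading past the junk before handing the handle to read_csv. It assumes the junk is a fixed three-byte prefix, as in the question; io.BytesIO stands in for a file opened in binary mode, and the sample data (including the VALUE column) is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample: three junk bytes followed by an ordinary CSV
raw = b'\xef\xbb\xbfTAB,VALUE\n10-LV_Non,1\n'

# BytesIO stands in for open('inputdata.csv', 'rb')
buffer = io.BytesIO(raw)

# Read past the junk; the handle is now positioned at the 'T' of 'TAB'
buffer.read(3)

# read_csv consumes the handle from its current position
df = pd.read_csv(buffer)

print(df['TAB'].tolist())
```

With the junk consumed first, the first column can be selected by name as df['TAB'].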


The skiprows argument skips rows:

skiprows : list-like or integer, default None

Thus you can skip the junk's row(s), provided the junk occupies whole rows of its own.
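
As a rough illustration of skiprows, the sketch below assumes a made-up file in which two junk lines precede the real header; io.StringIO stands in for the file on disk.

```python
import io
import pandas as pd

# Hypothetical sample: two junk lines come before the real header row
text = "exported by some system\n----\nTAB,VALUE\n10-LV_Non,1\n"

# skiprows=2 drops the first two lines, so 'TAB,VALUE' is parsed as the header
df = pd.read_csv(io.StringIO(text), skiprows=2)

print(df['TAB'].tolist())
```

In the question's file the stray bytes sit on the header line itself, so skiprows alone would not remove them.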

Something like this could work:

import pandas as pd

with open('inputdata.csv', 'rb') as f:
    # Check only the three junk bytes, so the 'TAB' header itself is not consumed
    if f.read(3) != b'\xef\xbb\xbf':
        # No junk present: rewind so nothing is skipped
        f.seek(0)
    df = pd.read_csv(f, usecols=[0], nrows=1)

Just read the first three bytes. If they are good, i.e. not equal to the bytes you don't want, go back to the beginning of the file with seek(0); otherwise keep reading from byte position 3, which skips the offending bytes but leaves the TAB header intact.
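
The three bytes shown in the question, b'\xef\xbb\xbf', are the UTF-8 byte order mark (available in Python as codecs.BOM_UTF8). Assuming the exported files are UTF-8 encoded, another option is to let the decoder drop the mark by passing encoding='utf-8-sig' to read_csv. The helper name read_csv_ignoring_bom and the sample data below are hypothetical; this is a sketch, not a tested solution for the proprietary export.

```python
import codecs
import io
import pandas as pd


def read_csv_ignoring_bom(source, **kwargs):
    """Hypothetical helper: read a CSV, dropping a leading UTF-8 BOM if present."""
    # The 'utf-8-sig' codec decodes UTF-8 and skips a leading BOM when there is one
    return pd.read_csv(source, encoding='utf-8-sig', **kwargs)


# Hypothetical in-memory samples standing in for inputdata.csv
with_bom = io.BytesIO(codecs.BOM_UTF8 + b'TAB,VALUE\n10-LV_Non,1\n')
without_bom = io.BytesIO(b'TAB,VALUE\n10-LV_Non,1\n')

df_a = read_csv_ignoring_bom(with_bom, usecols=['TAB'])
df_b = read_csv_ignoring_bom(without_bom, usecols=['TAB'])

print(df_a['TAB'].tolist(), df_b['TAB'].tolist())
```

Because the codec only strips the mark when it is present, the same call works for files with and without the leading bytes.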
