Make pandas.read_csv() ignore junk at the start of a csv file?

Question

I've got some junk at the start of my csv file that prevents me from selecting the first column of my dataframe by name.

Example:

In[1]: df = pd.read_csv('file:inputdata.csv', usecols=[0], nrows=1)

In[2]: df
Out[2]:
        TAB
0  10-LV_Non

In[3]: df['TAB']
Out[3]: <snip> KeyError: 'TAB'

I found the junk by reading the file with open():

In[4]: with open('inputdata.csv', 'rb') as f:
           print(f.read(7))
Out[4]: b'\xef\xbb\xbfTAB,'

EDIT: '\xef\xbb\xbf' is three bytes of junk. 'TAB' is the name of the first column.

Is there a way to make pandas.read_csv() ignore junk like this (if present) at the start of the csv file?

NB The csv files are exported from a proprietary system, so I can't control their format.

UPDATE: Here's my solution, based on Mike Müller's answer:

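import re
import pandas as pd

# Open in binary mode so that read(1) and seek() work on single bytes
with open('inputdata.csv', 'rb') as f:
    # Skip past any leading bytes that aren't ASCII letters, digits or '_'
    byte = f.read(1)
    while byte and re.match(rb'[a-zA-Z0-9_]', byte) is None:
        byte = f.read(1)
    # Seek back one byte so the first header character isn't lost
    if byte:
        f.seek(-1, 1)
    # Read the file from the current position
    df = pd.read_csv(f, usecols=['TAB'])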

Answer 1

It's unclear to me exactly what format the "junk" takes, but there are a couple of options you can use.

pandas.read_csv takes a filepath_or_buffer argument:

filepath_or_buffer : string or file handle / StringIO

It follows that if you open a file object, read past the junk, and then pass the file object to read_csv, it should be OK.
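As a minimal sketch of that idea: the snippet below uses an in-memory io.BytesIO buffer as a stand-in for the real file, and the VALUE column and the junk variable are made up for illustration. It assumes the junk is a known, fixed byte prefix, as in the question.

```python
import io
import pandas as pd

# Hypothetical stand-in for the exported file: three junk bytes, then the CSV
raw = b'\xef\xbb\xbfTAB,VALUE\n10-LV_Non,1\n'
junk = b'\xef\xbb\xbf'

buf = io.BytesIO(raw)
# Read past the junk only if it is actually there; otherwise rewind
if buf.read(len(junk)) != junk:
    buf.seek(0)

# read_csv continues from the buffer's current position
df = pd.read_csv(buf)
print(df['TAB'])
```

The same pattern should apply to a handle returned by open('inputdata.csv', 'rb').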

The skiprows argument skips rows:

skiprows : list-like or integer, default None

Thus, if the junk sits on one or more rows of its own, you can skip those rows.
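A small sketch of skiprows, assuming the junk occupies whole lines before the header (which is not quite the situation in the question, where the junk shares a line with the header). The file contents here are invented for illustration.

```python
import io
import pandas as pd

# Hypothetical file contents: two junk lines before the real header
text = 'exported by some system\n#####\nTAB,VALUE\n10-LV_Non,1\n'

# skiprows=2 drops the first two lines, so 'TAB,VALUE' becomes the header
df = pd.read_csv(io.StringIO(text), skiprows=2)
print(df['TAB'])
```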

Answer 2

Something like this could work:

import pandas as pd

with open('inputdata.csv', 'rb') as f:
    # Check only the three junk bytes, so the 'TAB' header itself is kept
    if f.read(3) != b'\xef\xbb\xbf':
        # No junk found: rewind so nothing is skipped
        f.seek(0)
    df = pd.read_csv(f, usecols=[0], nrows=1)

Just read the first three bytes. If they are good, i.e. not equal to the bytes you don't want, go back to the beginning of the file with seek(0); otherwise carry on reading from byte position 3, which skips the offending bytes but keeps the TAB header.
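One more option that neither answer mentions, so treat it as a suggestion rather than part of the original thread: the bytes EF BB BF are the UTF-8 byte order mark (BOM), and Python's 'utf-8-sig' codec drops a leading BOM if one is present. Assuming the rest of the file is UTF-8, passing that codec name to read_csv via its encoding parameter may be enough. The in-memory data below is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical stand-ins for an export with and without the leading BOM
with_bom = b'\xef\xbb\xbfTAB,VALUE\n10-LV_Non,1\n'
without_bom = b'TAB,VALUE\n10-LV_Non,1\n'

# 'utf-8-sig' strips the BOM when present and is harmless when it is absent
df_bom = pd.read_csv(io.BytesIO(with_bom), encoding='utf-8-sig')
df_plain = pd.read_csv(io.BytesIO(without_bom), encoding='utf-8-sig')
print(df_bom['TAB'])
```

Because the codec only removes a BOM that is actually there, this also covers the "if present" part of the question.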
