Load a column from a TSV file into a Python list

I want to load the values from the "category" column into a pandas DataFrame, and from there into a list. This is my TSV file:

Tagname   text  category
j245qzx_8   hamburger toppings   f
h833uio_7   side of fries   f
d423jin_2   milkshake combo   d

This is my code:

with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
    categoryColumn = df["category"]

    categoryList = []
    for line in categoryColumn:
        categoryColumn.append(line)

However, I get a UnicodeDecodeError on the line df = pd.read_csv(f, sep='\t') and my code stops there:

File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte

Any ideas why, or how to fix this? There don't seem to be any special characters in my TSV, so I'm not sure what's causing this or what to do.

The fix

I just read this SO post, and I think I see what's wrong.

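You're getting a file handle from Python's open() and passing it to pandas's read_csv(). With a text-mode handle, it is open() that decides how the file's bytes are decoded, not read_csv(). No encoding was given, so open() fell back to the platform default, which judging by the traceback was UTF-8, and byte 0x89 is not a valid start byte in UTF-8.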
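To see the difference between the two codecs, it may help to decode the offending byte on its own. This is a minimal sketch using only the standard library; the variable names are illustrative, and Windows-1252 is a guess at the file's real encoding rather than something the traceback proves:

```python
raw = b"\x89"  # the byte value reported in the traceback

# UTF-8 rejects 0x89 as the first byte of a character:
# bytes in the range 0x80-0xBF are only valid as continuation bytes.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Windows-1252 is a single-byte encoding; it maps 0x89 to the per-mille sign.
as_cp1252 = raw.decode("windows-1252")

print(utf8_ok)          # False
print(repr(as_cp1252))  # '‰'
```

So a single stray character typed or pasted on a Windows machine is enough to make a UTF-8 read fail, even when the file looks like plain ASCII at a glance.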

So, try setting the encoding in open(), like this:

import pandas as pd

with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')

categoryColumn = df["category"]

# Append to the list, not to the column being iterated over
categoryList = []
for value in categoryColumn:
    categoryList.append(value)

Or, don't use open() at all and pass the encoding straight to read_csv():

import pandas as pd

df = pd.read_csv(filename, sep='\t', encoding='windows-1252')

# Series.tolist() builds the Python list without an explicit loop
categoryList = df["category"].tolist()
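If the end goal is only a plain list, the same idea can be sketched without pandas, using the standard library's csv module instead. This is a different technique from the answer above, not a replacement for it. The helper name load_column and the sample bytes are made up for illustration (the stray 0x89 is placed arbitrarily in one text field), and it assumes the columns are separated by real tab characters:

```python
import csv
import os
import tempfile

def load_column(path, column, encoding="windows-1252"):
    """Return the values of one column of a tab-separated file as a list."""
    # newline="" is what the csv module documentation recommends for files it reads.
    with open(path, "r", encoding=encoding, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [row[column] for row in reader]

# Throwaway sample file; the 0x89 byte is invented to mimic the problem.
sample = (
    b"Tagname\ttext\tcategory\n"
    b"j245qzx_8\thamburger toppings\tf\n"
    b"h833uio_7\tside of fries \x89\tf\n"
    b"d423jin_2\tmilkshake combo\td\n"
)
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as tmp:
    tmp.write(sample)

category_list = load_column(tmp.name, "category")
os.remove(tmp.name)

print(category_list)  # ['f', 'f', 'd']
```

The encoding argument plays the same role here as in the pandas versions: it is passed to open(), which does the decoding before the parser ever sees the text.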

Some of the back story

I echoed the byte \x89 onto the end of your sample, then ran the chardetect utility (installed with Python's chardet package), and it suggests the encoding is Windows-1252:

% echo -e '\x89' >> sample.csv

% cat sample.csv 
Tagname text    category
j245qzx_8       hamburger toppings      f
h833uio_7       side of fries   f
d423jin_2       milkshake combo d
�

% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect

% chardetect sample.csv 
sample.csv: Windows-1252 with confidence 0.73
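Since the file does not appear to contain any special characters, it can also help to locate the offending bytes directly. The following is a small sketch, not part of the original experiment; find_non_ascii is a hypothetical helper and the sample bytes are made up to mirror the session above:

```python
def find_non_ascii(data: bytes):
    """Return (offset, byte value) pairs for every byte outside the ASCII range."""
    return [(i, b) for i, b in enumerate(data) if b >= 0x80]

# Header, one data row, then the stray byte on its own line
sample = b"Tagname\ttext\tcategory\nd423jin_2\tmilkshake combo\td\n\x89\n"

hits = find_non_ascii(sample)
for offset, value in hits:
    print(f"offset {offset}: 0x{value:02x}")  # offset 50: 0x89
```

For the real file, the bytes would come from open(filename, 'rb').read(). The reported offsets show where to look, though they need not match the position quoted in the traceback exactly.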
