Load a column from a TSV file into a Python list
I want to load the values from the "category" column into a pandas df. This is my TSV file:
Tagname text category
j245qzx_8 hamburger toppings f
h833uio_7 side of fries f
d423jin_2 milkshake combo d
This is my code:
import pandas as pd

with open(filename, 'r') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    categoryColumn.append(line)
However, I get a UnicodeDecodeError on the line df = pd.read_csv(f, sep='\t') and my code stops there:
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 787, in __init__
self._make_engine(self.engine)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/anaconda3/lib/python3.6/site-packages/pandas/io/parsers.py", line 1708, in __init__
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
File "pandas/_libs/parsers.pyx", line 2101, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x89 in position 898: invalid start byte
Any ideas why, or how to fix this? There don't seem to be any special characters in my TSV, so I'm not sure what's causing this or what to do.
The fix
I just read this SO post, and I think I see what's wrong.
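You're getting a file handle with Python's open() and passing that to pandas's read_csv(). Because the file is opened in text mode, open() is what decodes the bytes, so the encoding that open() uses is the one that matters. You didn't specify one, and your traceback shows that UTF-8 was used, which cannot decode the byte 0x89.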
So, try setting the encoding in open(), like this:
import pandas as pd

with open(filename, 'r', encoding='windows-1252') as f:
    df = pd.read_csv(f, sep='\t')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    # Append to the list, not to the Series being iterated
    categoryList.append(line)
Or, don't use open() at all:
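import pandas as pd

# Passing the path lets pandas open and decode the file itself
df = pd.read_csv(filename, sep='\t', encoding='windows-1252')
categoryColumn = df["category"]
categoryList = []
for line in categoryColumn:
    # Append to the list, not to the Series being iterated
    categoryList.append(line)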
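If all you need in the end is a plain list, the same encoding argument also works without pandas. Here is a minimal sketch using the standard library's csv module instead of read_csv(). The helper name load_column and the stand-in file are made up for illustration, and windows-1252 is only the guessed encoding, so your real file may need a different one:

```python
import csv
import os
import tempfile


def load_column(path, column, encoding="windows-1252"):
    """Return every value of `column` from a tab-separated file as a list."""
    # newline="" is the setting the csv module documentation recommends
    with open(path, "r", encoding=encoding, newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        return [row[column] for row in reader]


# Stand-in for the real file: the sample rows plus one 0x89 byte,
# which is not valid UTF-8 (where the byte sits is an assumption)
sample = (
    b"Tagname\ttext\tcategory\n"
    b"j245qzx_8\thamburger toppings\tf\n"
    b"h833uio_7\tside of fries \x89\tf\n"
    b"d423jin_2\tmilkshake combo\td\n"
)
with tempfile.NamedTemporaryFile(suffix=".tsv", delete=False) as tmp:
    tmp.write(sample)
    sample_path = tmp.name

category_list = load_column(sample_path, "category")
os.remove(sample_path)
print(category_list)
```

With pandas, df["category"].tolist() gives the same kind of list in one call, so the explicit loop is optional there too.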
Some of the back story
I echoed \x89 onto the end of your sample, then ran Python's chardetect utility, and it suggests the file is Windows-1252:
% echo -e '\x89' >> sample.csv
% cat sample.csv
Tagname text category
j245qzx_8 hamburger toppings f
h833uio_7 side of fries f
d423jin_2 milkshake combo d
�
% which chardetect
/Library/Frameworks/Python.framework/Versions/3.9/bin/chardetect
% chardetect sample.csv
sample.csv: Windows-1252 with confidence 0.73
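To see why that single byte trips up UTF-8 but not Windows-1252, you can decode it directly in Python. This is only a small illustration of the two codecs. It doesn't prove your file really is Windows-1252, since chardetect's answer is a guess with a confidence score:

```python
raw = b"\x89"  # the byte named in the traceback

# UTF-8 rejects 0x89 as the first byte of a character
try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# Windows-1252 maps every printable byte to one character; 0x89 is U+2030
decoded = raw.decode("windows-1252")

print(utf8_error)
print(repr(decoded))
```

The UTF-8 attempt fails with the same "invalid start byte" reason shown in the traceback, while the Windows-1252 decode returns a per mille sign.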