I'm having trouble with code in PyCharm. I'm trying to read a CSV file, but I'm getting a Unicode error saying that specific bytes at certain positions can't be decoded.
I am using PyCharm as my IDE, and the CSV file I'm using is from MS Excel. I've encoded the CSV as UTF-8. I am trying to read the file using pandas. I want to be able to distinguish between objects and ints when I call df.info(), which is also why I didn't change the encoding to 'latin-1' or 'ISO...'.
My code looks like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
cols = ['sentiment','id','date','query_string','user','text']
df = pd.read_csv("trainingandtestdata\\training.1600000.processed.noemoticon.csv", header=None,
names=cols, encoding='utf-8')#low_memory=False dtype='unicode' encoding='latin1'
df.head()
df.info()
df.sentiment.value_counts()
How do I fix the "can't decode bytes in position xxxx to xxxx" error? My error looks like this:
"C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\Scripts\python.exe"
"C:/Users/dashg/PycharmProjects/Twitter Sentiment/Reviewer.py"
Traceback (most recent call last):
  File "C:/Users/dashg/PycharmProjects/Twitter Sentiment/Reviewer.py", line 6, in <module>
    df = pd.read_csv("trainingandtestdata\\training.1600000.processed.noemoticon.csv", header=None,
    names=cols, encoding='utf-8')#low_memory=False dtype='unicode' encoding='latin1'
  File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 676, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 454, in _read
    data = parser.read(nrows)
  File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 1133, in read
    ret = self._engine.read(nrows)
  File "C:\Users\dashg\PycharmProjects\Twitter Sentiment\venv\lib\site-packages\pandas\io\parsers.py", line 2037, in read
    data = self._reader.read(nrows)
  File "pandas\_libs\parsers.pyx", line 860, in pandas._libs.parsers.TextReader.read
  File "pandas\_libs\parsers.pyx", line 875, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas\_libs\parsers.pyx", line 929, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 916, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas\_libs\parsers.pyx", line 2063, in pandas._libs.parsers.raise_parser_error
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 51845-51846: invalid continuation byte
Process finished with exit code 1
Your file is not UTF-8 encoded, yet you are passing encoding='utf-8' to the read_csv method. Use a different encoding to solve the problem, such as 'latin' or 'ISO-8859-1'. I refer you to this link for help.
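As a rough illustration of the suggestion above, and of why it should not interfere with telling ints from objects in df.info(): the encoding argument only controls how bytes are turned into text, while pandas infers column types separately. The sketch below uses a made-up two-row sample (not the real training file) with the same six column names as the question; the byte 0xE9 in it is an assumption standing in for whatever bytes break the real file.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the question's six-column layout.
# 0xE9 is "é" in Latin-1, but followed by a space it is not valid UTF-8.
raw = (
    b'0,1,Mon Apr 06,NO_QUERY,alice,"caf\xe9 time"\n'
    b'4,2,Tue Apr 07,NO_QUERY,bob,"hello"\n'
)
cols = ['sentiment', 'id', 'date', 'query_string', 'user', 'text']

# Decoding as UTF-8 fails on this sample, as in the traceback above.
try:
    pd.read_csv(io.BytesIO(raw), header=None, names=cols, encoding='utf-8')
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Latin-1 maps every possible byte to a character, so decoding cannot fail.
df = pd.read_csv(io.BytesIO(raw), header=None, names=cols, encoding='latin-1')

# Numeric columns are still inferred as integers; only text columns hold strings.
print(df.dtypes)
```

One caveat: because Latin-1 accepts any byte, a file that is really in some other encoding (cp1252, for example) will load without error but may show wrong characters in the text column, so it is worth spot-checking a few rows.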
Worst-case scenario, if none of this works, you can read the file in 'rb' mode (open(file, 'rb')) and parse it yourself by splitting each line of data on the CSV delimiter!
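A minimal sketch of that fallback, using only the standard library. The helper names find_bad_bytes and parse_rows are hypothetical, and the sample bytes are made up. Instead of splitting on the delimiter by hand, this version passes the decoded text to the csv module, since a plain split on commas would break quoted fields that contain a comma.

```python
import csv
import io


def find_bad_bytes(data, context=5):
    """Return (position, nearby bytes) of the first UTF-8 decode failure, or None."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError as exc:
        start = max(0, exc.start - context)
        return exc.start, data[start:exc.end + context]
    return None


def parse_rows(data, encoding='utf-8'):
    """Decode raw bytes, replacing undecodable bytes with U+FFFD, then parse as CSV."""
    text = data.decode(encoding, errors='replace')
    return list(csv.reader(io.StringIO(text, newline='')))


# Made-up sample; for a real file the bytes would come from something like:
#     with open(path, 'rb') as f:
#         raw = f.read()
raw = b'0,1,"caf\xe9 time"\n4,2,"hello, world"\n'

bad = find_bad_bytes(raw)
rows = parse_rows(raw)
print(bad)
print(rows)
```

Looking at the bytes around the reported position can also hint at the real encoding of the file, which is usually a better fix than replacing characters.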
I was having the same problem, but in my case the solution was really easy. My IDE is PyCharm 2020.1 and the .csv has ISO-8859-1 encoding. I had tried everything without luck, so I decided to check my IDE config. I went to:
It's better to save that CSV as an xlsx file and read it with pd.read_excel.