Decoding UTF-8 literals in a CSV file

Question

Does anyone know how I can turn b"it\\xe2\\x80\\x99s time to eat" into it's time to eat?

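More details and my code:

Hi everyone,

I am currently working with a CSV file whose rows are full of UTF-8 literals, for example:

b"it\\xe2\\x80\\x99s time to eat"

The end goal is to get something like this:

it's time to eat

To achieve this, I tried the following code: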

import pandas as pd

file_open = pd.read_csv("/Users/Downloads/tweets.csv")
file_open["text"]=file_open["text"].str.replace("b\'", "")
file_open["text"]=file_open["text"].str.encode('ascii').astype(str)
file_open["text"]=file_open["text"].str.replace("b\"", "")[:-1]
print(file_open["text"])

After running the code, the row I took as an example is printed as:

it\\xe2\\x80\\x99s time to eat

I tried to solve this by opening the CSV file with the following code:

file_open = pd.read_csv("/Users/Downloads/tweets.csv", encoding = "utf-8")

This prints the example row as follows:

b"it\\xe2\\x80\\x99s time to eat"

I also tried decoding the rows with this:

file_open["text"]=file_open["text"].str.decode('utf-8')

which gives me the following error:

AttributeError: Can only use .str accessor with string values, which use np.object_ dtype in pandas

Thank you very much in advance for your help.

Answer 1

b"it\\xe2\\x80\\x99s time to eat" suggests that your file contains escaped encodings.

In general, you can convert this to a proper Python 3 string with the following code:

x = b"it\\xe2\\x80\\x99s time to eat"
x = x.decode('unicode-escape').encode('latin1').decode('utf8')
print(x)     # it’s time to eat

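(The .encode('latin1') step works because Latin-1 maps each code point from U+0000 to U+00FF to the byte with the same value, so it turns the wrongly decoded characters back into the original UTF-8 bytes.)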
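To see what each stage of that chain does, it can be split into three separate steps. This is only an illustrative sketch; the names step1, step2 and step3 are not part of the original answer.

```python
# The file stores the UTF-8 bytes of ’ (U+2019) as literal backslash escapes.
raw = b"it\\xe2\\x80\\x99s time to eat"

# Step 1: interpret the \xNN escapes. Each one becomes a single code point
# in the range 0-255, so the text is readable but wrongly decoded.
step1 = raw.decode("unicode_escape")

# Step 2: Latin-1 maps code points 0-255 to identical byte values,
# which recovers the real UTF-8 byte sequence.
step2 = step1.encode("latin1")

# Step 3: decode those bytes as UTF-8.
step3 = step2.decode("utf8")

print(repr(step1))
print(step2)
print(step3)  # it’s time to eat
```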

So, if your strings are still escaped after using pd.read_csv(..., encoding="utf8"), you can do the following:

pd.read_csv(..., encoding="unicode-escape")
# ...
# Now, your values will be strings but improperly decoded:
#    itâs time to eat
#
# So we encode to bytes then decode properly:
val = val.encode('latin1').decode('utf8')
print(val)   # it’s time to eat
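For per-value conversion there is also an alternative worth sketching. It is a different technique from the unicode-escape chain used in this answer: if each cell is literally the text of a Python bytes literal, such as b"it\xe2\x80\x99s time to eat", then ast.literal_eval can parse it back into bytes, which can then be decoded as UTF-8. The helper name decode_bytes_repr is hypothetical, and the sketch assumes the cells are well-formed literals; anything else is returned unchanged.

```python
import ast


def decode_bytes_repr(cell):
    """Hypothetical helper: turn the text of a bytes literal into a str."""
    # Only touch strings that look like b"..." or b'...'.
    if isinstance(cell, str) and cell[:2] in ('b"', "b'"):
        try:
            value = ast.literal_eval(cell)
        except (ValueError, SyntaxError):
            # Not a valid literal: leave the cell as it was.
            return cell
        if isinstance(value, bytes):
            # errors="replace" keeps one bad row from aborting the whole run.
            return value.decode("utf8", errors="replace")
    return cell


# The cell as it would appear after reading the CSV as plain text.
cell = 'b"it\\xe2\\x80\\x99s time to eat"'
print(decode_bytes_repr(cell))  # it’s time to eat
```

With pandas, something like file_open["text"].map(decode_bytes_repr) should apply the helper to every row of the column.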

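However, I think it is better to process the whole file at once rather than each value separately, for example with StringIO (if the file is not too large):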

from io import StringIO

import pandas as pd

# Read the csv file into a StringIO object
sio = StringIO()
with open('yourfile.csv', 'r', encoding='unicode-escape') as f:
    for line in f:
        line = line.encode('latin1').decode('utf8')
        sio.write(line)
sio.seek(0)    # Reset file pointer to the beginning

# Call read_csv, passing the StringIO object
df = pd.read_csv(sio, encoding="utf8")
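The same idea can be tried without a file on disk. The sketch below assumes hypothetical file content held in a bytes object and uses the standard csv module instead of pandas so that it runs on its own. It also shows that the b"..." wrapper survives the conversion and has to be stripped separately (strip_wrapper is a hypothetical helper). One caveat: unicode-escape interprets every backslash sequence, so a literal \n inside a tweet would become a real line break and could split a row.

```python
import csv
from io import StringIO

# Hypothetical file content: a header plus one row with escaped UTF-8 bytes.
raw = b'id,text\n1,b"it\\xe2\\x80\\x99s time to eat"\n'

# Apply the three-step conversion once to the whole content.
text = raw.decode("unicode_escape").encode("latin1").decode("utf8")

rows = list(csv.reader(StringIO(text)))


def strip_wrapper(cell):
    """Hypothetical helper: drop a leading b" or b' and the matching closing quote."""
    if len(cell) >= 3 and cell[0] == "b" and cell[1] in "\"'" and cell[-1] == cell[1]:
        return cell[2:-1]
    return cell


cleaned = [[strip_wrapper(cell) for cell in row] for row in rows]
print(cleaned)  # [['id', 'text'], ['1', 'it’s time to eat']]
```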

Answer 1
2 accepted 2018-06-24 17:22:42
