Python problem with character encoding

Question

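I am working on a program that needs to take two files, merge them, and write the combined result to a new file. The problem is that the output file contains sequences like \\xf0, or, if I change some of the encodings, something like \(. The input files are encoded in UTF-8. How can I write characters such as "è" or "ò" and "-" to the output file?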

I have written the following code:

import codecs
import pandas as pd
import numpy as np


goldstandard = "..\\files\\file1.csv"
tweets = "..\\files\\file2.csv"

with codecs.open(tweets, "r", encoding="utf8") as t:
    tFile = pd.read_csv(t, delimiter="\t",
                        names=['ID', 'Tweet'],
                        quoting=3)

IDs = tFile['ID']
tweets = tFile['Tweet']

dict = {}
for i in range(len(IDs)):
    dict[np.int64(IDs[i])] = [str(tweets[i])]


with codecs.open(goldstandard, "r", encoding="utf8") as gs:
    for line in gs:
        columns = line.split("\t")
        index = np.int64(columns[0])
        rowValue = dict[index]
        rowValue.append([columns[1], columns[2], columns[3], columns[5]])
        dict[index] = rowValue

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)
f = codecs.open("out.csv", "w", "utf8")
f.write(ndic)
f.close()

Here is an example of the output:

   desired: Beyoncè
   obtained: Beyonc\xe9

Answer 1

You are producing Python string literals here:

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)

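Pretty-printing is useful for producing debugging output. Objects are passed through repr() so that non-printable and non-ASCII characters are easy to distinguish and reproduce: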

>>> import pprint
>>> value = u'Beyonc\xe9'
>>> value
u'Beyonc\xe9'
>>> print value
Beyoncé
>>> pprint.pprint(value)
u'Beyonc\xe9'

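The é character lies within the Latin-1 range but outside the ASCII range, so it is shown with an escape syntax that produces the same value again when used in Python code.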
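
The session above is Python 2. As a rough Python 3 illustration (the choice of interpreter is an assumption, not part of the original answer), the built-in ascii() produces the same kind of escaped literal, and ast.literal_eval() turns that literal back into the original value:

```python
import ast

value = "Beyonc\u00e9"  # the text "Beyoncé"

# ascii() builds a string literal with non-ASCII characters escaped,
# similar to what repr() produced for unicode objects on Python 2.
literal = ascii(value)
print(literal)

# Evaluating the literal as Python source reproduces the original value.
round_tripped = ast.literal_eval(literal)
print(round_tripped == value)
```

The escaped form is therefore lossless, but it is a debugging representation, not the text itself.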

If you want to write the actual string values to the output file, don't use pprint. In that case you will have to do the formatting yourself.
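
One possible way to do the formatting by hand is sketched below for Python 3. The `merged` dictionary, its contents, and the `format_rows` helper are hypothetical stand-ins for the asker's data, and the tab-separated layout is an assumption:

```python
import io
import os
import tempfile

# Hypothetical merged data: tweet ID -> list of column values.
merged = {
    1: ["Beyonc\u00e9 live", "positive"],
    2: ["caff\u00e8 time", "neutral"],
}

def format_rows(data):
    """Yield one tab-separated line per entry, without repr() escaping."""
    for key in sorted(data):
        yield "\t".join([str(key)] + data[key]) + "\n"

# Write the lines to a UTF-8 encoded file in a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "out.csv")
with io.open(out_path, "w", encoding="utf-8") as f:
    f.writelines(format_rows(merged))

# Read the file back to confirm the characters were stored as text.
with io.open(out_path, "r", encoding="utf-8") as f:
    written = f.read()
```

Because the values are joined as plain text instead of being passed through repr(), the file holds the accented characters themselves and no backslash escapes.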

Moreover, the pandas DataFrame will hold byte strings, not unicode objects, so at this point you still have undecoded UTF-8 data.
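
The difference between undecoded UTF-8 bytes and decoded text can be illustrated without pandas. This is a small Python 3 sketch with a made-up value, not a reproduction of what the DataFrame contains:

```python
raw = b"Beyonc\xc3\xa9"      # UTF-8 encoded bytes, as they sit on disk
text = raw.decode("utf-8")   # decoded text

print(len(raw))    # counts bytes: the é takes two of them
print(len(text))   # counts characters

# Encoding the text again yields the original bytes.
encoded_again = text.encode("utf-8")
```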

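Personally, I wouldn't even bother using pandas here. You appear to want to write CSV data, so I have simplified the code to use the csv module instead, and I don't actually decode the UTF-8 here (which is safe in this case because both input and output are entirely UTF-8):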

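import csv

# Python 2: the csv module works on byte strings, so open files in binary mode.
goldstandard = "..\\files\\file1.csv"
tweets_file = "..\\files\\file2.csv"

# Map each tweet ID to its text (both stay UTF-8 encoded byte strings).
tweets = {}
with open(tweets_file, "rb") as t:
    reader = csv.reader(t, delimiter='\t')
    for id_, tweet in reader:
        tweets[id_] = tweet

with open(goldstandard, "rb") as gs, open("out.csv", 'wb') as outf:
    reader = csv.reader(gs, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')
    for columns in reader:
        index = columns[0]
        writer.writerow([tweets[index]] + columns[1:4] + [columns[5]])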
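
The code above targets Python 2. On Python 3 the csv module works on text, so a comparable sketch would open the files in text mode with an explicit encoding. The `merge_files` function name and the tiny sample inputs are made up for illustration:

```python
import csv
import os
import tempfile

def merge_files(tweets_path, gold_path, out_path):
    """Join gold-standard rows with tweet text by ID, keeping UTF-8 text intact."""
    tweets = {}
    with open(tweets_path, "r", encoding="utf-8", newline="") as t:
        for id_, tweet in csv.reader(t, delimiter="\t"):
            tweets[id_] = tweet

    with open(gold_path, "r", encoding="utf-8", newline="") as gs, \
         open(out_path, "w", encoding="utf-8", newline="") as outf:
        writer = csv.writer(outf, delimiter="\t")
        for columns in csv.reader(gs, delimiter="\t"):
            writer.writerow([tweets[columns[0]]] + columns[1:4] + [columns[5]])

# Tiny made-up inputs to exercise the function.
workdir = tempfile.mkdtemp()
tweets_path = os.path.join(workdir, "file2.csv")
gold_path = os.path.join(workdir, "file1.csv")
out_path = os.path.join(workdir, "out.csv")

with open(tweets_path, "w", encoding="utf-8", newline="") as f:
    f.write("1\tBeyonc\u00e9 \u00e8 qui\n")
with open(gold_path, "w", encoding="utf-8", newline="") as f:
    f.write("1\ta\tb\tc\td\te\n")

merge_files(tweets_path, gold_path, out_path)
with open(out_path, "r", encoding="utf-8", newline="") as f:
    result = f.read()
```

Passing newline="" follows the csv module's documented recommendation for file objects; the writer then emits its default "\r\n" line terminator.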

Note that you really want to avoid using dict as a variable name. It masks the built-in type; I used tweets instead.
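
A brief illustration of why the shadowing matters, using two hypothetical helper functions: once the name dict is bound to an instance, calling it as a constructor in the same scope fails.

```python
def lookup_with_shadowing():
    dict = {"a": 1}          # shadows the built-in type inside this function
    try:
        return dict(b=2)     # the name now refers to an instance, not the type
    except TypeError as exc:
        return "TypeError: " + str(exc)

def lookup_without_shadowing():
    tweets = {"a": 1}        # a descriptive name leaves the built-in usable
    extra = dict(b=2)
    return tweets, extra

shadow_result = lookup_with_shadowing()
clean_result = lookup_without_shadowing()
```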

Answer 1
3 · accepted · 2016-04-28 19:36:30
