

Messing up my unicode output - but where and how?

I am doing a word count on some text files, storing the results in a dictionary. My problem is that after outputting to file, the words are not displayed right even if they were in the original text (I use TextWrangler to look at them). For instance, dashes show up as dashes in the original but as \u2014 in the output; in the output, every word is prefixed by a u as well.

Problem

I do not know where, when and how in my script this happens.

I am reading the files with codecs.open() and outputting them both with codecs.open() and with json.dump(). They both go wrong in the same way. In between, all I do is

  1. tokenizing

  2. regular expressions

  3. collect in dictionary

And I don't know where I mess things up; I have de-activated tokenizing and most other functions to no effect. All this is happening in Python 2. Following previous advice, I tried to keep everything within the script in Unicode.

Here is what I do (non-relevant code omitted):

#read in file, iterating over a list of "fileno"s
with codecs.open(os.path.join(dir, unicode(fileno)+".txt"), "r", "utf-8") as inputfili:
    inputtext = inputfili.read()

#process the text: tokenize, lowercase, remove punctuation and conjugation
content = ...  # regular expression to extract text w/out metadata (omitted)
contentsplit=nltk.tokenize.word_tokenize(content)
text=[i.lower() for i in contentsplit if not re.match(r"\d+", i)]
text= [re.sub(r"('s|s|s's|ed)\b", "", i) for i in text if i not in string.punctuation]

#build the dictionary of word counts (dicti is presumably a collections.defaultdict(list); its setup is omitted)
for word in text:
    dicti[word].append(word)

#collect counts for each word, make dictionary of unique words
dicti_nos = {unicode(k): len(v) for k, v in dicti.items()}
hapaxdicti = {k: v for k, v in dicti_nos.items() if v == 1}

#sort the dictionary
sorteddict=sorted(dictionary.items(), key=lambda x: x[1], reverse=True)

#output the results as .txt and json-file
with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([unicode(i) for i in sorteddict]))
with open(file_name+".json", "w") as jsonoutputi:
    json.dump(dictionary, jsonoutputi,  encoding="utf-8")

EDIT: Solution

Looks like my main issue was writing the file in the wrong way. If I change my code to what's reproduced below, things work out. Looks like joining a list of (string, number) tuples messed the string part up; if I join the tuples first, things work.
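
To illustrate (a minimal Python 2 sketch, assuming an em dash shows up in the word list): calling unicode() on a whole (string, number) tuple goes through repr(), which escapes non-ASCII characters and adds the u'' prefix, which is exactly what I was seeing in my output.

pair = (u'\u2014', 3)                         # (em dash, count)
print unicode(pair)                           # (u'\u2014', 3) -- repr-style: escaped and u-prefixed
print u":".join([pair[0], unicode(pair[1])])  # prints the real em dash followed by ":3"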

For the json output, I had to change to codecs.open() and set ensure_ascii to False. Apparently just setting the encoding to utf-8 does not do the trick like I thought.

with codecs.open(file_name, "w", "utf-8") as outputi:
    outputi.write("\n".join([":".join([i[0],unicode(i[1])]) for i in sorteddict]))

with codecs.open(file_name+".json", "w", "utf-8") as jsonoutputi:
    json.dump(dictionary, jsonoutputi,  ensure_ascii=False)
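
For comparison, a minimal sketch (Python 2, with a made-up dictionary) of what ensure_ascii changes; as far as I can tell, the encoding argument only tells json how to decode byte-string input, not how the output is written.

import json

d = {u"caf\xe9": 1}
print json.dumps(d)                       # {"caf\u00e9": 1} -- every non-ASCII character escaped
print json.dumps(d, ensure_ascii=False)   # a unicode string containing the real character: {"café": 1}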

Thanks for your help!

As your example is partially pseudocode, there's no way to run a real test and give you something that runs and has been tested, but from reading what you have provided I think you may misunderstand the way Unicode works in Python 2.

The unicode type (such as is produced via the unicode() or unichr() functions) is meant to be an internal representation of a Unicode string that can be used for string manipulation and comparison purposes. It has no associated encoding. The unicode() function will take a buffer as its first argument and an encoding as its second argument and interpret that buffer using that encoding to produce an internally usable Unicode string that is from that point forward unencumbered by encodings.
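
For example (a minimal Python 2 sketch, not taken from your code): decoding a UTF-8 byte string produces a unicode object whose length and comparisons work per character, independent of any encoding.

raw = '\xe2\x80\x94'          # three bytes: the UTF-8 encoding of an em dash
text = unicode(raw, 'utf-8')  # u'\u2014', a single character
print len(raw), len(text)     # 3 1
print text == u'\u2014'       # True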

That Unicode string isn't meant to be written out to a file; all file formats assume some encoding, and you're supposed to provide one again before writing that Unicode string out to a file. Every place you have a construct like unicode(fileno), unicode(k), or unicode(i) is suspect, both because you're relying on a default encoding (which probably isn't what you want) and because you're going on to expose most of these values directly to the file system.
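
A quick sketch of the default-encoding problem (Python 2, with a made-up byte string): without an explicit encoding, unicode() falls back to ASCII and fails on any non-ASCII byte.

word = 'caf\xc3\xa9'              # UTF-8 bytes for "café"
try:
    unicode(word)                 # no encoding given, so ASCII is assumed
except UnicodeDecodeError as e:
    print e                       # 'ascii' codec can't decode byte 0xc3 ...
print unicode(word, 'utf-8')      # decoding works once the encoding is explicit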

After you're done working with these Unicode strings, you can use the built-in encode() method on them, with your desired encoding as an argument, to pack them into strings of ordinary bytes laid out as required by your encoding.

So looking back at your example above, your inputtext variable is an ordinary string containing data encoded per the UTF-8 encoding. This isn't Unicode. You could convert it to a Unicode string with an operation like inputuni = unicode(inputtext, 'utf-8') and operate on it like that if you chose, but for what you're doing you may not even find it necessary. If you did convert it to Unicode, though, you'd have to perform the equivalent of an inputuni.encode('UTF-8') on any Unicode string that you were planning on writing out to your file.
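
Putting that together, here is a minimal sketch of the decode-then-encode round trip (Python 2; the file names are placeholders, not from your script):

with open('input.txt', 'rb') as f:
    raw = f.read()                        # plain byte string, assumed to be UTF-8
inputuni = unicode(raw, 'utf-8')          # decode once on the way in
processed = inputuni.lower()              # ...do any string manipulation on the unicode object...
with open('output.txt', 'wb') as f:
    f.write(processed.encode('utf-8'))    # encode exactly once on the way out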
