
Python Character Encoding Discrepancy

I am ingesting messages into a pandas DataFrame and attempting to run some machine learning functions on the data. When I run a tokenisation function I get a KeyError: "..." that basically spits out the content of one of the messages. Looking at the string, UTF-8 byte sequences appear in it, such as \xe2\x80\xa8 (space) and \xe2\x82\xac (Euro currency sign).

1. Is this the cause of the error?
2. Why aren't these symbols kept as they appear in the original messages or in the DataFrame?

# coding=utf-8
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding("utf8")

import os
import re  # used by tokenize_only below
import pandas as pd

path = '//directory1//'

data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))

df = pd.DataFrame(data, columns=["message"])

df["label"] = "1"

path = '//directory2//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path+f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))

df2 = pd.DataFrame(data, columns=["message"])
df2["label"] = "0"

messages = pd.concat([df,df2], ignore_index=True)

import nltk
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')

def tokenize_only(text):
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_only, ngram_range=(1,2)) # analyzer = word

tfidf_matrix = tfidf_vectorizer.fit_transform(messages.message) #fit the vectorizer to corpora

terms = tfidf_vectorizer.get_feature_names()

totalvocab_tokenized = []

for msg in messages.message:
    # x = msg.decode('utf-8')
    x = unicode(msg, errors="replace")
    allwords_tokenized = tokenize_only(x)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized})
print(vocab_frame)

I tried decoding each message to UTF-8, to unicode, and running without those two lines in the last for loop, but I keep getting an error.

Any ideas?

Thanks!

  1. It looks like you're printing a repr() of the data. If UTF-8 can't be printed, Python may choose to escape it. Print the actual string or Unicode instead.
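
     A minimal Python 2 sketch of the difference (the byte values are just the ones quoted in the question; whether the character renders depends on your terminal encoding):

     s = '\xe2\x82\xac'          # UTF-8 bytes for the Euro sign
     print(repr(s))              # escaped form: '\xe2\x82\xac'
     print(s.decode('utf-8'))    # the actual character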

  2. Get rid of the sys.setdefaultencoding("utf8") and the reload(sys) - it masks issues. If you get new exceptions, let's investigate those.

  3. Open your text files with automatic decoding. Assuming your input is UTF-8:

     with io.open(path+f, "r", encoding="utf-8") as myfile: 
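
     For example, the reading loop from the question could look like this (a minimal sketch, assuming the files really are UTF-8):

     import io
     import os
     import pandas as pd

     path = '//directory1//'
     data = []
     for f in [f for f in os.listdir(path) if not f.startswith('.')]:
         # io.open decodes the bytes to unicode while reading
         with io.open(path + f, "r", encoding="utf-8") as myfile:
             data.append(myfile.read().replace('\n', ' '))

     df = pd.DataFrame(data, columns=["message"])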
