
Python Character Encoding Discrepancy

I am ingesting messages into a pandas DataFrame and trying to run some machine-learning functions over the data. When I run a tokenization function I get a KeyError: "..." that essentially spits out the body of one of the messages. Looking at the string, raw UTF-8 byte escapes appear in it, such as \xe2\x80\xa8 (a line-separator character) and \xe2\x82\xac (the euro sign). 1. Is this the cause of the error? 2. Why weren't these characters preserved the way they appear in the original messages or in the DataFrame?
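For context, those escape sequences are the UTF-8 byte encodings of U+2028 (LINE SEPARATOR) and U+20AC (EURO SIGN), which a quick Python 2 session confirms:

>>> '\xe2\x80\xa8'.decode('utf-8')
u'\u2028'
>>> '\xe2\x82\xac'.decode('utf-8')
u'\u20ac'

The full script: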

# coding=utf-8
from __future__ import print_function
import sys
reload(sys)
sys.setdefaultencoding("utf8")

import os
import re  # used by tokenize_only below
import pandas as pd

path = '//directory1//'

data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path + f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))

df = pd.DataFrame(data, columns=["message"])

df["label"] = "1"

path = '//directory2//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with open(path + f, "r") as myfile:
        data.append(myfile.read().replace('\n', ' '))

df2 = pd.DataFrame(data, columns=["message"])
df2["label"] = "0"

messages = pd.concat([df,df2], ignore_index=True)

import nltk
from sklearn import feature_extraction
from sklearn.feature_extraction.text import TfidfVectorizer

stopwords = nltk.corpus.stopwords.words('english')

def tokenize_only(text):
    # first tokenize by sentence, then by word, to ensure punctuation is caught as its own token
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        if re.search('[a-zA-Z]', token):
            filtered_tokens.append(token)
    return filtered_tokens

tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                   min_df=0.2, stop_words='english',
                                   use_idf=True, tokenizer=tokenize_only,
                                   ngram_range=(1, 2))  # analyzer = word

tfidf_matrix = tfidf_vectorizer.fit_transform(messages.message)  # fit the vectorizer to the corpus

terms = tfidf_vectorizer.get_feature_names()

totalvocab_tokenized = []

for msg in messages.message:
    # x = msg.decode('utf-8')
    x = unicode(msg, errors="replace")
    allwords_tokenized = tokenize_only(x)
    totalvocab_tokenized.extend(allwords_tokenized)

vocab_frame = pd.DataFrame({'words': totalvocab_tokenized})
print(vocab_frame)

I have tried decoding each message as UTF-8 and as Unicode, and also running the last for loop without those two lines, but I keep getting the error.

Any ideas?

Thanks!

  1. It looks like you are printing the repr() of the data; when Python cannot encode the output it may fall back to escaping it. Print the actual str or unicode value instead (see the short demo after this list).

  2. Get rid of the reload(sys) / sys.setdefaultencoding("utf8") hack; it only masks problems. If removing it surfaces new exceptions, investigate those instead.

  3. Open the text files with automatic decoding instead. Assuming your input is UTF-8 (a complete version of the reading loop follows after this list):

     with io.open(path+f, "r", encoding="utf-8") as myfile: 
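A minimal sketch of the repr() behaviour from item 1, in Python 2 (which the reload(sys) idiom in the question implies):

euro = '\xe2\x82\xac'         # UTF-8 bytes for the euro sign
print(repr(euro))             # '\xe2\x82\xac' -- the escaped byte form
print(euro.decode('utf-8'))   # the euro sign itself, if the terminal can display it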
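And a sketch of the full reading loop from item 3, assuming every input file really is valid UTF-8 (io.open decodes for you, so no later unicode() call is needed):

import io
import os

path = '//directory1//'
data = []
for f in [f for f in os.listdir(path) if not f.startswith('.')]:
    with io.open(path + f, "r", encoding="utf-8") as myfile:
        # read() now yields unicode decoded from UTF-8
        data.append(myfile.read().replace('\n', ' '))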
