Anaconda: UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1412: invalid start byte

I want to calculate TF-IDF for a set of 10 documents. I use Python (Anaconda) for this.

import nltk
import string
import os

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer

path = '/opt/datacourse/data/parts'
token_dict = {}
stemmer = PorterStemmer()

def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

def tokenize(text):
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

for subdir, dirs, files in os.walk(path):
    for file in files:
        file_path = subdir + os.path.sep + file
        shakes = open(file_path, 'r')
        text = shakes.read()
        lowers = text.lower()
        no_punctuation = lowers.translate(None, string.punctuation)
        token_dict[file] = no_punctuation

tfidf = TfidfVectorizer(tokenizer=tokenize, stop_words='english')
tfs = tfidf.fit_transform(token_dict.values())

But when I run tfs = tfidf.fit_transform(token_dict.values()), I get the following error message:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 1412: invalid start byte

How do I fix this error?

I was using the same reference for data preprocessing and got exactly the same error. These are the steps I took to get working code on Python 2.7 on an Ubuntu 14.04 machine:

1) Use "codecs" to open the file and set the "encoding" parameter to ISO-8859-1. Here is how you do it:

import codecs
with codecs.open(pathToYourFileWithFileName, "r", encoding="ISO-8859-1") as file_handle:
    text = file_handle.read()

2) Once you have done this first step, you run into a second problem when using

no_punctuation = lowers.translate(None, string.punctuation)

which is explained in the question "string.translate() with unicode data in python".

The solution looks like this:

lowers = text.lower()
remove_punctuation_map = dict((ord(char), None) for char in string.punctuation)
no_punctuation = lowers.translate(remove_punctuation_map)
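
Putting the two steps together, here is a minimal sketch, assuming Python 3, where io.open with an encoding argument plays the role of codecs.open. The helper name read_and_clean and the throwaway demo file are hypothetical and only there for illustration; they are not part of the original code.

```python
import io
import os
import string
import tempfile

# Map every ASCII punctuation code point to None so translate() drops it.
REMOVE_PUNCTUATION_MAP = {ord(char): None for char in string.punctuation}


def read_and_clean(file_path, encoding="ISO-8859-1"):
    """Read a file with an explicit encoding, lowercase it, strip punctuation."""
    with io.open(file_path, "r", encoding=encoding) as file_handle:
        text = file_handle.read()
    return text.lower().translate(REMOVE_PUNCTUATION_MAP)


# Demo on a temporary file that contains the problematic byte 0x92.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as tmp:
    tmp.write(b"Hello, World! It\x92s fine.")
    demo_path = tmp.name

cleaned = read_and_clean(demo_path)
os.remove(demo_path)
```

Note that ISO-8859-1 can decode any byte, so it never raises, but it turns 0x92 into a control character rather than a printable one. If the files came from Windows, passing encoding="cp1252" to the helper may give nicer text, since there 0x92 is a right single quotation mark.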

I hope it helps.

Your data is encoded with a different encoding :)

To decode the data in a string, use the following:

myvar.decode("ENCODING")

Here ENCODING can be any encoding name. fit_transform does the same decoding in the background, using "utf-8", which is why it fails on your data.

You should try "latin1" or "latin2"; along with utf-8, they are the most commonly used encodings.
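
If you are not sure which encoding applies, one way to find out is to try a few candidates in order. The sketch below assumes Python 3 and raw bytes as input; the helper name decode_with_fallback and the candidate list are my own assumptions (I added cp1252 because byte 0x92 is a right single quotation mark in that encoding).

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin1")):
    """Return (text, encoding_name) for the first encoding that decodes raw."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# 0x92 is not a valid UTF-8 start byte, so the utf-8 attempt fails
# and the helper falls through to cp1252.
raw = b"It\x92s a test"
text, used = decode_with_fallback(raw)
```

Because latin1 accepts every byte value, it works as a last resort but says little about whether the guess is right, so it is worth looking at the decoded text before trusting it.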

Cheers

