UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>

I am new to Python. I have a .txt file (size: 15,259 KB). I want to load the file and process it, but I keep getting the error "UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>".

import nltk
from nltk import FreqDist
from nltk.collocations import *
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
#Read the datasets
path = "C:\\tmp\\FILENAME.txt"
dataset={}
dataset_raw = {}
allFeatures=set()
tot_articles = 0
articles_count={}

N={} # Number of articles in each corpus

for category in categories:
    fileName=path
    f=open(fileName,'r')
    text = ''
    text_raw = ''    
    lines=(f.readlines())
    tot_articles+=len(lines)
    articles_count[category] = len(lines)
    dataset_raw[category] = list(map(lambda line: line.lower(), lines))

    for line in lines:
        text+=line.replace('\n',' ').lower()
        text_raw = line.lower()
    f.close()
    N[category]=len(lines)

    tokens = nltk.word_tokenize(text)
    dataset[category] = nltk.Text(tokens)

Here is the error I get:

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-14-222e94b75803> in <module>
     14     text = ''
     15     text_raw = ''
---> 16     lines=(f.readlines())
     17     tot_articles+=len(lines)
     18     articles_count[category] = len(lines)

~\Anaconda3\lib\encodings\cp1252.py in decode(self, input, final)
     21 class IncrementalDecoder(codecs.IncrementalDecoder):
     22     def decode(self, input, final=False):
---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     24 
     25 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1915: character maps to <undefined>

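Try specifying the encoding when you open the file. The traceback shows Python decoding with cp1252, the default picked up on this Windows setup, and byte 0x9d has no character assigned in that code page. If the file was actually saved as UTF-8, passing that encoding explicitly should avoid the error.

For example: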

f=open(fileName,'r', encoding="utf8")
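To see why the explicit encoding matters, here is a minimal sketch that reproduces the failure and then applies the fix. It assumes the file is UTF-8, which the question does not confirm. The sample text and the temporary file are hypothetical stand-ins for FILENAME.txt. A curly closing quote (U+201D) is used because its UTF-8 bytes are E2 80 9D, and 0x9D is undefined in cp1252.

```python
import os
import tempfile

# Hypothetical sample data standing in for the real file.
# U+201D (right double quotation mark) is E2 80 9D in UTF-8.
sample = 'He said \u201chello\u201d\n'

# Write the sample as raw UTF-8 bytes to a temporary file.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(sample.encode('utf-8'))
    file_name = tmp.name

try:
    # Reproduce the failure by forcing the codec named in the traceback.
    try:
        with open(file_name, 'r', encoding='cp1252') as f:
            f.readlines()
        cp1252_failed = False
    except UnicodeDecodeError as exc:
        cp1252_failed = True
        print('cp1252 failed:', exc)

    # Reading with the encoding the file was written in succeeds.
    # The with-statement also closes the file automatically.
    with open(file_name, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    print(lines)
finally:
    os.remove(file_name)
```

If the real file turns out not to be UTF-8, the same UnicodeDecodeError will appear with a different codec name. In that case the `errors` parameter of `open()` (for example `errors='replace'`) lets the read finish, at the cost of substituting a placeholder for undecodable bytes.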
