UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte
When I try to run the code below, it gives me an error. I have installed every Python module it needs, including nltk, and I have also added lxml and numpy, but it still does not work. I am using Python 3, so I changed urllib2 to urllib.request.

Please help me find a solution. I am running the script with:

python index.py

My index file is as follows. Here is the code:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import ssl
import os
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import codecs

def checkChar(token):
    # Reject tokens containing anything other than ASCII letters
    for char in token:
        if (0 <= ord(char) <= 64) or (91 <= ord(char) <= 96) or (123 <= ord(char)):
            return False
    return True

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser")
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split(" "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text

path = 'crawled_html_pages/'
index = {}
docNum = 0
stop_words = set(stopwords.words('english'))

for filename in os.listdir(path):
    collection = {}
    docNum += 1
    file = codecs.open('crawled_html_pages/' + filename, 'r', 'utf-8')
    page_text = cleanMe(file)
    tokens = nltk.word_tokenize(page_text)
    filtered_sentence = []
    breakWord = ''
    for w in tokens:
        if w not in stop_words:
            filtered_sentence.append(w.lower())
    for token in filtered_sentence:
        if len(token) == 1 or token == 'and':
            continue
        if checkChar(token) == False:
            continue
        if token == 'giants':
            breakWord = token
            continue
        if token == 'brady' and breakWord == 'giants':
            break
        if token not in collection:
            collection[token] = 0
        collection[token] += 1
    for token in collection:
        if token not in index:
            index[token] = ''
        index[token] = index[token] + '(' + str(docNum) + ', ' + str(collection[token]) + ')'
    if docNum == 500:
        print(index)
        break

f = open('index.txt', 'w')
vocab = open('uniqueWords.txt', 'w')
for term in index:
    f.write(term + ' => ' + index[term])
    vocab.write(term + '\n')
    f.write('\n')
f.close()
vocab.close()
print('Finished...')
This is the error I get:
> C:\Users\myworld>python index.py
Traceback (most recent call last):
File "index.py", line 49, in <module>
page_text = cleanMe(file)
File "index.py", line 22, in cleanMe
soup = BeautifulSoup(html, "html.parser")
File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\site-packages\beautifulsoup4-4.6.0-py3.6.egg\bs4\__init__.py", line 191, in __init__
File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\codecs.py", line 700, in read
return self.reader.read(size)
File "C:\Users\furqa\AppData\Local\Programs\Python\Python36-32\lib\codecs.py", line 503, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131:
invalid start byte
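The failure itself can be reproduced in isolation: in UTF-8, the byte 0x80 is a continuation byte and can never start a character, so decoding any byte string that has it in a leading position raises exactly this error, while a single-byte codec accepts the same bytes. A minimal sketch:

```python
# 0x80 can never be the first byte of a UTF-8 sequence,
# so decoding these bytes as UTF-8 fails immediately.
data = b"abc\x80def"
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # invalid start byte

# The same bytes decode without error under a single-byte codec:
print(data.decode("latin-1"))
```

This is why codecs.open(..., 'utf-8') blows up on crawled pages: at least one of the files on disk is not actually UTF-8 encoded.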
You can change the encoding BeautifulSoup uses by passing the from_encoding parameter:

soup = BeautifulSoup(html, from_encoding="iso-8859-8")
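Note that from_encoding only has an effect when BeautifulSoup receives raw bytes, so the crawled files should be read in binary mode rather than decoded with codecs.open(..., 'utf-8'). A minimal sketch (the sample markup and encoding are illustrative, not the asker's actual data):

```python
from bs4 import BeautifulSoup

# A page encoded in iso-8859-8; its Hebrew bytes are not valid UTF-8,
# so decoding it as UTF-8 would raise UnicodeDecodeError.
raw = "<html><body><p>שלום</p></body></html>".encode("iso-8859-8")

# Feed the undecoded bytes to BeautifulSoup (e.g. from open(path, 'rb'))
# and tell it which codec to use; omit from_encoding to let it guess.
soup = BeautifulSoup(raw, "html.parser", from_encoding="iso-8859-8")
print(soup.get_text())  # שלום
```

In the asker's loop that would mean replacing codecs.open(path, 'r', 'utf-8') with open(path, 'rb') and letting BeautifulSoup do the decoding.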