UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 448: ordinal not in range(128)
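I am currently using Selenium with Python to scrape LinkedIn data. I can step through the pages and scrape the data, but the process is interrupted after the first few pages by a Unicode error. Here is my code: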
from selenium import webdriver
from time import sleep

driver = webdriver.Firefox()
driver.get('https://www.linkedin.com/jobs/search?locationId=sg%3A0&f_TP=1%2C2&orig=FCTD&trk=jobs_jserp_posted_one_week')

result = []
while True:
    while True:
        try:
            sleep(1)
            result +=[i.text for i in driver.find_elements_by_class_name('job-title-text')]
        except:
            sleep(5)
        else:
            break
    try:
        for i in range(50):
            nextbutton = driver.find_element_by_class_name('next-btn')
            nextbutton.click()
    except:
        break

with open('jobtitles.csv', 'w') as f:
    f.write('\n'.join(i for i in result).encode('utf-8').decode('utf-8'))
You can use UnicodeWriter (from the Python 2 csv documentation):
import codecs
import cStringIO
import csv
from time import sleep
from selenium import webdriver

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([s.encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

driver = webdriver.Firefox()
driver.get('https://www.linkedin.com/jobs/search?locationId=sg%3A0&f_TP=1%2C2&orig=FCTD&trk=jobs_jserp_posted_one_week')

result = []
while True:
    while True:
        try:
            sleep(1)
            result +=[i.text for i in driver.find_elements_by_class_name('job-title-text')]
        except:
            sleep(5)
        else:
            break
    try:
        for i in range(50):
            nextbutton = driver.find_element_by_class_name('next-btn')
            nextbutton.click()
    except:
        break

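# Python 2: open in binary mode so the csv module controls the line endings
with open('jobtitles.csv', 'wb') as f:
    doc = UnicodeWriter(f)
    # writerows expects an iterable of rows, so wrap each title in a one-item list;
    # passing the bare strings would split every title into one column per character
    doc.writerows([title] for title in result)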
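
For readers on Python 3, the helper class above should not be needed, because the csv module there works with text directly and the encoding is chosen when the file is opened. A minimal sketch, assuming Python 3; the function name write_titles is made up for illustration and is not part of the original answer:

```python
import csv


def write_titles(path, titles):
    """Write one job title per row to a UTF-8 encoded CSV file (Python 3)."""
    # newline='' lets the csv module manage line endings itself
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        for title in titles:
            # each row is a list of columns, so wrap the title in a list
            writer.writerow([title])
```

Calling write_titles('jobtitles.csv', result) after the scraping loop would then take the place of the final with block.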
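This is an encoding problem. Your text is being converted with the ASCII codec, which only allows characters in the range 0-127, so the conversion fails at the reported position, where the en dash u'\u2013' appears. I cannot see from your code exactly how and when that conversion happens, so you should trace the exact position yourself by checking the type() of your variables. Also note that Python 2 and Python 3 differ in this respect.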
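
To locate the offending characters, something like the following may help. It is only a sketch written for Python 3; non_ascii_positions is a hypothetical helper name and the sample title is invented:

```python
def non_ascii_positions(text):
    """Return (index, character) pairs that the ASCII codec cannot encode."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


title = 'Engineer \u2013 Backend'  # invented sample containing an en dash

try:
    title.encode('ascii')
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character that failed to encode
    print('ascii failed at position', exc.start)

# list every character outside the ASCII range, with its position
print(non_ascii_positions(title))
# text and encoded bytes are different types; mixing them causes implicit conversions in Python 2
print(type(title), type(title.encode('utf-8')))
```

Running the helper over each scraped title shows which entries contain characters outside the ASCII range.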
# Python 2 only: site.py removes setdefaultencoding at startup, so reload(sys) restores it
import sys
reload(sys)
sys.setdefaultencoding("utf-8")
print sys.getdefaultencoding()
Add this to the top of your code.
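
Note that sys.setdefaultencoding is not available in Python 3 and changes behaviour for the whole process. A more local alternative, sketched here under the assumption that the scraped titles are unicode strings, is to open the output file with an explicit encoding through io.open, which exists in both Python 2.6+ and Python 3. The name save_lines is invented for this example:

```python
import io


def save_lines(path, lines):
    """Write unicode strings to a file, one per line, encoded as UTF-8."""
    # io.open encodes the text on write, so no implicit ASCII conversion happens
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(u'\n'.join(lines))
```

With this helper, save_lines('jobtitles.csv', result) would write the titles without touching the interpreter's default encoding.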
In addition, you may need to preprocess your text to strip out non-English words:
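import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# content holds the text to clean
words = word_tokenize(content)
# print words
word = []
for w in words:
    # drop punctuation
    w = re.sub(r'[^\w\s]', '', w)
    # replace every run of characters other than English letters with a space
    w = re.sub("[^A-Za-z]+", " ", w, flags=re.MULTILINE)
    w = w.strip("\t\n\r")
    word.append(w)
words = word
# print words
stop_words = set(stopwords.words('english'))
# keep words longer than 3 characters that are not stop words
filteredword = [w for w in words if w not in stop_words and len(w) > 3]
# print filteredword
words = " ".join(filteredword)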
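
If only the non-ASCII characters need to go, a lighter-weight sketch that uses just the standard library is shown here. It is not the NLTK pipeline above; to_ascii is a hypothetical helper, and the choice to turn dashes into '-' is an assumption made for this example:

```python
import unicodedata


def to_ascii(text):
    """Best-effort conversion of text to plain ASCII."""
    # map common dash characters to a hyphen before dropping anything
    for dash in (u'\u2013', u'\u2014'):
        text = text.replace(dash, u'-')
    # decompose accented letters into a base letter plus a combining accent
    decomposed = unicodedata.normalize('NFKD', text)
    # drop whatever still cannot be represented in ASCII
    return decomposed.encode('ascii', 'ignore').decode('ascii')
```

This loses information, so it only makes sense when the output really has to be ASCII; writing the file as UTF-8 keeps the titles intact.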