How to get rid of \ufeff in parsed HTML page
!wget -q -O 'boroughs.html' "https://en.wikipedia.org/wiki/List_of_London_boroughs"

from bs4 import BeautifulSoup

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp, "lxml")

data = []
table = soup.find("table", {"class": "wikitable sortable"})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col for col in cols if col])  # Get rid of empty values
data
Try using utf8 instead:
from lxml import html

with open('boroughs.html', encoding='utf8') as fp:
    doc = html.fromstring(fp.read())

data = []
rows = doc.xpath("//table/tbody/tr")
for row in rows:
    cols = row.xpath("./td/text()")
    cols = [col.strip() for col in cols if col.strip()]
    data.append(cols)
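The XPath extraction above can be sanity-checked on an inline snippet; the miniature table markup below is a hypothetical stand-in for the downloaded Wikipedia page (requires lxml installed):

```python
from lxml import html

# Hypothetical miniature of the borough table
page = "<table><tbody><tr><td> Camden </td><td></td><td>London</td></tr></tbody></table>"
doc = html.fromstring(page)

data = []
for row in doc.xpath("//table/tbody/tr"):
    # text() yields no node for the empty cell; strip() drops surrounding whitespace
    cols = [col.strip() for col in row.xpath("./td/text()") if col.strip()]
    data.append(cols)

print(data)  # [['Camden', 'London']]
```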
import os
from bs4 import BeautifulSoup

os.system('wget -q -O "boroughs.html" "https://en.wikipedia.org/wiki/List_of_London_boroughs"')

with open('boroughs.html', encoding='utf-8-sig') as fp:
    soup = BeautifulSoup(fp, "lxml")

data = []
table = soup.find("table", {"class": "wikitable sortable"})
table_body = table.find('tbody')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append([col.replace(u'\ufeff', '') for col in cols])
print(data)
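If the BOM has already leaked into the scraped strings, a plain str.replace pass removes it; the cell values below are made up for illustration:

```python
# Hypothetical scraped cells, one polluted by a leading BOM
cols = ['\ufeffInner London', 'Camden']
clean = [col.replace('\ufeff', '') for col in cols]
print(clean)  # ['Inner London', 'Camden']
```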
Try the following:
with open('boroughs.html', encoding='utf-8-sig') as fp:
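Opening with encoding='utf-8-sig' works because that codec consumes a leading byte-order mark, while plain 'utf-8' decodes it as the character U+FEFF; a minimal sketch:

```python
# UTF-8 bytes with a byte-order mark, as some tools save them
raw = b'\xef\xbb\xbfCamden'

print(raw.decode('utf-8'))      # '\ufeffCamden' (BOM survives)
print(raw.decode('utf-8-sig'))  # 'Camden' (BOM stripped)
```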