How can I webscrape a Wikipedia table with lists of data instead of rows?
I am trying to scrape the data from the Localities table on the Wikipedia page https://en.wikipedia.org/wiki/Districts_of_Warsaw.
I would like to collect this data and put it into a dataframe with two columns, ["Districts"] and ["Neighbourhoods"].
So far my code looks like this:
import urllib.request
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, "html")

table = soup.find_all('table')[2]
A = []
B = []
for row in table.findAll('tr'):
    cells = row.findAll('td')
    if len(cells) == 2:
        A.append(cells[0].find(text=True))
        B.append(cells[1].find(text=True))

df = pd.DataFrame(A, columns=['Neighbourhood'])
df['District'] = B
print(df)
This gives the following dataframe:
Of course the Neighbourhood column is not scraped correctly, since the neighbourhoods are contained in lists, but I don't know how to handle that, so I would be glad for any hints.
Apart from that, I would appreciate any hint as to why the scrape only gives me 10 districts instead of 18.
Are you sure you are scraping the right table? As I understand it, you need the second table, which contains the 18 districts with their neighbourhoods listed.
Also, I am not sure how you want the districts and neighbourhoods arranged in the DataFrame; I have set the districts as columns and the neighbourhoods as rows. You can change that as you wish.
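A quick way to see which index holds the table you want is to enumerate every table BeautifulSoup finds and print its first row. Here is a minimal sketch on a toy HTML document (the two tables and their headers are made up for illustration); on the real page you would parse the fetched HTML instead:

```python
from bs4 import BeautifulSoup

# Toy HTML standing in for the Wikipedia page: two tables whose first
# rows identify them, so you can confirm which index you need.
html = """
<table><tr><th>Flag</th></tr></table>
<table><tr><th>District</th></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")
for i, t in enumerate(soup.find_all("table")):
    # print the table index and the text of its first row
    print(i, t.find("tr").get_text(strip=True))
# 0 Flag
# 1 District
```

Running the same loop against the live page shows that `soup.find_all('table')[2]` in your code points one table past the Localities table, which is why you only see part of the data.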
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://en.wikipedia.org/wiki/Districts_of_Warsaw"
page = requests.get(url)
soup = BeautifulSoup(page.text, "html.parser")
table = soup.find_all("table")[1]

def process_list(tr):
    result = []
    for td in tr.findAll("td"):
        result.append([x.string for x in td.findAll("li")])
    return result

districts = []
neighbourhoods = []
for row in table.findAll("tr"):
    if row.find("ul"):
        neighbourhoods.extend(process_list(row))
    else:
        districts.extend([x.string.strip() for x in row.findAll("th")])

# Check and arrange as you wish
for i in range(len(districts)):
    print(f'District {districts[i]} has neighbourhoods: {", ".join(neighbourhoods[i])}')

df = pd.DataFrame()
for i in range(len(districts)):
    df[districts[i]] = pd.Series(neighbourhoods[i])
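The last loop works even though the districts have different numbers of neighbourhoods, because assigning a `pd.Series` to a DataFrame column aligns on the index and pads missing rows with NaN. A minimal sketch with made-up district data:

```python
import pandas as pd

# Hypothetical data: two districts with unequal numbers of neighbourhoods.
districts = {"Wola": ["Czyste", "Koło", "Mirów"], "Ursus": ["Czechowice", "Gołąbki"]}

df = pd.DataFrame()
for name, hoods in districts.items():
    # shorter Series are padded with NaN when assigned as a column
    df[name] = pd.Series(hoods)

print(df.shape)  # (3, 2) -- three rows, the shorter column NaN-padded
```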
A couple of tips:
element.string — gets the text from an element.
string.strip() — removes any leading (spaces at the beginning) and trailing (spaces at the end) characters (whitespace is the default character to remove), i.e. it cleans up the text.
You can use the fact that the odd rows are districts and the even rows are neighbourhoods: walk the odd rows to iterate over the district headers, and use findNext to grab the neighbourhoods from the row below:
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from itertools import zip_longest

soup = bs(requests.get('https://en.wikipedia.org/wiki/Districts_of_Warsaw').content, 'lxml')
table = soup.select_one('h2:contains("Localities") ~ .wikitable')  # isolate table of interest

results = []
for row in table.select('tr')[0::2]:  # walk the odd rows
    for i in row.select('th'):  # walk the districts
        # zip the current district to the list of neighbourhoods in the row below;
        # fill with the district name to get lists of equal length
        r = list(zip_longest([i.text.strip()],
                             [i.text for i in row.findNext('tr').select('li')],
                             fillvalue=i.text.strip()))
        results.append(r)

results = [i for j in results for i in j]  # flatten list of lists
df = pd.DataFrame(results, columns=['District', 'Neighbourhood'])
print(df)
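The `zip_longest` trick above is worth seeing in isolation: zipping a one-element list (the district) against a longer list (its neighbourhoods) with the district name as `fillvalue` repeats the district for every pair. A small illustration with made-up names:

```python
from itertools import zip_longest

# One district, several neighbourhoods (hypothetical examples).
district = "Mokotów"
hoods = ["Augustówka", "Czerniaków", "Sadyba"]

# The single-element list is exhausted after the first pair, so
# fillvalue repeats the district name for the remaining neighbourhoods.
pairs = list(zip_longest([district], hoods, fillvalue=district))
print(pairs)
# [('Mokotów', 'Augustówka'), ('Mokotów', 'Czerniaków'), ('Mokotów', 'Sadyba')]
```

Each pair is then ready to become one (District, Neighbourhood) row of the final DataFrame.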