Wiki Scraping missing data

I am trying to extract the table from https://en.wikipedia.org/wiki/Megacity as my initial foray into the world of scraping (in full transparency, I took this code from a blog I read). I got the program to work, but instead of getting the city I have \n (which also appears at the end of every field). Question: why do I have \n at the end of every field, and why is my first field (city) blank? Listed below is part of the code and output.

import requests
scrapeLink = 'https://en.wikipedia.org/wiki/Megacity'
page = requests.get(scrapeLink)

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

megaTable = soup.find_all('table')[1]


rowValList = []    
for i in range(len(megaTable.find_all('td'))):
    rowVal = megaTable.find_all('td')[i].get_text()
    rowValList.append(rowVal)

cityList = []
for i in range(0, len(rowValList), 6):
    cityList.append(rowValList[i])

countryList = []
for i in range(1, len(rowValList), 6):
    countryList.append(rowValList[i])

contList = []
for i in range(2, len(rowValList), 6):
    contList.append(rowValList[i])

popList = []
for i in range(3, len(rowValList), 6):
    popList.append(rowValList[i])

import pandas as pd

megaDf = pd.DataFrame()
megaDf['City'] = cityList
megaDf['Country'] = countryList
megaDf['Continent'] = contList
megaDf['Population'] = popList
megaDf

Output (screenshot not reproduced)

The reason is that the city is not inside a td tag but a th tag.

<th scope="row"><a href="/wiki/Bangalore" title="Bangalore">Bangalore</a></th>

And the first td you are referring to is in fact the image column. You can select the city name by getting the th tag.
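
For example, here is a minimal standalone snippet (using a hard-coded row that mimics the markup above, not the live page) demonstrating both points: get_text() keeps the newline that ends each Wikipedia cell, strip=True removes it, and the city has to come from the th rather than a td:

from bs4 import BeautifulSoup

# A hard-coded row for illustration, mirroring the Bangalore row above
html = ('<tr><th scope="row"><a href="/wiki/Bangalore">Bangalore</a></th>'
        '<td>India\n</td></tr>')
row = BeautifulSoup(html, "html.parser").tr

print(repr(row.td.get_text()))      # 'India\n' -- the stray newline from the question
print(row.td.get_text(strip=True))  # India -- strip=True trims the whitespace
print(row.th.get_text(strip=True))  # Bangalore -- the city lives in th, not td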

Also, you can simplify your crawler by first getting the rows of the table and then selecting the necessary tags for each row, i.e. th and td.

import requests
from bs4 import BeautifulSoup

scrapeLink = "https://en.wikipedia.org/wiki/Megacity"
page = requests.get(scrapeLink)


soup = BeautifulSoup(page.content, "html.parser")

megaTable = soup.find_all("table")[1]

cities = []
# [2:] skips the first 2 `tr` elements, which contain the headers
for row in megaTable.find_all("tr")[2:]:
    city = row.th.get_text().strip()
    tds = row.find_all("td")
    country = tds[1].get_text().strip()
    continent = tds[2].get_text().strip()
    population = tds[3].get_text().strip()
    cities.append({
        "city": city,
        "country": country,
        "continent": continent,
        "popluation": population,
    })

print(cities)
[
    {
        "city": "Bangalore",
        "country": "India",
        "continent": "Asia",
        "population": "12,200,00"
    },
    # and so on
]

You can then convert the list into a dataframe:

import pandas as pd

df = pd.DataFrame(cities)
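
As an optional follow-up to the snippet above (an assumption about what you might want next, not part of the original answer), the comma-formatted population strings can be converted to integers so the column is usable for sorting and arithmetic:

# Continuing from the df built above: population values are strings
# like "12,200,000", so strip the commas before casting to int.
df["population"] = df["population"].str.replace(",", "", regex=False).astype(int)
print(df.sort_values("population", ascending=False).head())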
