Wiki Scraping missing data

I am trying to extract the table from https://en.wikipedia.org/wiki/Megacity as my initial foray into the world of scraping (in full transparency, I took this code from a blog I read). I got the program to work, but instead of getting the city I have \n (which also appears at the end of every field). Question: why do I have \n at the end of every field, and why is my first field (city) blank? Listed below is part of the code and output.

import requests
scrapeLink = 'https://en.wikipedia.org/wiki/Megacity'
page = requests.get(scrapeLink)

from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')

megaTable = soup.find_all('table')[1]


rowValList = []    
for i in range(len(megaTable.find_all('td'))):
    rowVal = megaTable.find_all('td')[i].get_text()
    rowValList.append(rowVal)

cityList = []
for i in range(0, len(rowValList), 6):
    cityList.append(rowValList[i])

countryList = []
for i in range(1, len(rowValList), 6):
    countryList.append(rowValList[i])

contList = []
for i in range(2, len(rowValList), 6):
    contList.append(rowValList[i])

popList = []
for i in range(3, len(rowValList), 6):
    popList.append(rowValList[i])

import pandas as pd

megaDf = pd.DataFrame()
megaDf['City'] = cityList
megaDf['Country'] = countryList
megaDf['Continent'] = contList
megaDf['Population'] = popList
megaDf

Output (screenshot not reproduced)

The reason is that the city is not inside a td tag but a th tag.

<th scope="row"><a href="/wiki/Bangalore" title="Bangalore">Bangalore</a></th>

And the first td you are referring to is in fact the image column. You can select the city name by getting the th tag.
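
For example, here is a minimal standalone snippet (using a hard-coded row that mimics the markup above, not the live page) demonstrating both points: get_text() keeps the newline that ends each Wikipedia cell, strip=True removes it, and the city has to come from the th rather than a td:

from bs4 import BeautifulSoup

# A hard-coded row for illustration, mirroring the Bangalore row above
html = ('<tr><th scope="row"><a href="/wiki/Bangalore">Bangalore</a></th>'
        '<td>India\n</td></tr>')
row = BeautifulSoup(html, "html.parser").tr

print(repr(row.td.get_text()))      # 'India\n' -- the stray newline from the question
print(row.td.get_text(strip=True))  # India -- strip=True trims the whitespace
print(row.th.get_text(strip=True))  # Bangalore -- the city lives in th, not td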

Also, you can simplify your crawler by first getting the rows of the table and then selecting the necessary tags for each row, i.e. th and td.

import requests
from bs4 import BeautifulSoup

scrapeLink = "https://en.wikipedia.org/wiki/Megacity"
page = requests.get(scrapeLink)


soup = BeautifulSoup(page.content, "html.parser")

megaTable = soup.find_all("table")[1]

cities = []
# [2:] skips the first 2 `tr` elements, which contain the headers
for row in megaTable.find_all("tr")[2:]:
    city = row.th.get_text().strip()
    tds = row.find_all("td")
    country = tds[1].get_text().strip()
    continent = tds[2].get_text().strip()
    population = tds[3].get_text().strip()
    cities.append({
        "city": city,
        "country": country,
        "continent": continent,
        "popluation": population,
    })

print(cities)
[
    {
        "city": "Bangalore",
        "country": "India",
        "continent": "Asia",
        "population": "12,200,00"
    },
    # and so on
]

You can then convert the list into a dataframe:

import pandas as pd

df = pd.DataFrame(cities)
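
As an optional follow-up to the snippet above (an assumption about what you might want next, not part of the original answer), the comma-formatted population strings can be converted to integers so the column is usable for sorting and arithmetic:

# Continuing from the df built above: population values are strings
# like "12,200,000", so strip the commas before casting to int.
df["population"] = df["population"].str.replace(",", "", regex=False).astype(int)
print(df.sort_values("population", ascending=False).head())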
