How to iterate through a list in a web-scraped table column and return one result per item?

Question

I have Python code that scrapes the correct data from the web, but each cell in the Guests column contains more than one string and currently only one is being pulled through. How do I iterate through the list within that column cell and return the three guests as separate columns, ideally guest1, guest2, guest3? Thanks.

import requests
import pandas as pd
from bs4 import BeautifulSoup
import numpy as np

df = pd.DataFrame(columns=(['NoInSeason', 'Guests', 'Winner', 'OriginalAirDate']))
page = requests.get("https://en.wikipedia.org/wiki/List_of_QI_episodes")
soup = BeautifulSoup(page.content, "lxml")
my_tables = soup.find_all("table",{"class":"wikitable plainrowheaders wikiepisodetable"})
for table in my_tables:
    table_rows = table.find_all("tr")
    for tr in table_rows:
        td = tr.find_all("td")
        if len(td) == 5:
            NoInSeason = td[0].find(text=True)
            Guests = td[2].find_all(text=True)
            Winner  = td[3].find(text=True)
            OriginalAirDate = td[4].find(text=True) 
            if len(Guests) == 3:
                Guest1 = Guests[0]
                Guest2 = Guests[1]
                Guest3 = Guests[2]
                df = df.append({'NoInSeason': NoInSeason, 'Guest1' : Guest1, 'Guest2' : Guest2, 'Guest3' : Guest3, 'Winner': Winner, 'OriginalAirDate' : OriginalAirDate}, ignore_index=True)
df.to_csv("output.csv")
print(df)

Answer 1

Is this what you were looking for?

import requests
import pandas as pd
from bs4 import BeautifulSoup

page = requests.get("https://en.wikipedia.org/wiki/List_of_QI_episodes")
soup = BeautifulSoup(page.content, "lxml")
my_tables = soup.find_all("table", {"class": "wikitable plainrowheaders wikiepisodetable"})

# Collect rows in a list; DataFrame.append was removed in pandas 2.0
rows = []
for table in my_tables:
    table_rows = table.find_all("tr")
    for tr in table_rows:
        td = tr.find_all("td")
        if len(td) == 5:
            NoInSeason = td[0].find(text=True)
            Guests = td[2].find_all(text=True)
            Winner = td[3].find(text=True)
            OriginalAirDate = td[4].find(text=True)
            print(Guests)
            try:
                rows.append({'NoInSeason': NoInSeason, 'Guest 1': Guests[0], 'Guest 2': Guests[1], 'Guest 3': Guests[2], 'Winner': Winner, 'OriginalAirDate': OriginalAirDate})
            except IndexError:
                # Skip rows whose guest cell yields fewer than three strings
                continue

df = pd.DataFrame(rows, columns=['NoInSeason', 'Guest 1', 'Guest 2', 'Guest 3', 'Winner', 'OriginalAirDate'])
print(df)

Edit: I see you changed your code; does it work now? Also, wouldn't it work better to include the Guest1, Guest2, and Guest3 columns in the DataFrame, so that you don't get a 'Guests' column full of NaN?
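
To illustrate the point about column names, here is a minimal sketch of splitting a list-valued column into separate Guest1, Guest2, Guest3 columns. The sample rows and guest names are made up for illustration (they are not scraped from the page), and the sketch assumes each Guests cell already holds a Python list of strings.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped rows
rows = [
    {"NoInSeason": "1", "Guests": ["Guest A", "Guest B", "Guest C"], "Winner": "Guest A"},
    {"NoInSeason": "2", "Guests": ["Guest D", "Guest E"], "Winner": "Guest D"},
]
df = pd.DataFrame(rows)

# Expand each list into its own columns; shorter lists are padded with missing values
guests = pd.DataFrame(df["Guests"].tolist(), index=df.index)
guests.columns = [f"Guest{i + 1}" for i in range(guests.shape[1])]

# Replace the single 'Guests' column with the expanded columns
result = pd.concat([df.drop(columns="Guests"), guests], axis=1)
print(result)
```

Because the guest columns are created from the data itself, no empty 'Guests' column is left behind, and rows with fewer than three guests are kept rather than skipped.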
