
Getting an error while scraping the dates

I am scraping the list of US presidents using Beautiful Soup and urllib.request. I want to scrape both dates for each president, the start and the end of the presidency, but for some reason I get a "list index out of range" error. Here is the page so you can see what I mean: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States

from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = BeautifulSoup(page_html , 'html.parser' )
containers = page_soup.find_all('table' , class_ = 'wikitable')
#print(containers[0])
#print(len(containers))
#print(containers[0].prettify())
container = containers[0]
date =container.find_all('span' , attrs = {'class': 'date'})
#print(len(date))
#print(date[0].text)

for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)

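find_all returns an empty list when nothing matches, so indexing the result with [0] raises "IndexError: list index out of range" for any table that contains no date spans.

You can avoid this by collecting whatever each table contains instead of indexing: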

all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
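
If you do want only the first date of each table, as in the question's loop, an explicit emptiness check works too. The following is a minimal sketch on a made-up HTML snippet (sample_html is hypothetical, not the real Wikipedia markup), assuming that some wikitable on the page has no span with class "date":

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page: the second table has no date spans,
# standing in for a wikitable on the real page that holds no dates.
sample_html = """
<table class="wikitable">
  <tr><td><span class="date">April 30, 1789</span></td>
      <td><span class="date">March 4, 1797</span></td></tr>
</table>
<table class="wikitable">
  <tr><td>No dates here</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")
containers = soup.find_all("table", class_="wikitable")

first_dates = []
for container in containers:
    date_container = container.find_all("span", attrs={"class": "date"})
    if not date_container:
        # Empty result: indexing [0] here would raise IndexError.
        continue
    first_dates.append(date_container[0].text)

print(first_dates)  # ['April 30, 1789']
```

The check skips tables without dates instead of crashing on them, which is usually what you want when a page mixes data tables with layout or notes tables.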

With container = containers[0], as in your code, a list comprehension collects the text of every date span in the first "wikitable" table:

date = [x.text for x in container.find_all('span' , attrs = {'class': 'date'})]
print(date)

Which will print:

['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
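
Since the question asks for both the start and the end date, the flat list can be paired up afterwards. This is only a sketch: it assumes every row contributes exactly two date spans in start, end order, which seems to hold for the values printed above but may break for a row such as the incumbent's. The names dates and terms are illustrative.

```python
# First four values from the printed list above.
dates = ['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801']

# Even positions are assumed to be start dates, odd positions end dates.
terms = list(zip(dates[0::2], dates[1::2]))

for start, end in terms:
    print(start, '->', end)
# April 30, 1789 -> March 4, 1797
# March 4, 1797 -> March 4, 1801
```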

Since the data sits in <table> tags, have you considered using pandas' .read_html()? It parses the HTML with lxml or BeautifulSoup under the hood, takes a lot of the work out, and puts the tables straight into dataframes for you. The only work left is any manipulation or cleanup/filtering:

import pandas as pd
import re

my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'

# Returns a list of dataframes
dfs = pd.read_html(my_url)

# Get the specific dataframe with the desired columns
# (.copy() avoids a SettingWithCopyWarning when adding columns below)
df = dfs[1].iloc[:,[1,3]].copy()

# Rename the columns
df.columns = ['Date','Name']

# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)

# Clean up the name column using regex to pull out the name
df['Name'] =  [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]

# Drop duplicate rows
df.drop_duplicates(inplace = True) 


print (df)

Output:

print (df.to_string())
                      Name                  Start                               End
0        George Washington      April 30, 1789[d]                     March 4, 1797
1               John Adams          March 4, 1797                     March 4, 1801
2         Thomas Jefferson          March 4, 1801                     March 4, 1809
3            James Madison          March 4, 1809                     March 4, 1817
4             James Monroe          March 4, 1817                     March 4, 1825
5        John Quincy Adams          March 4, 1825                     March 4, 1829
6           Andrew Jackson          March 4, 1829                     March 4, 1837
7         Martin Van Buren          March 4, 1837                     March 4, 1841
8   William Henry Harrison          March 4, 1841     April 4, 1841(Died in office)
9               John Tyler       April 4, 1841[i]                     March 4, 1845
10           James K. Polk          March 4, 1845                     March 4, 1849
11          Zachary Taylor          March 4, 1849      July 9, 1850(Died in office)
12        Millard Fillmore        July 9, 1850[k]                     March 4, 1853
13         Franklin Pierce          March 4, 1853                     March 4, 1857
14          James Buchanan          March 4, 1857                     March 4, 1861
15         Abraham Lincoln          March 4, 1861      April 15, 1865(Assassinated)
16          Andrew Johnson         April 15, 1865                     March 4, 1869
17        Ulysses S. Grant          March 4, 1869                     March 4, 1877
18     Rutherford B. Hayes          March 4, 1877                     March 4, 1881
19       James A. Garfield          March 4, 1881  September 19, 1881(Assassinated)
20       Chester A. Arthur  September 19, 1881[n]                     March 4, 1885
21        Grover Cleveland          March 4, 1885                     March 4, 1889
22       Benjamin Harrison          March 4, 1889                     March 4, 1893
23        Grover Cleveland          March 4, 1893                     March 4, 1897
24        William McKinley          March 4, 1897  September 14, 1901(Assassinated)
25      Theodore Roosevelt     September 14, 1901                     March 4, 1909
26     William Howard Taft          March 4, 1909                     March 4, 1913
27          Woodrow Wilson          March 4, 1913                     March 4, 1921
28       Warren G. Harding          March 4, 1921    August 2, 1923(Died in office)
29         Calvin Coolidge      August 2, 1923[o]                     March 4, 1929
30          Herbert Hoover          March 4, 1929                     March 4, 1933
31   Franklin D. Roosevelt          March 4, 1933    April 12, 1945(Died in office)
32         Harry S. Truman         April 12, 1945                  January 20, 1953
33    Dwight D. Eisenhower       January 20, 1953                  January 20, 1961
34         John F. Kennedy       January 20, 1961   November 22, 1963(Assassinated)
35       Lyndon B. Johnson      November 22, 1963                  January 20, 1969
36           Richard Nixon       January 20, 1969          August 9, 1974(Resigned)
37             Gerald Ford         August 9, 1974                  January 20, 1977
38            Jimmy Carter       January 20, 1977                  January 20, 1981
39           Ronald Reagan       January 20, 1981                  January 20, 1989
40       George H. W. Bush       January 20, 1989                  January 20, 1993
41            Bill Clinton       January 20, 1993                  January 20, 2001
42          George W. Bush       January 20, 2001                  January 20, 2009
43            Barack Obama       January 20, 2009                  January 20, 2017
44            Donald Trump       January 20, 2017                         Incumbent
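
The Start and End values above still carry footnote markers such as [d] and notes such as (Died in office). One possible cleanup step, sketched with the standard library only, is shown below; clean_date is a hypothetical helper, and the format string assumes the "Month day, year" layout seen in the output.

```python
import re
from datetime import datetime

def clean_date(raw):
    """Strip bracketed footnotes and parenthesised notes, then parse the date."""
    text = re.sub(r"\[[^\]]*\]|\([^)]*\)", "", raw).strip()
    try:
        return datetime.strptime(text, "%B %d, %Y").date()
    except ValueError:
        # Not a date at all, e.g. "Incumbent"
        return None

samples = ["April 30, 1789[d]", "April 4, 1841(Died in office)", "Incumbent"]
cleaned = [clean_date(s) for s in samples]
print(cleaned)
# [datetime.date(1789, 4, 30), datetime.date(1841, 4, 4), None]
```

With the dataframe from the answer, something like df['Start'].map(clean_date) should apply the same cleanup to a whole column.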
