
Store scraped table as dictionary and output as pandas DataFrame

I've scraped some data from the site given below. I'm having trouble exporting this data to Excel. I have also stored the scraped table as a dictionary, but the key/value pairs are not in sync. Would somebody please help?

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd



url = requests.get("http://stats.espncricinfo.com/ci/content/records/307847.html")
soup = bs(url.text, 'lxml')
soup_1 = soup.find(class_="recordsTable")
soup_pages = soup_1.find_all('a', href=True)

state_links = []

for link in soup_pages:
    state_links.append(link['href'])


for i in state_links:
    parse_link = "http://stats.espncricinfo.com" + i
    url_new = requests.get(parse_link)
    soup_new = bs(url_new.text, 'lxml')
    soup_table = soup_new.find(class_="engineTable")
    results = {}
    newdict = dict()

    for col in soup_table.findAll('th'):
        colname = (col.text).lstrip().rstrip()

    for row in soup_table.findAll("td"):
        rowname = row.text.lstrip().rstrip()

    newdict[col.text] = row.text
    print(newdict)

You are iterating over the lists but storing each value in the same variable, which overwrites it on every iteration. Try the code below; I think it will work.

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd

url = requests.get("http://stats.espncricinfo.com/ci/content/records/307847.html")
soup = bs(url.text, 'lxml')
soup_1 = soup.find(class_ = "recordsTable")
soup_pages = soup_1.find_all('a', href= True)

state_links =[]
state_id =[]
for link in soup_pages:
    state_links.append(link['href'])
    state_id.append(link.getText())

Total_dict = dict()

for a,year in zip(state_links,state_id):
    parse_link = "http://stats.espncricinfo.com"+a
    url_new = requests.get(parse_link)
    soup_new = bs(url_new.text, 'lxml')
    soup_table = soup_new.find(class_="engineTable")
    newdictlist = list()
    col_name =list()
    row_name =list()
    for col in soup_table.findAll('th'):
        col_name.append((col.text).lstrip().rstrip())
    for row in soup_table.findAll("td"):
        row_name.append(row.text.lstrip().rstrip())
    no_of_matches = len(row_name)/len(col_name)
    row_count=0
    for h in range(int(no_of_matches)):
        newdict = dict()
        for i in col_name:
            newdict[i] = row_name[row_count]
            row_count=row_count+1
        newdictlist.append(newdict)
    print(newdictlist)
    Total_dict[year] = newdictlist
print(Total_dict)

output: {'1877': [{'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'Australia', 'Margin': '45 runs', 'Ground': 'Melbourne', 'Match Date': 'Mar 15-19, 1877', 'Scorecard': 'Test # 1'}, {'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'England', 'Margin': '4 wickets', 'Ground': 'Melbourne', 'Match Date': 'Mar 31-Apr 4, 1877', 'Scorecard': 'Test # 2'}], '1879': [{'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'Australia', 'Margin': '10 wickets', 'Ground': 'Melbourne', 'Match Date': 'Jan 2-4, 1879', 'Scorecard': 'Test # 3'}], ...}
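The question also asks for the result as a pandas DataFrame and in Excel, which the loop above doesn't show. Each year's list of row dicts can be fed straight to pandas; a minimal sketch, using a hard-coded sample of the output above in place of the live `Total_dict` (the `Year` column name is my own addition):

```python
import pandas as pd

# sample of the Total_dict structure produced above (hard-coded, no scraping)
Total_dict = {
    '1877': [
        {'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'Australia',
         'Margin': '45 runs', 'Ground': 'Melbourne',
         'Match Date': 'Mar 15-19, 1877', 'Scorecard': 'Test # 1'},
        {'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'England',
         'Margin': '4 wickets', 'Ground': 'Melbourne',
         'Match Date': 'Mar 31-Apr 4, 1877', 'Scorecard': 'Test # 2'},
    ],
}

# flatten: one row per match, keeping the year as an extra column
rows = [dict(match, Year=year)
        for year, matches in Total_dict.items()
        for match in matches]
df = pd.DataFrame(rows)
print(df.shape)  # (2, 8)
# df.to_excel('matches.xlsx', index=False)  # requires openpyxl to be installed
```

`to_excel` writes the whole table in one call; the filename `matches.xlsx` is only an example.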

You have two loops but never store colname and rowname before adding them to newdict. Here is my solution. Be aware of the case where the size of val_list is greater than the size of key_list.

# create 2 lists to store key and value
key_list = []
val_list = []
newdict = dict()
for col in soup_table.findAll('th'):
    key_list.append((col.text).lstrip().rstrip())

for row in soup_table.findAll("td"):
    val_list.append(row.text.lstrip().rstrip())

index = 0
# loop key_list and add key pair to dict
# loop over key_list and add each key/value pair to the dict
for key in key_list:
    newdict[key] = val_list[index]   # lists are indexed with [], not ()
    index += 1
print(newdict)
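As an aside, the manual index bookkeeping can be avoided entirely: `zip` pairs keys with values and stops at the shorter sequence, which also covers the case where `val_list` is longer than `key_list`. A sketch with hard-coded sample lists in place of the scraped ones:

```python
# sample data standing in for the scraped header and cell text
key_list = ['Team 1', 'Team 2', 'Winner']
val_list = ['Australia', 'England', 'Australia', 'surplus value']

# zip stops at the shorter sequence, so the surplus value is simply ignored
newdict = dict(zip(key_list, val_list))
print(newdict)  # {'Team 1': 'Australia', 'Team 2': 'England', 'Winner': 'Australia'}
```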
