I need to extract the data behind the <a href="#">Data</a> link at the URL below. Any clue how to extract this table into a DataFrame?
from bs4 import BeautifulSoup
import requests
url = 'https://docs.google.com/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, features='html.parser')
#print(soup.prettify())
print(soup.title)
It might be easier to start off with a list of lists and then port it to a DataFrame; that way we aren't assuming any sizes. The "Data" hyperlink references the div with id=0, so we select all the rows inside that div, then collect each column of each row into a list (which I call elements). That list is appended to the full list (which I call elementsfull) and reset for each new row.
from bs4 import BeautifulSoup
import pandas as pd
import requests
url = 'https://docs.google.com/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#'
r = requests.get(url)
html_doc = r.text
soup = BeautifulSoup(html_doc, features='html.parser')
#print(soup.prettify())
print(soup.title.text)
datadiv = soup.find("div", {"id": "0"})
elementsfull = []
row = 0
for tr in datadiv.find_all("tr"):
    elements = []
    column = 0
    for td in tr.find_all("td"):
        if td.text != '':
            elements.append(td.text)
            column += 1
            #print('column: ', column)
    elementsfull.append(elements)
    #print('row: ', row)
    row += 1
mydf = pd.DataFrame(data=elementsfull)
print(mydf)
I tested this code and checked it against the table, so I guarantee it works.
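As a variation on the loop above, here is a minimal sketch that runs against an inline HTML snippet instead of the live spreadsheet, so it can be tried offline. The sample markup, the function name table_to_dataframe, and the choice to keep empty cells (so that values stay in their own columns) are assumptions made for illustration and are not part of the original answer.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical sample markup standing in for the published sheet.
SAMPLE_HTML = """
<div id="0">
  <table>
    <tr><td>Name</td><td>Score</td></tr>
    <tr><td>Alice</td><td>10</td></tr>
    <tr><td>Bob</td><td></td></tr>
  </table>
</div>
"""


def table_to_dataframe(html, div_id="0"):
    """Collect the <td> text of every <tr> inside the div with the given id."""
    soup = BeautifulSoup(html, features="html.parser")
    container = soup.find("div", {"id": div_id})
    if container is None:
        raise ValueError(f"no div with id={div_id!r} found")
    rows = []
    for tr in container.find_all("tr"):
        # Keep empty cells as empty strings so values stay in their columns.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # skip rows that contain no <td> at all
            rows.append(cells)
    return pd.DataFrame(rows)


df = table_to_dataframe(SAMPLE_HTML)
print(df)
```

For the real page, the same function could be called with r.text from the request above, assuming the table still lives inside the div with id 0.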
import bs4 as bs
import requests
import pandas as pd
url = 'https://docs.google.com/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#'
r = requests.get(url)
html_doc = r.text
soup = bs.BeautifulSoup(html_doc, features='html.parser')
# Note: this lookup is not used below; the rows are collected from the whole page.
table = soup.find('table', attrs={'class':'subs noBorders evenRows'})
table_rows = soup.find_all('tr')
list1 = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.text for cell in cells]
    list1.append(row)
df = pd.DataFrame(list1)
# Use the second parsed row as the column labels.
df.columns = df.iloc[1]
# From this point on, it is just a matter of how you want to clean and slice the data.
df = df.iloc[3:263]  # check the data to see whether these are the rows you want
df = df.dropna(axis='columns', how='all')
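The header and slicing steps can be tried on a small in-memory frame before running them on the scraped data. This is only a sketch: the sample rows and the helper name promote_header are hypothetical, and the position of the header row in the real sheet has to be checked by eye.

```python
import pandas as pd

# Hypothetical raw rows: a blank row, a header row, then two data rows.
raw = pd.DataFrame([
    [None, None, None],
    ["id", "name", "value"],
    ["1", "a", "x"],
    ["2", "b", "y"],
])


def promote_header(df, header_row):
    """Use the row at position header_row as column labels and keep the rows below it."""
    out = df.iloc[header_row + 1:].copy()
    out.columns = df.iloc[header_row].tolist()
    return out.reset_index(drop=True)


clean = promote_header(raw, header_row=1)
print(clean)
```

Copying the slice before relabelling it avoids modifying a view of the original frame, which is why the helper calls .copy() first.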
You could use pandas' read_html and then handle the resulting DataFrame as required.
import pandas as pd
results = pd.read_html('https://docs.google.com/spreadsheets/d/1dgOdlUEq6_V55OHZCxz5BG_0uoghJTeA6f83br5peNs/pub?range=A1:D70&gid=1&output=html#')
# read_html returns a list of DataFrames, one per table found on the page.
result = results[0].dropna(how='all')
del result[0]  # drop the column labelled 0
result = result.dropna(axis='columns', how='all')
result.to_csv(r'C:\Users\User\Desktop\Data.csv', sep=',', encoding='utf_8_sig', index=False, header=False)