I'm trying to scrape data from an HTML table on a web page. I've tried a few different methods based on answers posted here, but I always run into the same problem: the result is roughly what I expect, but only for the first two rows of the table. I have little experience with HTML and Beautiful Soup, and in the HTML source of the table at that URL I can't see any difference between the first two rows and the rest of the table. Could anyone help me figure out what I'm doing wrong?
import pandas as pd
import urllib.request
import requests
from bs4 import BeautifulSoup
url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'
# First method
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    print(tr)
>>>(prints html text for first two rows)
# Second method
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print (df)
>>> ASAS-SN Other ATEL RA Dec ... SDSS DSS Vizier Spectroscopic Class Comments
0 ID IDs TNS NaN NaN ... image image data NaN NaN
[1 rows x 12 columns]
Sometimes people write incorrect HTML, but web browsers may still display it correctly because they don't strictly enforce all the rules.
BeautifulSoup can use three different parsers - lxml, html.parser and html5lib - and they may parse broken HTML in different ways:
soup = BeautifulSoup(source, 'lxml')
soup = BeautifulSoup(source, 'html.parser')
soup = BeautifulSoup(source, 'html5lib')
It seems lxml skips the malformed elements, so you get only two rows, while html5lib can recover from the incorrect markup and gives you all the rows:
soup = BeautifulSoup(source, 'html5lib')
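You can see the difference between parsers on a tiny broken fragment (this snippet is an illustration of parser leniency in general, not taken from the transients page):

```python
from bs4 import BeautifulSoup

# A fragment with a stray closing tag; each parser recovers differently.
broken = "<a></p>"

lenient = BeautifulSoup(broken, "html.parser")
browser_like = BeautifulSoup(broken, "html5lib")

# html.parser keeps the fragment as-is and drops the stray </p>.
print(lenient)       # <a></a>

# html5lib rebuilds a full, valid document around the fragment,
# the same way a web browser would.
print(browser_like)  # <html><head></head><body><a><p></p></a></body></html>
```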
BTW: in the BeautifulSoup documentation, in the section "Installing a parser", there is a table comparing the advantages and disadvantages of these parsers. People often use lxml because it is implemented in C, so it works faster.
BTW: pandas can read an HTML table directly from a URL:
read_html(url)
The documentation shows that it uses lxml as the default parser, but you can change it to bs4 + html5lib:
read_html(html, flavor="bs4")
read_html(url, flavor="bs4")
or
read_html(html, flavor="html5lib")
read_html(url, flavor="html5lib")
I tested it, and flavor="html5lib" gives all the rows.
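As a self-contained sketch of the flavor parameter (the table contents here are made up for illustration, and StringIO is used because newer pandas versions deprecate passing a raw HTML string):

```python
import pandas as pd
from io import StringIO

# A minimal table; the row contents are made up for illustration.
html = """
<table>
  <tr><th>ID</th><th>RA</th></tr>
  <tr><td>ASASSN-19aa</td><td>01:23:45</td></tr>
  <tr><td>ASASSN-19ab</td><td>02:34:56</td></tr>
</table>
"""

# flavor="bs4" tells pandas to parse with BeautifulSoup + html5lib
# instead of the default lxml parser.
df = pd.read_html(StringIO(html), flavor="bs4")[0]
print(df.shape)  # (2, 2)
```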
Check this out; it worked for me.
import pandas as pd

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'
df_list = pd.read_html(url, header=0, flavor='bs4')
df = df_list[0]
print (df)
I think the problem here is how the header rows are read. Try this code for the second method:
import requests
import pandas as pd

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'
html = requests.get(url).content
df_list = pd.read_html(html, header=0)
df = df_list[0]
print(df)
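To see what header=0 changes, here is a sketch with a small made-up inline table whose header cells are ordinary <td> elements, so pandas has no <th> row to infer a header from:

```python
import pandas as pd
from io import StringIO

# A made-up table whose header cells are plain <td> elements.
html = """
<table>
  <tr><td>ID</td><td>RA</td></tr>
  <tr><td>ASASSN-19aa</td><td>01:23:45</td></tr>
</table>
"""

# Without header=0, the first row is read as data and the
# columns get default integer labels.
print(pd.read_html(StringIO(html))[0].columns.tolist())            # [0, 1]

# With header=0, the first row is promoted to the column names.
print(pd.read_html(StringIO(html), header=0)[0].columns.tolist())  # ['ID', 'RA']
```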