Why do only the first two rows of html table from web page get read?

I'm trying to scrape data from an HTML table on a web page. I've tried a few different methods based on answers posted here, but I always hit the same problem: the result is roughly what I expect, but only for the first two rows of the table. I have little experience with HTML and Beautiful Soup, but looking at the page's HTML source I can't see any difference between the first two rows and the rest of the table. Could anyone help me figure out what I'm doing wrong?

import numpy
import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'

# First method
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')

for tr in table_rows:
    print(tr)

>>>(prints html text for first two rows)


# Second method
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print (df)

>>> ASAS-SN Other ATEL  RA  Dec  ...   SDSS    DSS Vizier Spectroscopic Class Comments
0      ID   IDs  TNS NaN  NaN  ...  image  image   data                 NaN      NaN

[1 rows x 12 columns]

Sometimes people write incorrect HTML, but web browsers may still display it correctly because they don't strictly enforce all the rules.
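One rough way to check whether the source really contains more rows than a parser returns is to count the raw `<tr>` start tags without building a tree. The sketch below uses only the standard library; `TagCounter`, `count_start_tags` and the `broken` sample string are illustrative names and data, not something from the page in question.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags by name without building a document tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        # Called once per start tag, regardless of nesting or validity.
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_start_tags(source_text, tag):
    """Return how many times <tag ...> appears in source_text."""
    counter = TagCounter()
    counter.feed(source_text)
    counter.close()
    return counter.counts.get(tag, 0)


# Deliberately broken sample: a stray </table> appears before the last row.
broken = "<table><tr><td>1</td></tr></table><tr><td>2</td></tr>"
print(count_start_tags(broken, "tr"))  # 2
```

If this raw count is much larger than `len(table.find_all('tr'))`, the rows are in the source and the tree-building parser is the likely place where they get lost.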

BeautifulSoup can use three different parsers (lxml, html.parser and html5lib), and they may parse broken HTML in different ways.

soup = BeautifulSoup(source, 'lxml')
soup = BeautifulSoup(source, 'html.parser')
soup = BeautifulSoup(source, 'html5lib')

It seems lxml skips the malformed elements, so you get only two rows, whereas html5lib can parse them and gives all rows.

soup = BeautifulSoup(source, 'html5lib')
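To see the difference on your own page, a small helper can report how many rows each parser finds. This is only a sketch: `count_rows_by_parser` and the `sample` string are made up for illustration, and parsers that are not installed are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_rows_by_parser(source, parsers=("lxml", "html.parser", "html5lib")):
    """Return {parser name: number of <tr> found in the first <table>}."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(source, name)
        except FeatureNotFound:
            # The parser library is not installed; skip it.
            continue
        table = soup.find("table")
        counts[name] = len(table.find_all("tr")) if table is not None else 0
    return counts


# Small, well-formed sample, so every available parser should agree here.
sample = "<table><tr><td>1</td></tr><tr><td>2</td></tr><tr><td>3</td></tr></table>"
print(count_rows_by_parser(sample))
```

Passing the downloaded `source` instead of `sample` should show which parsers lose rows on the real page.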

BTW:

In the BeautifulSoup documentation, the section "Installing a parser" has a table comparing these parsers (advantages and disadvantages).

People may use lxml because it is implemented in C, so it can run faster.


BTW:

pandas can read an HTML table directly from a URL

read_html(url)

The documentation says it uses lxml as the default parser, but you can change it to bs4 + html5lib

read_html(html, flavor="bs4")

read_html(url, flavor="bs4")

or

read_html(html, flavor="html5lib")

read_html(url, flavor="html5lib")

I tested it and flavor="html5lib" gives all rows.
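As a minimal sketch of the call, assuming pandas, bs4 and html5lib are installed: the pandas documentation describes the "bs4" and "html5lib" flavors as synonymous, and newer pandas versions prefer a file-like object over a literal HTML string, hence the `StringIO` wrapper. `read_tables` and `sample` are illustrative names and data.

```python
from io import StringIO

import pandas as pd


def read_tables(html_text, flavor="html5lib"):
    """Parse every <table> in html_text with the given pandas flavor."""
    # header=0 treats the first table row as the column names.
    return pd.read_html(StringIO(html_text), flavor=flavor, header=0)


sample = (
    "<table>"
    "<tr><td>ID</td><td>RA</td></tr>"
    "<tr><td>a</td><td>1.0</td></tr>"
    "<tr><td>b</td><td>2.0</td></tr>"
    "</table>"
)
tables = read_tables(sample)
print(len(tables), tables[0].shape)
```

For the real page, the text could come from `requests.get(url).text`, or the URL can be passed to `pd.read_html` directly as shown above.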

Check this out; it worked for me.

import pandas as pd
url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'
df_list = pd.read_html(url, header=0, flavor='bs4')
df = df_list[0]
print (df)

I think the problem here is the index. Try this code for the second method:

html = requests.get(url).content
df_list = pd.read_html(html, header=0)
df = df_list[0]
print (df)
