Why do only the first two rows of an HTML table from a web page get read?
I'm trying to scrape data from an HTML table on a web page. I've tried a few different methods based on answers posted here, but I always run into the same problem: the result is roughly what I expect, but only for the first two rows of the table. I have little experience with HTML and Beautiful Soup, but looking at the HTML source of the table at the URL, I can't see any difference between the first two rows and the rest of the table. Could anyone help me figure out what I'm doing wrong?
import pandas as pd
import requests
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'

# First method
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source, 'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')
for tr in table_rows:
    print(tr)
>>> (prints the HTML for only the first two rows)
# Second method
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print(df)
>>> ASAS-SN Other ATEL RA Dec ... SDSS DSS Vizier Spectroscopic Class Comments
0 ID IDs TNS NaN NaN ... image image data NaN NaN
[1 rows x 12 columns]
Sometimes people create incorrect HTML, but web browsers may still display it correctly because they don't strictly enforce all the rules.
BeautifulSoup can use three different parsers (lxml, html.parser, and html5lib), and they may parse broken HTML in different ways.
soup = BeautifulSoup(source, 'lxml')
soup = BeautifulSoup(source, 'html.parser')
soup = BeautifulSoup(source, 'html5lib')
It seems lxml skips the malformed elements, so you get only two rows, while html5lib can handle the incorrect markup and gives all the rows.
soup = BeautifulSoup(source, 'html5lib')
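To see the parsers disagree without fetching the page, here is a minimal sketch using a tiny made-up fragment (it assumes the lxml package is installed; the exact output can vary slightly between parser versions):

```python
from bs4 import BeautifulSoup

# A deliberately broken fragment: the </p> tag has no matching <p>
doc = "<a></p>"

# html.parser keeps the fragment bare and just drops the stray tag
print(BeautifulSoup(doc, "html.parser"))

# lxml drops the stray tag too, but wraps everything in <html><body>
print(BeautifulSoup(doc, "lxml"))
```

The same idea applies to the transients table: each parser repairs the broken rows differently, and html5lib is the most lenient of the three.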
BTW:
In the BeautifulSoup documentation, in the section on installing a parser, you can see a table comparing these parsers (advantages/disadvantages).
People may use lxml because it is implemented in C, so it can work faster.
BTW:
pandas can read an HTML table directly from a URL:
read_html(url)
In the documentation you can see that it uses lxml as the default parser, but you may change it to bs4 + html5lib:
read_html(html, flavor="bs4")
read_html(url, flavor="bs4")
or
read_html(html, flavor="html5lib")
read_html(url, flavor="html5lib")
I tested it, and flavor="html5lib" gives all the rows.
Check this out; it worked for me.
import pandas as pd

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'
df_list = pd.read_html(url, header=0, flavor='bs4')
df = df_list[0]
print(df)
I think the problem here is the index. Try this code for the second method:
html = requests.get(url).content
df_list = pd.read_html(html, header=0)
df = df_list[0]
print(df)
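A minimal sketch of what header=0 changes, using a toy table instead of the live ASAS-SN page (the table contents here are made up):

```python
import pandas as pd
from io import StringIO

# Toy table: the first row holds the column labels in plain <td> cells,
# so pandas cannot infer a header on its own
html = """
<table>
  <tr><td>Name</td><td>Mag</td></tr>
  <tr><td>ASASSN-19a</td><td>15.2</td></tr>
  <tr><td>ASASSN-19b</td><td>16.1</td></tr>
</table>
"""

# Without header=0, every row is treated as data (numeric column labels)
df_default = pd.read_html(StringIO(html))[0]
print(df_default.shape)  # (3, 2)

# With header=0, the first row becomes the column labels
df = pd.read_html(StringIO(html), header=0)[0]
print(df.columns.tolist())  # ['Name', 'Mag']
print(df.shape)  # (2, 2)
```

With header=0 the label row is promoted to the column index instead of being counted as data, which is why the real table stops looking like a single garbled row.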