
Why do only the first two rows of an HTML table from a web page get read?

I'm trying to scrape data from an HTML table on a web page. I've tried a few different methods based on answers posted here, but I always run into the same problem: the result is roughly what I expect, but only for the first two rows of the table. I have little experience with HTML and Beautiful Soup, but looking at the HTML source of the table at that URL I can't see any difference between the first two rows and the rest of the table. Could anyone help me figure out what I'm doing wrong?

import numpy
import pandas as pd
import urllib.request
import requests
from bs4 import BeautifulSoup

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'

# First method
source = urllib.request.urlopen(url).read()
soup = BeautifulSoup(source,'lxml')
table = soup.find('table')
table_rows = table.find_all('tr')

for tr in table_rows:
    print(tr)

>>>(prints html text for first two rows)


# Second method
html = requests.get(url).content
df_list = pd.read_html(html)
df = df_list[-1]
print (df)

>>> ASAS-SN Other ATEL  RA  Dec  ...   SDSS    DSS Vizier Spectroscopic Class Comments
0      ID   IDs  TNS NaN  NaN  ...  image  image   data                 NaN      NaN

[1 rows x 12 columns]

Sometimes people create incorrect HTML, but web browsers may still display it correctly because they don't strictly enforce all the rules.

BeautifulSoup can use three different parsers - lxml, html.parser and html5lib - and they may parse broken HTML in different ways.

soup = BeautifulSoup(source, 'lxml')
soup = BeautifulSoup(source, 'html.parser')
soup = BeautifulSoup(source, 'html5lib')

It seems lxml skips the malformed elements and you get only two rows, but html5lib can parse the incorrect elements and gives all the rows.

soup = BeautifulSoup(source, 'html5lib')
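
As a quick check, a minimal sketch along these lines (assuming lxml and html5lib are installed alongside BeautifulSoup, and that the page is still reachable) fetches the page once and prints how many rows each parser finds in the first table:

import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'
source = urllib.request.urlopen(url).read()

# Parse the same HTML with each parser and count the <tr> elements it recovers.
for parser in ('lxml', 'html.parser', 'html5lib'):
    soup = BeautifulSoup(source, parser)
    table = soup.find('table')
    rows = table.find_all('tr') if table else []
    print(parser, len(rows))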

BTW:

In the BeautifulSoup documentation, in the section on installing a parser, you can see a table comparing these parsers (advantages/disadvantages).

People may use lxml because it uses C code under the hood, so it can work faster.


BTW:

pandas can read an HTML table directly from a URL:

read_html(url)

In the documentation you can see that it uses lxml as the default parser, but you may change it to bs4+html5lib:

read_html(html, flavor="bs4")

read_html(url, flavor="bs4")

or

read_html(html, flavor="html5lib")

read_html(url, flavor="html5lib")

I tested it and flavor="html5lib" gives all the rows.
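
Put together, a short sketch of this approach (assuming pandas, bs4 and html5lib are installed, and that the page is still reachable) might look like this:

import pandas as pd

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'

# Parse the page with the html5lib-backed flavor and use the first
# table row as the header.
df_list = pd.read_html(url, header=0, flavor='html5lib')
df = df_list[0]
print(df.shape)  # should report the full table, not just two rows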

Check this out, it worked for me.

import pandas as pd

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'
# flavor='bs4' tells read_html to parse the page with BeautifulSoup
df_list = pd.read_html(url, header=0, flavor='bs4')
df = df_list[0]
print(df)


I think the problem here is the index. Try this code for the second method:

import requests
import pandas as pd

url = 'http://www.astronomy.ohio-state.edu/asassn/transients.html'
html = requests.get(url).content
df_list = pd.read_html(html, header=0)
df = df_list[0]
print(df)
