
beautifulSoup soup.select() returning empty for css selector

I am trying to parse some links from this site: https://news.ycombinator.com/

I want to select a specific table:

document.querySelector("#hnmain > tbody > tr:nth-child(3) > td > table")

I know bs4 has some CSS selector limitations. But the problem is that I can't even select something as simple as #hnmain > tbody with soup.select('#hnmain > tbody'), as it returns an empty list.

With the code below I am unable to select the tbody, whereas with JS in the browser I can (screenshot):

from bs4 import BeautifulSoup
import requests

print("-" * 100)
print("Hackernews parser")
print("-" * 100)

url = "https://news.ycombinator.com/"
res = requests.get(url)
html = res.content
soup = BeautifulSoup(html)
table = soup.select('#hnmain > tbody')
print(table)

OUT:

soup=BeautifulSoup(html)
[]

[screenshot]

Instead of going through the body and the table, why not go directly to the links? I tested this and it worked well:

links = soup.select('a.storylink')

If you want the table, you don't need to walk through the intermediate elements either - you can select it directly:

table = soup.select('table')
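Note that select() always returns a list, even when only one element matches, so you still need to index into it (or use select_one() to get the tag itself). A minimal offline sketch with invented stand-in HTML:

```python
from bs4 import BeautifulSoup

# Invented stand-in HTML; the real page's structure will differ
soup = BeautifulSoup("<table><tr><td>cell</td></tr></table>", "html.parser")

tables = soup.select('table')    # always a list, even for a single match
table = tables[0]                # the Tag itself
same = soup.select_one('table')  # shortcut for the first match (or None)
print(table.td.text)             # cell
```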

I am not getting the HTML tag tbody from BeautifulSoup or from a curl of the page either: browsers insert tbody into the DOM automatically, but it is not present in the HTML the server actually sends. That means

soup.select('tbody')

returns an empty list. This is the same reason you are getting an empty list.
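This is easy to confirm without touching the network: BeautifulSoup parses only what is in the source, so a table written without a tbody never matches a tbody selector. A small sketch with invented HTML (the html5lib parser, by contrast, normalises markup the way a browser does and would insert the tbody):

```python
from bs4 import BeautifulSoup

# Invented fragment: a table with no explicit tbody, as servers often emit
html = "<table id='hnmain'><tr><td>row</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

print(soup.select('#hnmain > tbody > tr'))  # [] -- no tbody in the source
print(soup.select('#hnmain tr'))            # matches: drop tbody from the path
```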

To extract just the links you are looking for, simply do

soup.select("a.storylink")

It will get the links that you want from the site.

The data is arranged in groups of 3 rows, where the third row is an empty row used for spacing. Loop over the top rows and use next_sibling to grab the associated second row at each step. Requires bs4 4.7.1+ (for modern CSS selector support).

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://news.ycombinator.com/')
soup = bs(r.content, 'lxml')
top_rows = soup.select('.athing')  # first row of each story group

for row in top_rows:
    title = row.select_one('.storylink')
    print(title.text)
    print(title['href'])
    # relative "from" link for the story's domain
    print('https://news.ycombinator.com/' + row.select_one('.sitebit a')['href'])
    next_row = row.next_sibling  # second row: score, user, age, comments
    print(next_row.select_one('.score').text)
    print(next_row.select_one('.hnuser').text)
    print(next_row.select_one('.age a').text)
    print(next_row.select_one('a:nth-child(6)').text)  # comments link
    print(100 * '-')
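The row-pairing logic can be sanity-checked offline on a small stand-in table (the class names mirror HN's, but the HTML and values here are invented). find_next_sibling('tr') is used instead of next_sibling so that stray whitespace text nodes between rows cannot break the pairing:

```python
from bs4 import BeautifulSoup

# Invented two-story table imitating HN's .athing / detail-row layout
html = """
<table>
  <tr class="athing"><td><a class="storylink" href="http://a.example">Story A</a></td></tr>
  <tr><td><span class="score">10 points</span></td></tr>
  <tr class="athing"><td><a class="storylink" href="http://b.example">Story B</a></td></tr>
  <tr><td><span class="score">20 points</span></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
for row in soup.select('tr.athing'):
    title = row.select_one('a.storylink')
    detail = row.find_next_sibling('tr')  # the score row just below
    pairs.append((title.text, detail.select_one('.score').text))

print(pairs)  # [('Story A', '10 points'), ('Story B', '20 points')]
```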
