
beautifulSoup soup.select() returning empty for css selector

I am trying to parse some links from this site: https://news.ycombinator.com/

I want to select a specific table:

document.querySelector("#hnmain > tbody > tr:nth-child(3) > td > table")

I know bs4 has some CSS selector limitations. But the problem is that I can't even select something as simple as #hnmain > tbody with soup.select('#hnmain > tbody'), as it returns an empty list.

With the code below I am unable to select the tbody, whereas with JS in the browser I can (screenshot):

from bs4 import BeautifulSoup
import requests

print("-" * 100)
print("Hackernews parser")
print("-" * 100)

url = "https://news.ycombinator.com/"
res = requests.get(url)
html = res.content
soup = BeautifulSoup(html)
table = soup.select('#hnmain > tbody')
print(table)

OUT:

soup=BeautifulSoup(html)
[]

[screenshot]

Instead of going through the body and the table, why not go directly to the links? I tested this and it worked well:

links = soup.select('a.storylink')

If you want the table, you don't need to walk through the intermediate elements either - you can select it directly:

table = soup.select('table')
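Note that select() always returns a list, even when only one element matches, so you still need to index into it (or use select_one() to get the tag itself). A minimal offline sketch with invented stand-in HTML:

```python
from bs4 import BeautifulSoup

# Invented stand-in HTML; the real page's structure will differ
soup = BeautifulSoup("<table><tr><td>cell</td></tr></table>", "html.parser")

tables = soup.select('table')    # always a list, even for a single match
table = tables[0]                # the Tag itself
same = soup.select_one('table')  # shortcut for the first match (or None)
print(table.td.text)             # cell
```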

I am not getting the HTML tag tbody from BeautifulSoup or from a curl of the page either: browsers insert tbody into the DOM automatically, but it is not present in the HTML the server actually sends. That means

soup.select('tbody')

returns an empty list. This is the same reason you are getting an empty list.
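This is easy to confirm without touching the network: BeautifulSoup parses only what is in the source, so a table written without a tbody never matches a tbody selector. A small sketch with invented HTML (the html5lib parser, by contrast, normalises markup the way a browser does and would insert the tbody):

```python
from bs4 import BeautifulSoup

# Invented fragment: a table with no explicit tbody, as servers often emit
html = "<table id='hnmain'><tr><td>row</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

print(soup.select('#hnmain > tbody > tr'))  # [] -- no tbody in the source
print(soup.select('#hnmain tr'))            # matches: drop tbody from the path
```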

To extract just the links you are looking for, simply do

soup.select("a.storylink")

It will get the links that you want from the site.

The data is arranged in groups of 3 rows, where the third row is an empty row used for spacing. Loop over the top rows and use next_sibling to grab the associated second row at each step. Requires bs4 4.7.1+ (for modern CSS selector support).

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://news.ycombinator.com/')
soup = bs(r.content, 'lxml')
top_rows = soup.select('.athing')  # first row of each story group

for row in top_rows:
    title = row.select_one('.storylink')
    print(title.text)
    print(title['href'])
    # relative "from" link for the story's domain
    print('https://news.ycombinator.com/' + row.select_one('.sitebit a')['href'])
    next_row = row.next_sibling  # second row: score, user, age, comments
    print(next_row.select_one('.score').text)
    print(next_row.select_one('.hnuser').text)
    print(next_row.select_one('.age a').text)
    print(next_row.select_one('a:nth-child(6)').text)  # comments link
    print(100 * '-')
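The row-pairing logic can be sanity-checked offline on a small stand-in table (the class names mirror HN's, but the HTML and values here are invented). find_next_sibling('tr') is used instead of next_sibling so that stray whitespace text nodes between rows cannot break the pairing:

```python
from bs4 import BeautifulSoup

# Invented two-story table imitating HN's .athing / detail-row layout
html = """
<table>
  <tr class="athing"><td><a class="storylink" href="http://a.example">Story A</a></td></tr>
  <tr><td><span class="score">10 points</span></td></tr>
  <tr class="athing"><td><a class="storylink" href="http://b.example">Story B</a></td></tr>
  <tr><td><span class="score">20 points</span></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

pairs = []
for row in soup.select('tr.athing'):
    title = row.select_one('a.storylink')
    detail = row.find_next_sibling('tr')  # the score row just below
    pairs.append((title.text, detail.select_one('.score').text))

print(pairs)  # [('Story A', '10 points'), ('Story B', '20 points')]
```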
