
BeautifulSoup soup.select() returning an empty list for a CSS selector

I am trying to parse some links from this site https://news.ycombinator.com/

I want to select a specific table

document.querySelector("#hnmain > tbody > tr:nth-child(3) > td > table")

I know there are CSS selector limitations in bs4, but the problem is that I can't even select something as simple as #hnmain > tbody: soup.select('#hnmain > tbody') returns an empty list.

With the code below I'm unable to select tbody, whereas the same selector works with JavaScript in the browser (screenshot).

from bs4 import BeautifulSoup
import requests
print("-"*100)
print("Hackernews parser")
print("-"*100)
url="https://news.ycombinator.com/"
res=requests.get(url)
html=res.content
soup=BeautifulSoup(html)
table=soup.select('#hnmain > tbody')
print(table)

OUT:

soup=BeautifulSoup(html)
[]

[screenshot]

Instead of going through the body and table, why not go directly to the links? I tested this and it worked well:

links = soup.find_all('a', {'class': 'storylink'})

If you want the table, you don't need to go through the other elements either; you can go straight to it. Note that select() returns a list of every matching element, so index into the result to pick the table you need.

table = soup.select('table')

The tbody tag is not present in the HTML that BeautifulSoup or curl receives; browsers insert it themselves when they build the DOM for a table. That means

soup.select('tbody')

returns an empty list, and that is the same reason your selector returns an empty list.
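To make the point above concrete, here is a minimal sketch that runs offline. The markup is a hypothetical, heavily simplified stand-in for the page source (not the real Hacker News HTML), and it assumes the built-in html.parser, which does not add a tbody the way a browser does.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: a table written without <tbody>,
# as it would appear in the raw HTML sent by the server.
RAW_HTML = """
<table id="hnmain">
  <tr><td>header</td></tr>
  <tr><td>spacer</td></tr>
  <tr><td><table class="inner"><tr><td>stories</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# The browser-style selector fails because no <tbody> exists in the parsed tree.
with_tbody = soup.select("#hnmain > tbody")

# Dropping the tbody step matches the nested table.
without_tbody = soup.select("#hnmain > tr:nth-child(3) > td > table")

print(with_tbody)
print(len(without_tbody))
```

A selector copied from the browser's developer tools therefore usually needs its tbody steps removed before it is passed to soup.select().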

To extract just the links you are looking for, use

soup.select("a.storylink")

It will get the links that you want from the site.
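As a rough illustration of what that selector returns, the sketch below runs it against a small hypothetical snippet (the URLs and link text are made up) and collects each link's text and href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two story links plus one unrelated link.
HTML = """
<a class="storylink" href="https://example.com/a">Story A</a>
<a class="storylink" href="https://example.com/b">Story B</a>
<a class="hnuser" href="user?id=alice">alice</a>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "a.storylink" matches only <a> tags that carry the storylink class.
stories = [(a.get_text(), a["href"]) for a in soup.select("a.storylink")]
print(stories)
```

Each item in the result is a Tag, so attributes such as href are read with dictionary-style access.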

The data is arranged in groups of three rows, where the third row is an empty row used for spacing. Loop over the top rows and use next_sibling to grab the associated second row for each one. Requires bs4 4.7.1+.

from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://news.ycombinator.com/')
soup = bs(r.content, 'lxml')
# Each story starts with a row that has the class "athing"
top_rows = soup.select('.athing')

for row in top_rows:
    # First row of the group: title, link, and source site
    title = row.select_one('.storylink')
    print(title.text)
    print(title['href'])
    print('https://news.ycombinator.com/' + row.select_one('.sitebit a')['href'])
    # Second row of the group: score, user, age, and comments link
    next_row = row.next_sibling
    print(next_row.select_one('.score').text)
    print(next_row.select_one('.hnuser').text)
    print(next_row.select_one('.age a').text)
    print(next_row.select_one('a:nth-child(6)').text)
    print(100*'-')
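The loop above assumes every row has every element. A more defensive variant might look like the sketch below, which is only an illustration: the markup is hypothetical, text_or_none is a made-up helper, and find_next_sibling('tr') is swapped in for next_sibling so that whitespace between rows is skipped.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two groups of three rows; the second has no score or user.
HTML = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
  <tr><td><span class="score">10 points</span> <a class="hnuser" href="user?id=alice">alice</a></td></tr>
  <tr class="spacer"></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/job">Job post</a></td></tr>
  <tr><td>no score here</td></tr>
  <tr class="spacer"></tr>
</table>
"""


def text_or_none(parent, selector):
    """Hypothetical helper: stripped text of the first match, or None."""
    if parent is None:
        return None
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


soup = BeautifulSoup(HTML, "html.parser")
stories = []
for row in soup.select("tr.athing"):
    # find_next_sibling skips text nodes and returns the next <tr> (or None).
    detail = row.find_next_sibling("tr")
    stories.append({
        "title": text_or_none(row, ".storylink"),
        "score": text_or_none(detail, ".score"),
        "user": text_or_none(detail, ".hnuser"),
    })

print(stories)
```

Missing fields come back as None instead of raising an error on a missing element.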

