I am trying to parse some links from this site: https://news.ycombinator.com/
I want to select a specific table, which in the browser I can get with:
document.querySelector("#hnmain > tbody > tr:nth-child(3) > td > table")
I know there are CSS selector limitations in bs4, but the problem is that I can't even select something as simple as #hnmain > tbody
with soup.select('#hnmain > tbody')
as it returns an empty list.
With the code below I'm unable to select tbody, whereas the same selector works with JS in the browser (screenshot):
from bs4 import BeautifulSoup
import requests
print("-"*100)
print("Hackernews parser")
print("-"*100)
url="https://news.ycombinator.com/"
res=requests.get(url)
html=res.content
soup=BeautifulSoup(html)
table=soup.select('#hnmain > tbody')
print(table)
OUT:
soup=BeautifulSoup(html)
[]
Instead of going through the body and the table, why not go directly to the links? Passing the class as an attribute filter to find_all does this:
links=soup.find_all('a',{'class':'storylink'})
If you want the table, you don't need to go through the other elements either; you can go straight to it. Note that select returns a list of every matching table, so pick the one you need from the result.
table = soup.select('table')
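As a small offline sketch of going straight to the links and tables, the snippet below runs the selectors against a simplified, hypothetical stand-in for the page (the SAMPLE markup is an assumption, not the real HTML, and the storylink class is taken from this thread and may no longer match the live site). It shows that a CSS class selector and an attribute filter find the same anchors, and that select('table') returns every table, nested ones included.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page; not the real markup.
SAMPLE = """
<table id="hnmain">
  <tr><td><table class="header"><tr><td>Hacker News</td></tr></table></td></tr>
  <tr><td><table class="itemlist">
    <tr><td><a class="storylink" href="https://example.com/a">Story A</a></td></tr>
    <tr><td><a class="storylink" href="https://example.com/b">Story B</a></td></tr>
    <tr><td><a href="/more">More</a></td></tr>
  </table></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# Two equivalent ways to find anchors that carry the class "storylink".
by_css = soup.select("a.storylink")
by_attrs = soup.find_all("a", {"class": "storylink"})
hrefs = [a["href"] for a in by_css]

# select() returns a list of all matches; select_one() returns the first match or None.
all_tables = soup.select("table")
main_table = soup.select_one("#hnmain")

print(hrefs)
print(len(all_tables), main_table["id"])
```

Selecting by id with select_one avoids having to index into the list when several tables match.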
The tbody tag is not in the HTML that BeautifulSoup or a curl script receives; browsers add it themselves when they build the DOM from a table. That means
soup.select('tbody')
returns an empty list, and it is the same reason your selector returns an empty list.
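The effect can be reproduced offline. This is a minimal sketch on hypothetical markup (RAW is an assumption that only mimics a table served without tbody): the selector copied from the browser finds nothing, while the same path with the tbody step dropped finds the inner table. It uses the standard-library html.parser backend, which does not insert tbody; a browser-like parser such as html5lib would. The nth-child selector should need bs4 4.7 or later.

```python
from bs4 import BeautifulSoup

# Hypothetical raw markup: like the served page, it contains no <tbody>.
RAW = """
<table id="hnmain">
  <tr><td>header</td></tr>
  <tr><td>spacer</td></tr>
  <tr><td><table class="itemlist"><tr><td>stories</td></tr></table></td></tr>
</table>
"""

soup = BeautifulSoup(RAW, "html.parser")

# Selector copied from the browser: matches nothing, because no tbody was parsed.
with_tbody = soup.select("#hnmain > tbody")

# Same path without the tbody step: the rows are direct children of the table.
inner = soup.select_one("#hnmain > tr:nth-child(3) > td > table")

print(with_tbody)
print(inner["class"])
```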
To extract just the links you are looking for, use
soup.select("a.storylink")
This returns the story links from the site.
The data is arranged in groups of three rows, where the third row is an empty row used for spacing. Loop over the top rows and use next_sibling to grab the associated second row each time. Requires bs4 4.7.1+.
from bs4 import BeautifulSoup as bs
import requests
r = requests.get('https://news.ycombinator.com/')
soup = bs(r.content, 'lxml')
top_rows = soup.select('.athing')
for row in top_rows:
    title = row.select_one('.storylink')
    print(title.text)
    print(title['href'])
    print('https://news.ycombinator.com/' + row.select_one('.sitebit a')['href'])
    next_row = row.next_sibling
    print(next_row.select_one('.score').text)
    print(next_row.select_one('.hnuser').text)
    print(next_row.select_one('.age a').text)
    print(next_row.select_one('a:nth-child(6)').text)
    print(100*'-')
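The loop above assumes that every row has a score and a user and that the detail row is the very next node. A more defensive variant is sketched below under stated assumptions: SAMPLE is hypothetical markup that only imitates the three-row grouping, and text_or_none and parse_stories are illustrative helper names, not part of bs4. It uses find_next_sibling("tr") so that whitespace between rows is skipped, and it returns None for fields a row does not have.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the three-row grouping; not the real page.
SAMPLE = """
<table>
  <tr class="athing"><td><a class="storylink" href="https://example.com/a">First story</a></td></tr>
  <tr><td><span class="score">10 points</span> <a class="hnuser">alice</a></td></tr>
  <tr class="spacer"></tr>
  <tr class="athing"><td><a class="storylink" href="https://example.com/b">Job posting</a></td></tr>
  <tr><td><span class="age">1 hour ago</span></td></tr>
  <tr class="spacer"></tr>
</table>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    node = parent.select_one(selector) if parent is not None else None
    return node.get_text(strip=True) if node is not None else None


def parse_stories(html):
    """Pair each title row with the next <tr> and collect the fields that exist."""
    soup = BeautifulSoup(html, "html.parser")
    stories = []
    for row in soup.select("tr.athing"):
        title = row.select_one(".storylink")
        if title is None:
            continue  # skip rows without a title link
        # find_next_sibling skips text nodes, unlike next_sibling.
        detail = row.find_next_sibling("tr")
        stories.append({
            "title": title.get_text(strip=True),
            "href": title.get("href"),
            "score": text_or_none(detail, ".score"),
            "user": text_or_none(detail, ".hnuser"),
        })
    return stories


stories = parse_stories(SAMPLE)
for story in stories:
    print(story)
```

Returning None instead of raising keeps one unusual row, such as a job posting without a score, from stopping the whole loop.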