
How to scrape text from this webpage?

I'm trying to scrape this HTML title

<h2 id="p89" data-pid="89"><span id="page77" class="pageNum" data-no="77" data-before-text="77"></span>Tuesday, July&nbsp;30</h2>

from this website: https://wol.jw.org/en/wol/h/r1/lp-e

My code:

from bs4 import BeautifulSoup
import requests

url = requests.get('https://wol.jw.org/en/wol/h/r1/lp-e').text

soup = BeautifulSoup(url, 'lxml')

textodiario = soup.find('header')

dia = textodiario.h2.text
print(dia)

It should return today's date, but it returns a past date instead: Wednesday, July 24

I don't have a PC to test this on at the moment, so please double-check it for errors.

You also need the chromedriver build for your platform; put it in the same folder as the script.

My idea would be to use selenium to get the HTML and then parse it:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://wol.jw.org/en/wol/h/r1/lp-e"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)  # the chrome_options keyword is deprecated in favor of options
driver.get(url)
time.sleep(3)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
textodiario = soup.find('header')
dia = textodiario.h2.text
print(dia)

The data is loaded asynchronously, so the contents of the div change after the initial page load. You need a Selenium WebDriver working alongside bs4.

I actually tried your code, and there's definitely something wrong with how the website/the code is grabbing data. Because when I pipe the entirety of the URL text to a grep with July, it gives:

Wednesday, July 24
<h2 id="p71" data-pid="71"><span id="page75" class="pageNum" data-no="75" data-before-text="75"></span>Wednesday, July 24</h2>
<h2 id="p74" data-pid="74">Thursday, July 25</h2>
<h2 id="p77" data-pid="77">Friday, July 26</h2>

If I had to take a guess, the fact that they're keeping multiple dates under h2 probably doesn't help, but I have almost zero experience in web scraping. And if you notice, July 30th isn't even in there, meaning that somewhere along the line your data is getting weird (as LazyCoder points out).
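The grep check above can be reproduced in Python as a quick diagnostic. This is only a sketch: it uses a regular expression rather than a real HTML parser, the `SAMPLE` string is just the three h2 lines from the grep output above, and `headings` is a hypothetical helper name.

```python
import html
import re

# The three <h2> lines from the grep output, standing in for the downloaded page.
SAMPLE = """<h2 id="p71" data-pid="71"><span id="page75" class="pageNum" data-no="75" data-before-text="75"></span>Wednesday, July 24</h2>
<h2 id="p74" data-pid="74">Thursday, July 25</h2>
<h2 id="p77" data-pid="77">Friday, July 26</h2>"""


def headings(page_html):
    """Return the visible text of every <h2> in page_html (hypothetical helper)."""
    # Capture everything between each <h2 ...> and its closing </h2>.
    raw = re.findall(r"<h2\b[^>]*>(.*?)</h2>", page_html, flags=re.S)
    cleaned = []
    for fragment in raw:
        text = re.sub(r"<[^>]+>", "", fragment)         # drop nested tags such as <span>
        text = html.unescape(text).replace("\xa0", " ")  # turn &nbsp; into a plain space
        cleaned.append(text.strip())
    return cleaned


found = headings(SAMPLE)
print(found)
print("Tuesday, July 30" in found)  # False: the static HTML never contains today's entry
```

Running it on the full text returned by requests.get would list every date the server actually sent, which makes it easy to confirm that the wanted day is simply absent from the static page.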

Hope that Selenium fixes your issue.

Open the Network tab in your browser's developer tools and you will find the URL the page requests for the daily text.

https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30

Here is the code.

import requests
from bs4 import BeautifulSoup
headers = {'User-Agent':
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
session = requests.Session()
response = session.get('https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30',headers=headers)
result=response.json()
data=result['items'][0]['content']
soup=BeautifulSoup(data,'html.parser')
print(soup.select_one('h2').text)

Output:

Tuesday, July 30
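The URL above hardcodes the date. If the endpoint follows the same year/month/day pattern for other days (an assumption based only on this one URL), the address can be built from the current date. `daily_text_url` and `BASE` are hypothetical names used for illustration.

```python
from datetime import date

# Prefix taken from the URL found in the Network tab.
BASE = "https://wol.jw.org/wol/dt/r1/lp-e"


def daily_text_url(day=None):
    """Build the endpoint URL for a given date, defaulting to today (hypothetical helper)."""
    day = day or date.today()
    # The observed URL uses unpadded month and day numbers (2019/7/30).
    return f"{BASE}/{day.year}/{day.month}/{day.day}"


print(daily_text_url(date(2019, 7, 30)))
# https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30
```

Passing daily_text_url() to session.get in place of the hardcoded string should then fetch the current day's entry, provided the assumption about the URL pattern holds.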
