
How to scrape text from this webpage?

I'm trying to scrape this HTML title

<h2 id="p89" data-pid="89"><span id="page77" class="pageNum" data-no="77" data-before-text="77"></span>Tuesday, July&nbsp;30</h2>

from this website: https://wol.jw.org/en/wol/h/r1/lp-e

My code:

from bs4 import BeautifulSoup
import requests

url = requests.get('https://wol.jw.org/en/wol/h/r1/lp-e').text

soup = BeautifulSoup(url, 'lxml')

textodiario = soup.find('header')

dia = textodiario.h2.text
print(dia)

It should return today's date, but instead it returns a past day: Wednesday, July 24

At the moment I don't have a PC to test on, so please double-check for possible errors.

You also need the chromedriver for your platform; put it in the same folder as the script.

My idea would be to use selenium to get the HTML and then parse it:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://wol.jw.org/en/wol/h/r1/lp-e"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)  # 'chrome_options' is deprecated in newer Selenium
driver.get(url)
time.sleep(3)  # crude wait for the asynchronously loaded content
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, 'html.parser')
textodiario = soup.find('header')
dia = textodiario.h2.text
print(dia)

The data is getting loaded asynchronously and the contents of the div are being changed. What you need is a Selenium web driver to work alongside bs4.
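Since the header is filled in only after the initial page load, an explicit wait is more reliable than a fixed sleep. Here is a minimal sketch of that idea (my own variation, assuming the daily-text heading sits in header h2 once rendering finishes):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://wol.jw.org/en/wol/h/r1/lp-e')

# Wait up to 10 seconds until the daily-text heading is actually in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'header h2'))
)

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
print(soup.find('header').h2.text)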

I actually tried your code, and there's definitely something wrong with how the website/the code is grabbing data. When I pipe the entire page text through grep for "July", it gives:

Wednesday, July 24
<h2 id="p71" data-pid="71"><span id="page75" class="pageNum" data-no="75" data-before-text="75"></span>Wednesday, July 24</h2>
<h2 id="p74" data-pid="74">Thursday, July 25</h2>
<h2 id="p77" data-pid="77">Friday, July 26</h2>

If I had to take a guess, the fact that they keep multiple dates under h2 probably doesn't help, but I have almost zero experience in web scraping. And if you notice, July 30th isn't even in there, meaning that somewhere along the line your data is getting weird (as LazyCoder points out). You can confirm this without grep by listing every h2 the static HTML contains, as in the sketch below.
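A quick check (my own sketch, not from the original answer) that prints which dates the server sends before any JavaScript runs:

import requests
from bs4 import BeautifulSoup

html = requests.get('https://wol.jw.org/en/wol/h/r1/lp-e').text
soup = BeautifulSoup(html, 'html.parser')

# Print every h2 heading present in the raw HTML response
for h2 in soup.find_all('h2'):
    print(h2.get_text(strip=True))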

Hope that Selenium fixes your issue.

Go to the Network tab in your browser's developer tools and you will find the link the page actually requests:

https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30

Here is the code:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
session = requests.Session()
# The daily text for a given date is served as JSON by this endpoint
response = session.get('https://wol.jw.org/wol/dt/r1/lp-e/2019/7/30', headers=headers)
result = response.json()
# The entry's HTML lives in the 'content' field of the first item
data = result['items'][0]['content']
soup = BeautifulSoup(data, 'html.parser')
print(soup.select_one('h2').text)

Output:

Tuesday, July 30
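As a follow-up sketch (my own addition, not part of the answer): the endpoint appears to encode the date as .../year/month/day, so, assuming that pattern holds, you could build the URL for the current day instead of hardcoding it:

import requests
from datetime import date
from bs4 import BeautifulSoup

headers = {'User-Agent':
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}

today = date.today()
# Assumed URL pattern: .../wol/dt/r1/lp-e/<year>/<month>/<day>
url = f'https://wol.jw.org/wol/dt/r1/lp-e/{today.year}/{today.month}/{today.day}'

result = requests.get(url, headers=headers).json()
soup = BeautifulSoup(result['items'][0]['content'], 'html.parser')
print(soup.select_one('h2').text)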
