
Scraping a webpage using BeautifulSoup4

I am trying to print the content of a news article using BeautifulSoup4.

The URL is: Link

The current code I have is as follows, and it gives the desired output:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece')
soup = BeautifulSoup(page.content, 'html.parser')


article_text = ""
table = soup.find_all("div",{ "id": "content-body-14266949-16447029"})                              

for element in table:
    article_text += ''.join(element.find_all(text = True)) + "\n\n"

print(article_text)

However, the problem is that I want to scrape multiple pages, and each of them has a different content body number in the format xxxxxxxx-xxxxxxxx (two blocks of 8 digits).

I tried replacing the soup.find_all command with a regex:

table = soup.find_all(text=re.compile("content-body-........-........"))

but this gives an error:

AttributeError: 'NavigableString' object has no attribute 'find_all'

Can anybody guide me on what needs to be done?

Thank you.

You can extract the content using lxml. The lxml library lets you use XPath to extract content from the HTML:

from lxml import etree

# pageText is the raw HTML of the page, e.g. requests.get(url).text
selector = etree.HTML(pageText)
article_text = selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].text
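One caveat: in lxml, .text only returns the text that appears before the element's first child tag. If the div contains nested markup, itertext() collects all of the nested text instead, for example:

article_text = ''.join(selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].itertext())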

I don't use BeautifulSoup, but I think you can use it like this:

table = soup.find_all("div",{ "class": "article-block-multiple live-snippet"]"})

then find the first child div element, as in the sketch below.
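A minimal sketch of that idea (assuming the page actually uses this class combination):

# take the first matched container, then its first inner div
table = soup.find_all("div", {"class": "article-block-multiple live-snippet"})
if table:
    first_div = table[0].find("div")  # first descendant div
    print(first_div.get_text())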

Regular expressions should be fine! Note that text=... matches text nodes and returns NavigableString objects, which have no find_all method; that is where your error comes from. Match the id attribute instead. Try:

table = soup.find_all("div",{ "id": re.compile('content-body-*')})

Another approach may be to use a CSS selector. Selectors are clean and to the point. You might give it a try as well. Just swap in your own link for url.

import requests
from bs4 import BeautifulSoup

res = requests.get(url).text
soup = BeautifulSoup(res, "html.parser")

# id^= selects any div whose id starts with "content-body-"
for item in soup.select("div[id^=content-body-] p"):
    print(item.text)
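To rebuild the single article_text string from the question rather than printing each paragraph, you could join them, for example:

article_text = "\n\n".join(item.text for item in soup.select("div[id^=content-body-] p"))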
