
Scraping a webpage using BeautifulSoup4

I am trying to print the content of a news article using BeautifulSoup4.

The URL is: Link

The current code I have is as follows, and it gives the desired output:

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.thehindu.com/news/national/People-showing-monumental-patience-queuing-up-for-a-better-India-says-Venkaiah/article16447029.ece')
soup = BeautifulSoup(page.content, 'html.parser')


article_text = ""
table = soup.find_all("div",{ "id": "content-body-14266949-16447029"})                              

for element in table:
    article_text += ''.join(element.find_all(text = True)) + "\n\n"

print(article_text)

However, the problem is that I want to scrape multiple pages, and each of them has a different content body number in the format xxxxxxxx-xxxxxxxx (two blocks of 8 digits).

I tried replacing the soup.find_all command with a regex:

table = soup.find_all(text=re.compile("content-body-........-........"))

but this gives an error:

AttributeError: 'NavigableString' object has no attribute 'find_all'

Can anybody guide me on what needs to be done?

Thank you.

You can extract the content using lxml. The lxml library lets you use XPath to extract content from the HTML:

from lxml import etree

# pageText is the raw HTML of the page, e.g. requests.get(url).text
selector = etree.HTML(pageText)
article_text = selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].text
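One caveat: in lxml, .text only returns the text that appears before the element's first child tag. If the div contains nested markup, itertext() collects all of the nested text instead, for example:

article_text = ''.join(selector.xpath('//div[@class="article-block-multiple live-snippet"]/div[1]')[0].itertext())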

I don't use BeautifulSoup, but I think you can use it like this:

table = soup.find_all("div",{ "class": "article-block-multiple live-snippet"]"})

then find the first child div element, as in the sketch below.
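A minimal sketch of that idea (assuming the page actually uses this class combination):

# take the first matched container, then its first inner div
table = soup.find_all("div", {"class": "article-block-multiple live-snippet"})
if table:
    first_div = table[0].find("div")  # first descendant div
    print(first_div.get_text())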

Regular expressions should be fine! Note that text=... matches text nodes and returns NavigableString objects, which have no find_all method; that is where your error comes from. Match the id attribute instead. Try:

table = soup.find_all("div",{ "id": re.compile('content-body-*')})

Another approach may be to use a CSS selector. Selectors are clean and to the point. You might give it a try as well. Just swap in your own link for url.

import requests
from bs4 import BeautifulSoup

res = requests.get(url).text
soup = BeautifulSoup(res, "html.parser")

# id^= selects any div whose id starts with "content-body-"
for item in soup.select("div[id^=content-body-] p"):
    print(item.text)
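To rebuild the single article_text string from the question rather than printing each paragraph, you could join them, for example:

article_text = "\n\n".join(item.text for item in soup.select("div[id^=content-body-] p"))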
