How to crawl a website within 2 layers in Python using XPath
So I am trying to crawl this website; an example URL is https://www.rfa.org/cantonese/news/SARS-12312019075620.html
I am just trying to get the text. However, you can see that some of the text is under <p> tags and some of it sits directly between <br> tags. I do not want to get the text descriptions of the pictures, so I cannot simply crawl everything.
This is what I have so far, which only gets the text under <p>:
//*//div[@id="storytext"]/p/text()
But how can I get all the text without the picture descriptions and other unnecessary information? So there are 2 layers: the first is <p>, and the other is the text between <br> tags. The descriptions of the pictures are always 3 layers deep.
Assuming you're using lxml, you should write a more specific XPath (using axes):
from lxml import html
import requests

page = requests.get('https://www.rfa.org/cantonese/news/SARS-12312019075620.html')
tree = html.fromstring(page.content)

# Keep non-blank text nodes that are either direct children of the story
# <div> or anywhere inside a <p>; this skips the nested picture captions.
news = tree.xpath('//div[@id="storytext"]//text()[normalize-space()][parent::div[@id="storytext"] or ancestor::p]')
print(news)
Output:
Use a regex for the final cleaning.
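A minimal sketch of that cleanup step (the raw list below is invented for illustration; the real XPath output will differ):

```python
import re

# Invented example of what the XPath might return: fragments with stray
# newlines and runs of whitespace.
raw = ['\n  First sentence. ', 'second \n fragment,', ' third.\n']

# Collapse every whitespace run to a single space and strip each fragment.
cleaned = [re.sub(r'\s+', ' ', t).strip() for t in raw]
print(' '.join(cleaned))
# 'First sentence. second fragment, third.'
```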
URLs used to test the XPath (it should work for all types of pages: news, features, talkshow, etc.):
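To see what the two predicates in that XPath do, here is the same expression run against a small made-up document (the markup is an assumption for illustration, not the real page). Note that ancestor::p also catches text inside tags nested in a <p>, such as <b>:

```python
from lxml import html

# Hypothetical markup: text under a <p> (including inside a nested <b>),
# a loose line between <br> tags, and a caption nested in extra <div>s.
doc = html.fromstring("""
<div id="storytext">
  <p>Paragraph <b>with bold</b> text.</p>
  Line between br tags.<br/>
  <div class="photo"><span>Caption to skip</span></div>
</div>
""")

# [normalize-space()] drops whitespace-only nodes; the second predicate
# keeps text that is a direct child of the story <div> or inside a <p>.
kept = doc.xpath('//div[@id="storytext"]//text()'
                 '[normalize-space()]'
                 '[parent::div[@id="storytext"] or ancestor::p]')
print([t.strip() for t in kept])
# ['Paragraph', 'with bold', 'text.', 'Line between br tags.']
```

The caption is dropped because its parent is a <span> inside an extra <div>, so neither predicate branch matches it.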