XPath to extract article text from a webpage
I am going to scrape the articles on this website.
This is what I have done so far:
# HR Version
# the entire crawling process (Python 2 syntax)
import csv
import requests
from lxml import html

openfile = open("data/HR.csv", "rb")
reader = csv.reader(openfile)
HR_data = []
for row in reader:
    url = row[0]
    print url  # to know the status of web crawling
    response = requests.get(url)
    data = html.fromstring(response.text)
    # Inspect line with text
    # //*[@id="article-details"]
    # <section class="entry-content clearfix" itemprop="articleBody"></section>
    # single quotes outside, so the double quotes inside the XPath are legal
    texts = data.xpath('//*[@id="article-details"]/p/text()')
    raw = ''.join(t.encode("utf-8") for t in texts)
    finaldata = raw.replace('\r', '').replace('\n', '').replace('\t', '')
    HR_data.append([finaldata])
openfile.close()
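One side note on the cleanup step in the listing above: removing `\r`, `\n` and `\t` outright glues words together wherever a line break was the only separator. A minimal sketch of an alternative, assuming it is acceptable to collapse each run of whitespace to a single space; `normalize_whitespace` is a hypothetical helper name, not part of the original script:

```python
def normalize_whitespace(fragments):
    """Join text fragments and collapse every whitespace run to one space.

    Hypothetical helper for illustration only.
    """
    joined = "".join(fragments)
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, \r, \n, \t) and drops leading and trailing whitespace.
    return " ".join(joined.split())


cleaned = normalize_whitespace(["First\r\nline\t", "continues"])
```

The result could replace the `raw` / `finaldata` pair, if single-space separation suits the data.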
The statement that is giving me trouble is this one:
texts = data.xpath('//*[@id="article-details"]/p/text()')
It is taken from the following specific page: http://hrmagazine.co.uk/article-details/internal-entrepreneurship-can-boost-your-business
Using Inspect Element in Firefox, I found that the text is located in the following section:
<article id="article-details">
#One <h2> element, followed by multiple <p> elements.
</article>
What is the correct XPath to extract only the paragraph text from the article?
You almost wrote the correct XPath. You need to replace p with h2:
texts = data.xpath('//*[@id="article-details"]/h2/text()')
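To check what each expression actually selects, it can help to run it against a small fragment first. The sketch below uses a made-up document that only mirrors the structure described in the question (one `<h2>` followed by `<p>` elements inside `<article id="article-details">`); the real page may be laid out differently, and the sample text and the link are assumptions.

```python
from lxml import html

# Made-up fragment mirroring the structure described in the question.
SAMPLE = """
<html><body>
<article id="article-details">
  <h2>Sample heading</h2>
  <p>First paragraph.</p>
  <p>Second paragraph with <a href="#">a link</a> inside.</p>
</article>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Text nodes that are direct children of the <h2>.
heading = tree.xpath('//*[@id="article-details"]/h2/text()')

# Text nodes that are direct children of each <p>.
direct = tree.xpath('//*[@id="article-details"]/p/text()')

# Full text of each <p>, including text inside nested elements.
paragraphs = [
    p.text_content()
    for p in tree.xpath('//*[@id="article-details"]/p')
]
```

On this sample, `/h2/text()` yields only the heading, `/p/text()` yields the text nodes that are direct children of each `<p>` (so the text inside the nested `<a>` is skipped), and `text_content()` returns the full text of each paragraph.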