Scraping an HTML table in R using XML

Question

I am trying to scrape text values from a website. I have been able to parse the URL. I am new to XPath in R, so I am not sure how to pull out all the text values inside tags like this one:

<p class="MsoNormal" align="justify"> text </p>

How do I specify the path to that specific tag and get its text value? This is what I am trying right now:

pizzaraw<-xpathSApply(pizzadoc, "//p[@class='MsoNormal']", xmlValue)

Is this the right approach? R does not seem to respond to the code.

Answer 1

It's difficult to know what is wrong given that your example is not self-contained, but here is a self-contained one that works:

Lines <- '<html>
<p class="MsoNormal" align="justify"> text </p>
</html>
'

library(XML)
root <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
doc <- xmlRoot(root)
xpathSApply(doc, '//p[@class="MsoNormal"]', xmlValue, trim = TRUE)
## [1] "text"
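
The same pattern should carry over to a page with several paragraphs. The following is only a sketch under assumptions: the sample HTML and the names html, doc2 and values are made up for illustration, and htmlParse() is used as a shorthand for htmlTreeParse() with useInternalNodes = TRUE. It selects on both the class and align attributes and collects the text of every match:

```r
library(XML)

# Hypothetical sample page: three paragraphs, two of which carry class "MsoNormal".
html <- '<html><body>
<p class="MsoNormal" align="justify"> first </p>
<p class="Other"> skipped </p>
<p class="MsoNormal" align="justify"> second </p>
</body></html>'

# htmlParse() returns an internal document, so XPath can be run on it directly.
doc2 <- htmlParse(html, asText = TRUE)

# Match on both attributes; xmlValue extracts the text of each matching node,
# and trim = TRUE strips the surrounding whitespace.
values <- xpathSApply(doc2, '//p[@class="MsoNormal" and @align="justify"]',
                      xmlValue, trim = TRUE)
print(values)
```

With this input, values should hold the two strings "first" and "second", one per matching paragraph. If the real page nests the paragraphs inside other elements, the leading // still finds them at any depth.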

Answer 1 was accepted on 2014-04-17 20:58:00.
