
How to extract text from HTML using XPath

url <- "http://www.officedepot.com/a/browse/technology/N=5+9021/;jsessionid=00000a2ZDz-8D4MKY5wMPuithDX:17h4h7bfo"

library(RCurl)
library(XML)
html <- getURL(url)

trim <- function (x) gsub("^\\s+|\\s+$", "", x)
docs <- htmlParse(html, asText=TRUE)
data <-xpathApply(docs, "//*[not(self::script)]/text()",xmlValue)
data <- trim(gsub('\t|\n',' ',unlist(data)))
data <- data[data!='']
head(data)

The code above successfully extracts all the text from any URL, but along with the text I am also getting the contents of style tags.

For example, see the style tag below:

<style>
    .dat_wrapper {
      visibility: hidden;
    }
    .cke_widget_element .dat_wrapper {
      visibility: visible;
    }

The XPath expression above also extracts the text inside this tag; see the output of data[2]:

> data[2]

[1] ".dat_wrapper {visibility: hidden;} .cke_widget_element .dat_wrapper {visibility: visible;}"

I do not want this data. Can anybody help me get rid of it?

I assume you want to extract all the information in the "Technologies" section, with a detailed description of each product?

If so, the solution is straightforward: first parse the URLs, then extract the content. As it stands, your code and question don't make sense.
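
On the narrower issue raised in the question: the XPath there only skips text whose parent is a script element, so text inside style elements still comes through. One possible fix, offered as a minimal sketch rather than a definitive answer, is to exclude both element types in the predicate. The inline HTML string below is a hypothetical stand-in for the downloaded page, so the example runs without network access; everything else follows the code in the question.

```r
library(XML)

# Hypothetical stand-in for the page content (assumption: not the real page)
html <- '<html><head>
<style>.dat_wrapper { visibility: hidden; }</style>
<script>var x = 1;</script>
</head><body><h1>Laptops</h1><p>  Fast and light </p></body></html>'

# Strip leading and trailing whitespace
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

docs <- htmlParse(html, asText = TRUE)

# Select text nodes whose parent element is neither <script> nor <style>
data <- xpathApply(docs,
                   "//*[not(self::script or self::style)]/text()",
                   xmlValue)

# Same cleanup as in the question: flatten, normalise whitespace, drop empties
data <- trim(gsub("\t|\n", " ", unlist(data)))
data <- data[data != ""]
data
```

With the real page, html would again come from getURL(url). If other non-visible elements such as noscript leak through, they could be added to the same predicate in the same way.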
