
How to extract text from several "div class" elements (HTML) using R?

My goal is to extract info from this html page to create a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing

One of the variables is the price of the apartments. I've identified that some listings have the div class="row_price" code, which includes the price (example A), but others don't have this code and therefore have no price (example B). Hence I would like R to read the observations without a price as NA; otherwise it will mix up the database by assigning the price from the observation that follows.

Example A

<div class="listing_column listing_row_price">
    <div class="row_price">
      $ 14,800
    </div>
<div class="row_info">Ayer&nbsp;19:53</div>

Example B

<div class="listing_column listing_row_price">

<div class="row_info">Ayer&nbsp;19:50</div>

I think that if I extract the text from "listing_row_price" up to the beginning of "row_info" into a character vector, I will be able to get my desired output, which is:

...
10 4000
11 14800
12 NA
13 14000
14 8000
...

But so far I've gotten this one, and another full of NA.

...
10 4000
11 14800
12 14000
13 8000
14 8500
...

Commands I used that didn't give me what I want:

    html1<-read_html("file.html")
    title<-html_nodes(html1,"div")
    html1<-toString(title)
    pattern1<-'div class="row_price">([^<]*)<'
    title3<-unlist(str_extract_all(title,pattern1))
    title3<-title3[c(1:35)]
    pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
    title3<-unlist(str_extract(title3,pattern2))
    title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
    title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))

I also tried with pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)<', which I think says to match the "listing_row_price" part, then, if it exists, match the "row_price" part, then get the digits, and finally match the < that follows.

There are lots of ways to do this, and depending on how consistent the HTML is, one may be better than another. A reasonably simple strategy that works in this case, though:

library(rvest)

page <- read_html('page.html')

# find all nodes with a class of "listing_row_price"
listings <- html_nodes(page, css = '.listing_row_price')

# for each listing, if it has two children get the text of the first, else return NA
prices <- sapply(listings, function(x){ifelse(length(html_children(x)) == 2, 
                                              html_text(html_children(x)[1]), 
                                              NA)})
# replace everything that's not a number with nothing, and turn it into an integer
prices <- as.integer(gsub('[^0-9]', '', prices))
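
If the number of children per listing is not always reliable, a variation on the same idea is to look for the "row_price" node inside each listing instead of counting children. The sketch below is only an illustration under some assumptions: the inline HTML is a made-up stand-in modeled on Examples A and B (with the closing tags filled in), and it relies on html_node() (singular) returning one result per listing, with a missing value where nothing matches.

```r
library(rvest)

# Hypothetical inline HTML standing in for the real page:
# two listings, the second one without a "row_price" div
html <- '
<div class="listing_column listing_row_price">
  <div class="row_price">$ 14,800</div>
  <div class="row_info">Ayer 19:53</div>
</div>
<div class="listing_column listing_row_price">
  <div class="row_info">Ayer 19:50</div>
</div>'

page <- read_html(html)

# one node per listing
listings <- html_nodes(page, css = '.listing_row_price')

# html_node() (singular) gives one result per listing,
# so a listing with no price keeps its slot in the vector
price_text <- html_text(html_node(listings, css = '.row_price'))

# strip everything that is not a digit and convert to integer
prices <- as.integer(gsub('[^0-9]', '', price_text))
```

Because the result has the same length as listings, the listing without a "row_price" child is expected to come out as NA in its own position rather than shifting the later prices up.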


 