
Web Scraping using R

I am trying to copy the list of hospitals, their addresses, and their phone numbers from Catholic Health Initiatives.

The code I am using is:

# install.packages(c("rvest", "stringr"))
library(rvest)
library(stringr)

# Download and parse the hospital listing page
htmlpage <- read_html("http://www.catholichealthinitiatives.org/landing.cfm?xyzpdqabc=0&id=39524&action=list")
# Select every node with class .info or .address
chihtml <- html_nodes(htmlpage, ".info , .address")
chi <- html_text(chihtml)
# Strip carriage returns, newlines, and tabs
chi <- str_replace_all(chi, "[\r\n\t]", "")
chi

and this is the head of the result:

 [1] "CHI St. VincentTwo St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"
 [2] "Two St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"
 [3] "CHI St. Vincent Hot Springs300 Werner StreetHot Springs National Park, AR 71913P 501.622.1000"
 [4] "300 Werner StreetHot Springs National Park, AR 71913P 501.622.1000"
 [5] "CHI St. Vincent InfirmaryTwo St. Vincent CircleLittle Rock, AR 72205P 502.552.3000F 501.552.4241"
 [6] "Two St. Vincent CircleLittle Rock, AR 72205P 502.552.3000F 501.552.4241"

I would like to remove the duplicated address that appears on the line below each main entry:

[1] "CHI St. VincentTwo St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"
## remove next line ##
[2] "Two St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"

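The selector ".info , .address" returns both sets of nodes, and the text of each .info node already includes its address, so every address shows up twice. Just specify .info or .address in html_nodes, depending on which you want: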

chihtml <- html_nodes(htmlpage,".info")
chi <- html_text(chihtml, trim = TRUE)    # `trim = TRUE` to strip whitespace
head(chi)
# [1] "CHI St. Vincent\nTwo St. Vincent Cr.Little Rock, AR 72205P 501.552.3000F 501.552.4241"                      
# [2] "CHI St. Vincent Hot Springs\n300 Werner StreetHot Springs National Park, AR 71913P 501.622.1000"            
# [3] "CHI St. Vincent Infirmary\nTwo St. Vincent CircleLittle Rock, AR 72205P 502.552.3000F 501.552.4241"         
# [4] "CHI St. Vincent Morrilton\nFour Hospital DriveMorrilton, AR 72110P 501.977.2300F 501.977.2400"              
# [5] "CHI St. Vincent North\n2215 Wildwood AvenueSherwood, AR 72120P 501.977.2300F 501.977.2400"                  
# [6] "CHI St. Vincent Rehabilitation Hospital\n2201 Wildwood AvenueSherwood, AR 72120P 501.834.1800F 501.834.2227"
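
If the fields are needed as separate columns, one option is to query each field relative to its .info node. The following is only a sketch on hypothetical markup: the nesting of .address inside .info, the <h3> name tag, and the <br> line breaks are assumptions inferred from the output above, not taken from the real page, and the "P"/"F" phone patterns are likewise guesses based on the printed strings.

```r
library(rvest)
library(stringr)

# Hypothetical markup standing in for the real page (assumed structure).
html <- '<html><body>
<div class="info"><h3>CHI St. Vincent</h3>
<div class="address">Two St. Vincent Cr.<br>Little Rock, AR 72205<br>P 501.552.3000<br>F 501.552.4241</div></div>
<div class="info"><h3>CHI St. Vincent Hot Springs</h3>
<div class="address">300 Werner Street<br>Hot Springs National Park, AR 71913<br>P 501.622.1000</div></div>
</body></html>'
page <- read_html(html)

# The combined selector matches each .info and the .address nested inside it,
# which is why every address is returned twice.
n_both <- length(html_nodes(page, ".info, .address"))
n_info <- length(html_nodes(page, ".info"))

# html_node() on a node set returns one result per .info, keeping rows aligned.
info <- html_nodes(page, ".info")
address_text <- html_text(html_node(info, ".address"), trim = TRUE)

# Pull the numbers that follow "P " (phone) and "F " (fax); NA when absent.
number <- "(\\d{3}\\.\\d{3}\\.\\d{4})"
hospitals <- data.frame(
  name  = html_text(html_node(info, "h3"), trim = TRUE),
  phone = str_match(address_text, paste0("P ", number))[, 2],
  fax   = str_match(address_text, paste0("F ", number))[, 2],
  stringsAsFactors = FALSE
)
hospitals
```

The class names are the only part taken from the question; the tag names and patterns would need to be checked against the real page source before reuse.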
