Web scraping with R, content
I am just starting with web scraping in R. I wrote this code:
library(rvest)
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
mps %>%
  html_nodes("tr") %>%
  html_text()
I use it to get the content I need, which I save to a text file. My problem is that I want to eliminate these red points, but I can't. Could you please help me? I think these points are replacing <b> and <br> in the HTML code.
Whoever built that page, frustratingly, assembled the listing as a table nested within another table, without wrapping it in a <table> tag of its own, so it's easiest to rebuild the table yourself so that it parses more easily:
library(rvest)
mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")
df <- mps %>%
  html_nodes("tr.Entete1, tr.Tableau1") %>%    # get correct rows
  paste(collapse = '\n') %>%    # paste nodes back to a single string
  paste('<table>', ., '</table>') %>%    # add enclosing table node
  read_html() %>%    # reread as HTML
  html_node('table') %>%
  html_table(fill = TRUE) %>%    # parse as table
  { setNames(.[-1,], make.names(.[1,], unique = TRUE)) }    # grab names from first row
head(df)
#> X Région NA. Nature NA..1 Type NA..2
#> 2 Prix <NA> NA <NA> NA <NA> NA
#> 3 Modifiée NA <NA> NA <NA> NA
#> 4 Kelibia NA Terrain NA Terrain nu NA
#> 5 Cite El Ghazala NA Location NA App. 4 pièc NA
#> 6 Le Bardo NA Location NA App. 1 pièc NA
#> 7 Le Bardo NA Location vacance NA App. 3 pièc NA
#> Texte.annonce NA..3 Prix Prix.1 X.1 Modifiée
#> 2 <NA> NA <NA> <NA> <NA> <NA>
#> 3 <NA> NA <NA> <NA> <NA> <NA>
#> 4 Terrain a 5 km de kelibi NA 80 000 07/05/2017
#> 5 S plus 3 haut standing c NA 790 07/05/2017
#> 6 Appartements meubles NA 40 000 07/05/2017
#> 7 Un bel appartement au bardo m NA 420 07/05/2017
#> Modifiée.1 NA..4 NA..5
#> 2 <NA> NA NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 <NA> NA NA
#> 6 <NA> NA NA
#> 7 <NA> NA NA
Note that there are still a lot of NAs and other cruft to clean up, but at least the data is usable at this point.
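One possible way to start that cleanup is sketched below, assuming the cruft is limited to rows and columns that contain nothing but NA or empty strings. The small df here is a made-up stand-in for the scraped table (its values are illustrative only), and drop_empty is a hypothetical helper, not part of rvest.

```r
# Toy stand-in for the scraped data frame (values are illustrative only)
df <- data.frame(
  Region = c(NA, "Kelibia", "Le Bardo"),
  NA.    = c(NA, NA, NA),
  Nature = c(NA, "Terrain", "Location"),
  X      = c("", "", ""),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop rows and columns that contain only NA or ""
drop_empty <- function(d) {
  blank <- is.na(d) | d == ""    # logical matrix, TRUE where a cell is empty
  d[rowSums(!blank) > 0, colSums(!blank) > 0, drop = FALSE]
}

clean <- drop_empty(df)
print(clean)
```

On the real table the column names would still need tidying by hand, since this only removes cells that carry no information.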
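You can always use a regular expression to remove unwanted characters from the extracted text, for example (assuming mps holds the character vector returned by html_text()):
mps <- gsub("•", " ", mps)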
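As a rough illustration of that substitution on plain text, here is a sketch that runs without fetching the page. It assumes the red points are the bullet character U+2022; txt is a made-up character vector standing in for the output of html_text(), and no_bullets is just an illustrative name.

```r
# Made-up sample standing in for text returned by html_text()
txt <- c("Kelibia\u2022Terrain\u2022Terrain nu", "Le Bardo\u2022Location")

# Replace each bullet (U+2022) with a space; fixed = TRUE matches it literally
no_bullets <- gsub("\u2022", " ", txt, fixed = TRUE)

# Collapse any runs of whitespace and trim both ends
no_bullets <- trimws(gsub("\\s+", " ", no_bullets))
print(no_bullets)
```

Writing the bullet as an escape rather than a literal character avoids depending on the encoding of the script file.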