Web scraping with R, content

I am just starting with web scraping in R, and I wrote this code:

library(rvest)    # provides read_html(), html_nodes(), html_text() and %>%

mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")

mps %>%
    html_nodes("tr") %>%
    html_text()

I use it to get the content I need, which I then write to a text file. My problem is that I want to eliminate these red points, but I can't. Could you please help me? I think these points are replacing <b> and <br> in the HTML code.

Whoever constructed that page, very frustratingly, assembled the table within a table but did not give it a <table> tag of its own, so it's easiest to re-wrap the rows in a <table> so they parse more easily:

library(rvest)

mps <- read_html("http://tunisie-annonce.com/AnnoncesImmobilier.asp")

df <- mps %>%
    html_nodes("tr.Entete1, tr.Tableau1") %>%    # get correct rows
    paste(collapse = '\n') %>%     # paste nodes back to a single string
    paste('<table>', ., '</table>') %>%     # add enclosing table node
    read_html() %>%    # reread as HTML
    html_node('table') %>% 
    html_table(fill = TRUE) %>%    # parse as table
    { setNames(.[-1,], make.names(.[1,], unique = TRUE)) }    # grab names from first row

head(df)
#>          X          Région NA.           Nature NA..1        Type NA..2
#> 2     Prix            <NA>  NA             <NA>    NA        <NA>    NA
#> 3 Modifiée                  NA             <NA>    NA        <NA>    NA
#> 4                  Kelibia  NA          Terrain    NA  Terrain nu    NA
#> 5          Cite El Ghazala  NA         Location    NA App. 4 pièc    NA
#> 6                 Le Bardo  NA         Location    NA App. 1 pièc    NA
#> 7                 Le Bardo  NA Location vacance    NA App. 3 pièc    NA
#>                   Texte.annonce NA..3   Prix Prix.1        X.1 Modifiée
#> 2                          <NA>    NA   <NA>   <NA>       <NA>     <NA>
#> 3                          <NA>    NA   <NA>   <NA>       <NA>     <NA>
#> 4      Terrain a 5 km de kelibi    NA 80 000        07/05/2017         
#> 5      S plus 3 haut standing c    NA    790        07/05/2017         
#> 6          Appartements meubles    NA 40 000        07/05/2017         
#> 7 Un bel appartement au bardo m    NA    420        07/05/2017         
#>   Modifiée.1 NA..4 NA..5
#> 2       <NA>    NA    NA
#> 3       <NA>    NA    NA
#> 4       <NA>    NA    NA
#> 5       <NA>    NA    NA
#> 6       <NA>    NA    NA
#> 7       <NA>    NA    NA

Note there are still a lot of NAs and other cruft here to be cleaned up, but at least it's usable at this point.
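
As a rough sketch of that cleanup, assuming the unwanted columns are entirely NA and the leftover header rows are NA or blank in every column, base R can drop both. The small `df` below is a hypothetical stand-in for the scraped result, not the real data, and `is_blank` is a helper name made up for this example:

```r
# Hypothetical stand-in for the scraped data frame (not the real data)
df <- data.frame(
    Region = c(NA, "", "Kelibia", "Le Bardo"),
    NA.    = c(NA, NA, NA, NA),
    Nature = c(NA, NA, "Terrain", "Location"),
    Prix   = c(NA, NA, "80 000", "420"),
    stringsAsFactors = FALSE
)

# TRUE where a cell is NA or contains only whitespace
is_blank <- function(x) is.na(x) | trimws(as.character(x)) == ""

blank <- sapply(df, is_blank)    # logical matrix, same shape as df

# keep only rows and columns that have at least one non-blank cell
clean <- df[rowSums(!blank) > 0, colSums(!blank) > 0, drop = FALSE]
rownames(clean) <- NULL
```

On the real data the same two filters should remove the `NA.`-style columns and the stray header rows, though columns that are only partly empty would still need a manual look.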

You can always use a regular expression to remove unwanted characters, for example:

mps <- gsub("•", " ", mps)
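
Since `gsub()` operates on character vectors, the substitution is probably best applied to the text returned by `html_text()` rather than to the parsed document. A minimal sketch, assuming the red points come through as the bullet character `•` (U+2022); the sample strings are made up for illustration:

```r
# Made-up sample of scraped text (stand-in for the html_text() output)
txt <- c("\u2022Kelibia\u2022Terrain", "  Le Bardo \u2022 Location  ")

txt <- gsub("\u2022", " ", txt, fixed = TRUE)    # replace each bullet with a space
txt <- trimws(gsub("\\s+", " ", txt))            # collapse repeated whitespace, trim ends
```

Replacing the bullet with a space rather than deleting it keeps neighbouring fields from running together; the second pass then tidies the extra whitespace.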
