
Web Scraping table in R only gives header

Problem: I'm trying to scrape the #gene_regulation_table element from the following site; the browser inspector shows it as a table. However, I only get the header row, not the table contents.

What I've tried:

library(xml2)
library(rvest)  # html_node() and html_table() come from rvest, not xml2
library(httr)
library(XML)
url <- 'http://rna.sysu.edu.cn/chipbase/regulator_browse.php?organism=human&assembly=hg38&ref_gene_id=ENSG00000105835.11&gene_symbol=NAMPT#0'
website <- read_html(url)
table_node <- html_node(website, "#gene_regulation_table")
table <- html_table(table_node)  # contains only the header row

# The same problem happens with the XML package:
tables <- getNodeSet(htmlParse(url), "//table")
xt <- readHTMLTable(tables[[2]])

So I'm definitely doing something wrong.

Any help welcome!

The table is empty in the initial HTML sent by the server, which is why read_html() and htmlParse() see only the header. The table is then populated by a JavaScript XHR request made by your browser, which returns a JSON string. You can replicate this request with the httr::POST function, although you need to know all the form parameters. In your case, I have put them all into a list here:

form_body <- list(draw = "1", `columns[0][data]` = "protein", `columns[0][name]` = "", 
    `columns[0][searchable]` = "true", `columns[0][orderable]` = "true", 
    `columns[0][search][value]` = "", `columns[0][search][regex]` = "false", 
    `columns[1][data]` = "synonyms", `columns[1][name]` = "", 
    `columns[1][searchable]` = "true", `columns[1][orderable]` = "true", 
    `columns[1][search][value]` = "", `columns[1][search][regex]` = "false", 
    `columns[2][data]` = "protein_full_name", `columns[2][name]` = "", 
    `columns[2][searchable]` = "true", `columns[2][orderable]` = "true", 
    `columns[2][search][value]` = "", `columns[2][search][regex]` = "false", 
    `columns[3][data]` = "upstream_sample_motif_hits", `columns[3][name]` = "", 
    `columns[3][searchable]` = "true", `columns[3][orderable]` = "true", 
    `columns[3][search][value]` = "", `columns[3][search][regex]` = "false", 
    `columns[4][data]` = "downstream_sample_motif_hits", `columns[4][name]` = "", 
    `columns[4][searchable]` = "true", `columns[4][orderable]` = "true", 
    `columns[4][search][value]` = "", `columns[4][search][regex]` = "false", 
    `columns[5][data]` = "upstream_motif", `columns[5][name]` = "", 
    `columns[5][searchable]` = "true", `columns[5][orderable]` = "true", 
    `columns[5][search][value]` = "", `columns[5][search][regex]` = "false", 
    `columns[6][data]` = "downstream_motif", `columns[6][name]` = "", 
    `columns[6][searchable]` = "true", `columns[6][orderable]` = "true", 
    `columns[6][search][value]` = "", `columns[6][search][regex]` = "false", 
    `order[0][column]` = "3", `order[0][dir]` = "desc", `order[1][column]` = "0", 
    `order[1][dir]` = "asc", start = "0", length = "10", `search[value]` = "", 
    `search[regex]` = "false", assembly = "hg38", ref_gene_id = "ENSG00000105835.11", 
    regulator_type = "tf", upstream = "1kb", downstream = "1kb", 
    motif_status = "Y", sample_flag = "0")
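
Since the seven column blocks in the list above all follow the same pattern, the list could also be generated rather than typed out. The following is only a sketch: build_form_body() and its arguments (ref_gene_id, assembly, start, page_size) are hypothetical names that are not part of the original answer, and it assumes the server accepts exactly the fields shown above.

```r
# Hypothetical helper (not from the original answer): rebuilds the same
# 58 form fields as the literal list above, with paging made adjustable.
build_form_body <- function(ref_gene_id = "ENSG00000105835.11",
                            assembly = "hg38",
                            start = 0, page_size = 10) {
  column_names <- c("protein", "synonyms", "protein_full_name",
                    "upstream_sample_motif_hits",
                    "downstream_sample_motif_hits",
                    "upstream_motif", "downstream_motif")
  suffixes <- c("[data]", "[name]", "[searchable]", "[orderable]",
                "[search][value]", "[search][regex]")

  # Six fields per column, with columns indexed from 0 as in the list above.
  column_fields <- unlist(lapply(seq_along(column_names), function(i) {
    values <- c(column_names[i], "", "true", "true", "", "false")
    names(values) <- paste0(sprintf("columns[%d]", i - 1L), suffixes)
    as.list(values)
  }), recursive = FALSE)

  c(list(draw = "1"),
    column_fields,
    list(`order[0][column]` = "3", `order[0][dir]` = "desc",
         `order[1][column]` = "0", `order[1][dir]` = "asc",
         # as.integer() avoids scientific notation for large offsets
         start = as.character(as.integer(start)),
         length = as.character(as.integer(page_size)),
         `search[value]` = "", `search[regex]` = "false",
         assembly = assembly, ref_gene_id = ref_gene_id,
         regulator_type = "tf", upstream = "1kb", downstream = "1kb",
         motif_status = "Y", sample_flag = "0"))
}

form_body_generated <- build_form_body()
```

With the default arguments, form_body_generated should hold the same names and values as form_body. If the server honours the start and length fields in the usual DataTables way (an assumption, not verified here), build_form_body(start = 10) would ask for the next ten rows.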

So now you can do

form_url <- "http://rna.sysu.edu.cn/chipbase/php/get_gene_search_symbol_info.php"
result_json <- httr::content(httr::POST(form_url, body = form_body), "text")

It is then easy to parse the JSON with an R package such as jsonlite to get a data frame containing all the info you want:

df <- jsonlite::fromJSON(result_json)
dplyr::as_tibble(df$data)
#> # A tibble: 10 x 7
#>    protein synonyms protein_full_na~ upstream_sample~ downstream_samp~ upstream_motif
#>    <chr>   <chr>    <chr>            <chr>            <chr>            <chr>         
#>  1 FOXA1   HNF3A, ~ forkhead box A1  2                0                2             
#>  2 HNF4A   FRTS4, ~ hepatocyte nucl~ 2                0                2             
#>  3 BARHL1  -        BarH-like homeo~ 0                1                0             
#>  4 BHLHE40 BHLHB2,~ basic helix-loo~ 0                1                0             
#>  5 CAMTA2  -        calmodulin bind~ 0                1                0             
#>  6 CDX2    CDX-3, ~ caudal type hom~ 0                1                0             
#>  7 CREB1   CREB     cAMP responsive~ 0                2                0             
#>  8 CTCF    MRD21    CCCTC-binding f~ 0                16               0             
#>  9 E2F1    E2F-1, ~ E2F transcripti~ 0                1                0             
#> 10 E2F3    E2F-3    E2F transcripti~ 0                1                0             
#> # ... with 1 more variable: downstream_motif <chr>
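
Note that the count columns in the output above come back as character (<chr>), so they sort alphabetically rather than numerically. A minimal sketch of converting them, using a few rows copied from the output above so that it runs without a network connection; sample_data and count_cols are illustrative names, and in practice you would apply the same conversion to df$data:

```r
# Three rows copied from the printed output above, kept as character
# to mirror what jsonlite::fromJSON() returned.
sample_data <- data.frame(
  protein = c("FOXA1", "HNF4A", "CTCF"),
  upstream_sample_motif_hits = c("2", "2", "0"),
  downstream_sample_motif_hits = c("0", "0", "16"),
  upstream_motif = c("2", "2", "0"),
  stringsAsFactors = FALSE
)

# Columns that hold counts stored as text.
count_cols <- c("upstream_sample_motif_hits",
                "downstream_sample_motif_hits",
                "upstream_motif")

# Convert each count column to integer in place.
sample_data[count_cols] <- lapply(sample_data[count_cols], as.integer)

str(sample_data)
```

The downstream_motif column is left out here because its values are not visible in the truncated output, so check its contents before converting it.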

