
Scraping a data table from clinicaltrials.gov with rvest

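I want to scrape the data table that appears when I enter a search term on clinicaltrials.gov. Specifically, I want the table shown on this page: https://clinicaltrials.gov/ct2/results?term=nivolumab+AND+Overall+Survival. See the screenshot below:

[screenshot]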

I have tried the code below, but I don't think I have the right CSS selector:

# attach rvest so that the %>% pipe is available
library(rvest)

# create custom url
ctgov_url <- "https://clinicaltrials.gov/ct2/results?term=nivolumab+AND+Overall+Survival"
# read HTML page
ct_page <- rvest::read_html(ctgov_url)

# try to extract the results table
ct_page %>%
  # find the first element that matches a css selector ("t" is the selector in doubt)
  rvest::html_element("t") %>%
  # parse the matched element as a table
  rvest::html_table()
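On the selector itself: in CSS, "t" is a type selector that matches <t> elements, so it will not find a <table>. The sketch below illustrates this on a small made-up HTML fragment built with rvest::minimal_html(); the fragment is an assumption for illustration, not the real clinicaltrials.gov markup, and the sketch does not check whether the live page contains its table in the static HTML.

```r
library(rvest)

# Made-up stand-in for a results page (assumption: not the real site's markup)
page <- minimal_html('
  <table>
    <tr><th>Rank</th><th>Title</th></tr>
    <tr><td>1</td><td>Study A</td></tr>
    <tr><td>2</td><td>Study B</td></tr>
  </table>')

# "t" looks for <t> elements; none exist, so a missing node comes back
no_match <- html_element(page, "t")
print(inherits(no_match, "xml_missing"))

# "table" matches the <table> element, which html_table() can parse
tbl <- page %>%
  html_element("table") %>%
  html_table()
print(tbl)
```

Because html_element() returns a missing node rather than failing when nothing matches, the problem only shows up at the html_table() step.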

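You don't need rvest at all. The page provides a download button that returns a CSV of the search results. The download uses a basic URL-encoded GET syntax, which lets you build a simple little API: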

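get_clin_trials_data <- function(terms, n = 1000) {

  # join the terms with AND and percent-encode the query (spaces become %20)
  terms <- URLencode(paste(terms, collapse = " AND "), reserved = TRUE)

  # request the CSV download for this search, limited to n rows
  df <- read.csv(paste0(
    "https://clinicaltrials.gov/ct2/results/download_fields",
    "?down_count=", n, "&down_flds=shown&down_fmt=csv",
    "&term=", terms, "&flds=a&flds=b&flds=y"))

  # return a tibble for compact printing
  dplyr::as_tibble(df)
}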

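This lets you pass in a vector of search terms and the maximum number of results to return. None of the complex parsing that web scraping involves is needed.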

get_clin_trials_data(c("nivolumab", "Overall Survival"), n = 10)
#> # A tibble: 10 x 8
#>     Rank Title     Status Study.Results Conditions Interventions Locations URL  
#>    <int> <chr>     <chr>  <chr>         <chr>      <chr>         <chr>     <chr>
#>  1     1 A Study ~ Compl~ No Results A~ Hepatocel~ ""            "Bristol~ http~
#>  2     2 Nivoluma~ Activ~ No Results A~ Glioblast~ "Drug: Nivol~ "Duke Un~ http~
#>  3     3 Nivoluma~ Unkno~ No Results A~ Melanoma   "Biological:~ "CHU d'A~ http~
#>  4     4 Study of~ Compl~ Has Results   Advanced ~ "Biological:~ "Highlan~ http~
#>  5     5 A Study ~ Unkno~ No Results A~ Brain Met~ "Drug: Fotem~ "Medical~ http~
#>  6     6 Trial of~ Compl~ Has Results   Squamous ~ "Drug: Nivol~ "Stanfor~ http~
#>  7     7 Nivoluma~ Compl~ No Results A~ MGMT-unme~ "Drug: Nivol~ "New Yor~ http~
#>  8     8 Study of~ Compl~ Has Results   Squamous ~ "Biological:~ "Mayo Cl~ http~
#>  9     9 Study of~ Compl~ Has Results   Non-Squam~ "Biological:~ "Mayo Cl~ http~
#> 10    10 An Open-~ Unkno~ No Results A~ Squamous-~ "Drug: Nivol~ "IRCCS -~ http~

Created on 2022-06-21 by the reprex package (v2.0.1)
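To make the GET syntax concrete, here is a minimal sketch that only builds the download URL without requesting it, so the query string can be inspected first. The helper name build_ct_download_url is hypothetical and not part of the answer; the query parameters are copied from the function above, and whether the endpoint still responds is an assumption this sketch does not test.

```r
# Hypothetical helper (not from the answer): build the CSV download URL only,
# using the same query parameters as get_clin_trials_data()
build_ct_download_url <- function(terms, n = 1000) {
  # join the terms with AND, then percent-encode (spaces become %20)
  query <- URLencode(paste(terms, collapse = " AND "), reserved = TRUE)
  paste0(
    "https://clinicaltrials.gov/ct2/results/download_fields",
    "?down_count=", n, "&down_flds=shown&down_fmt=csv",
    "&term=", query, "&flds=a&flds=b&flds=y"
  )
}

url <- build_ct_download_url(c("nivolumab", "Overall Survival"), n = 10)
print(url)
```

With these inputs the search terms appear in the URL as term=nivolumab%20AND%20Overall%20Survival and the row limit as down_count=10.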
