
Web Scraping using rvest in R

I have been trying to scrape information from a URL in R using the rvest package:

url <-'https://eprocure.gov.in/cppp/tendersfullview/id%3DNDE4MTY4MA%3D%3D/ZmVhYzk5NWViMWM1NTdmZGMxYWYzN2JkYTU1YmQ5NzU%3D/MTUwMjk3MTg4NQ%3D%3D'

but I am not able to correctly identify the XPath, even after using a selector plugin.

The code I am using to fetch the first table is as follows:

library(rvest)

# read the page, then pull the nested table via its XPath
detail_data <- read_html(url)
detail_data_raw <- html_nodes(detail_data,
                              xpath = '//*[@id="edit-t-fullview"]/table[2]/tbody/tr[2]/td/table')
detail_data_fine <- html_table(detail_data_raw)

When I run the above code, detail_data_raw comes back as {xml_nodeset (0)}, and consequently detail_data_fine is an empty list().

The information I am interested in scraping is under the headers:

Organisation Details

Tender Details

Critical Dates

Work Details

Tender Inviting Authority Details

Any help or ideas on what is going wrong and how to rectify it are welcome.

Your example URL isn't working for anyone, but if you're looking to get the data for a particular tender, then:

library(rvest)
library(stringi)
library(tidyverse)

pg <- read_html("https://eprocure.gov.in/mmp/tendersfullview/id%3D2262207")

# the first column of each row holds the field labels;
# clean them up into usable column names
html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[1]") %>%
  html_text(trim=TRUE) %>%
  stri_replace_last_regex(" +:$", "") %>%   # drop the trailing " :"
  stri_replace_all_fixed(" ", "_") %>%
  stri_trans_tolower() -> tenders_cols

# the second column holds the values; pair them with the cleaned names
html_nodes(pg, xpath=".//table[@class='viewtablebg']/tr/td[2]") %>%
  html_text(trim=TRUE) %>%
  as.list() %>%
  set_names(tenders_cols) %>%
  flatten_df() %>%
  glimpse()
## Observations: 1
## Variables: 15
## $ organisation_name            <chr> "Delhi Jal Board"
## $ organisation_type            <chr> "State Govt. and UT"
## $ tender_reference_number      <chr> "Short NIT. No.20 (Item no.1) EE ...
## $ tender_title                 <chr> "Short NIT. No.20 (Item no.1)"
## $ product_category             <chr> "Civil Works"
## $ tender_fee                   <chr> "Rs.500"
## $ tender_type                  <chr> "Open/Advertised"
## $ epublished_date              <chr> "18-Aug-2017 05:15 PM"
## $ document_download_start_date <chr> "18-Aug-2017 05:15 PM"
## $ bid_submission_start_date    <chr> "18-Aug-2017 05:15 PM"
## $ work_description             <chr> "Replacement of settled deep sewe...
## $ pre_qualification            <chr> "Please refer Tender documents."
## $ tender_document              <chr> "https://govtprocurement.delhi.go...
## $ name                         <chr> "EXECUTIVE ENGINEER (NORTH)-II"
## $ address                      <chr> "EXECUTIVE ENGINEER (NORTH)-II\r\...

seems to work just fine without installing Python and using Selenium.

Have a look at 'dynamic web scraping'. Typically, when you enter a URL in your browser, it sends a GET request to the host server. For a static page, the host server builds an HTML page with all the data in it and sends it back to you. For a dynamic page, the server just sends you an HTML template; once you open it, JavaScript runs in your browser and retrieves the data that populates the template.
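A quick way to check which case you are in from R: read_html() only fetches the raw response and never executes JavaScript, so if a node you can see in your browser's inspector is absent from the raw document, the page is dynamic. A minimal sketch, using the question's URL and XPath:

library(rvest)

# read_html() fetches the raw HTML only; no javascript is executed
raw <- read_html('https://eprocure.gov.in/cppp/tendersfullview/id%3DNDE4MTY4MA%3D%3D/ZmVhYzk5NWViMWM1NTdmZGMxYWYzN2JkYTU1YmQ5NzU%3D/MTUwMjk3MTg4NQ%3D%3D')

# 0 nodes here, while the table is visible in the browser, means the
# table is filled in client-side after the template loads
length(html_nodes(raw, xpath = '//*[@id="edit-t-fullview"]//table'))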

I would recommend scraping this page using Python and the Selenium library. Selenium gives your program the ability to wait until the JavaScript has run in your browser and retrieved the data. See below a question I asked on the same concept, and the very helpful reply I received:

BeautifulSoup parser can't access html elements
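If you would rather stay in R than switch to Python, the RSelenium package plays the same role as Python's Selenium bindings. A minimal sketch, assuming a local Firefox install that rsDriver() can drive, with a fixed Sys.sleep() as a crude stand-in for an explicit wait, and reusing the question's URL and XPath:

library(RSelenium)
library(rvest)

url <- 'https://eprocure.gov.in/cppp/tendersfullview/id%3DNDE4MTY4MA%3D%3D/ZmVhYzk5NWViMWM1NTdmZGMxYWYzN2JkYTU1YmQ5NzU%3D/MTUwMjk3MTg4NQ%3D%3D'

# start a local Selenium server and a browser client
rD <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- rD$client

remDr$navigate(url)
Sys.sleep(5)  # crude wait for the javascript to fetch and render the data

# hand the fully rendered HTML over to rvest
pg <- read_html(remDr$getPageSource()[[1]])
html_table(html_nodes(pg, xpath = '//*[@id="edit-t-fullview"]//table'), fill = TRUE)

remDr$close()
rD$server$stop()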
