Web scraping with an intermediate page in R

I need to scrape the web page below. I need to submit the form with some specific values and then import the results (the table behind the "View results as text file" link) into R as a data.frame. I tried to submit the form with the following code, but I did not get any results:

library(rvest)
library(httr)

POST(
  url = "http://tempest.wellesley.edu/~btjaden/TargetRNA2/advanced.html",
  encode = "form",
  body=list(
    `text` = "Escherichia coli str. K-12 substr. MG1655",
    `sequence` = ">RyhB GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTT",
    `sRNA_subregions` = "on",
    `window` = "13",
    `before` = "80",
    `after` = "20",
    `seed` = "7",
    `interaction_region` = "20",
    `candidate_targets` = "",
    `mRNA_accessibility` = "on",
    `sigle_target` = "",
    `pvalue`= "0.05",
    `max_interactions`="400"
  ),
  verbose()
) -> res
content(res, as="parsed")

I think there is an intermediate page ( http://tempest.wellesley.edu/~btjaden/cgi-bin/processRequest2.cgi ) that is shown before the results load, but I do not know which parameters it expects, so I cannot get the results. I want to get this table ( http://tempest.wellesley.edu/~btjaden/cgi-bin/targetRNA2.cgi?t1519754493.26 ):

Rank    Gene    Synonym Energy  Pvalue  sRNA_start  sRNA_stop   mRNA_start  mRNA_stop
1   sdhD    b0722   -12.98  0.004       28          42          -34         -20
2   ascG    b2714   -12.65  0.005       52          65          8           20
3   ygjH    b3074   -12.24  0.006       45          59          -8          6
4   sodB    b1656   -11.43  0.011       37          50          -7          6
5   acnA    b1276   -11.14  0.013       33          48          -6          9
6   srlQ    b2708   -10.79  0.015       34          48          -6          8
7   cirA    b2155   -10.71  0.016       40          57          -58         -40
8   nirB    b3365   -10.51  0.018       37          55          -6          13
9   djlB    b0646   -10.41  0.019       53          63          9           19
10  shiA    b1981   -9.96   0.024       43          58          -63         -47
11  yhhN    b3468   -9.78   0.026       50          62          -61         -49
12  ybbP    b0496   -9.45   0.030       48          59          -7          4
13  ssuD    b0935   -9.43   0.031       50          62          -19         -7
14  cysE    b3607   -8.99   0.037       33          49          -8          10
15  insH1   b2030   -8.86   0.039       29          39          -75         -65
16  hscA    b2526   -8.82   0.040       52          66          -20         -5
17  yciS    b1279   -8.69   0.043       45          59          -10         5
18  dhaL    b1199   -8.63   0.044       37          50          -8          6
19  nuoA    b2288   -8.6    0.044       42          59          -8          8
20  narG    b1224   -8.47   0.047       36          47          -51         -40
21  yraK    b3145   -8.37   0.049       27          41          -80         -68

The POST should go to the processRequest2.cgi endpoint rather than to advanced.html, which only displays the form:

library(rvest)
library(httr)

POST(
  url = "http://tempest.wellesley.edu/~btjaden/cgi-bin/processRequest2.cgi",
  encode = "form",
  body=list(
    `text` = "Escherichia coli str. K-12 substr. MG1655",
    `sequence` = ">RyhB GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGACATTGCTCACATTGCTTCCAGTATTACTTAGCCAGCCGGGTGCTGGCTTTT",
    `sRNA_subregions` = "on",
    `window` = "13",
    `before` = "80",
    `after` = "20",
    `seed` = "7",
    `interaction_region` = "20",
    `candidate_targets` = "",
    `mRNA_accessibility` = "on",
    `sigle_target` = "",
    `pvalue`= "0.05",
    `max_interactions`="400"
  ),
  verbose()
) -> res

The response is the intermediate page. After that, you can look in its meta tag for the URL that it eventually redirects you to:

content(res, as="parsed") %>% 
  html_node(xpath=".//meta[@http-equiv]") %>% 
  html_attr("content") %>% 
  strsplit("=") %>% 
  .[[1]] %>% 
  .[2] %>% 
  sprintf("http://tempest.wellesley.edu/~btjaden/cgi-bin/%s", .) -> target_url
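To see what that pipeline does, here is a minimal offline sketch run against a made-up intermediate page. The HTML string and the exact form of the `content` attribute (`6; URL=...`) are assumptions for illustration; the real page may differ. The sketch also reads the wait time from the same attribute, and it strips everything up to the first `=` instead of splitting on every `=`, so a target that itself contains `=` would stay intact.

```r
library(rvest)

# Hypothetical intermediate page (assumed markup, not copied from the site).
html <- paste0(
  '<html><head>',
  '<meta http-equiv="refresh" content="6; URL=targetRNA2.cgi?t1519754493.26">',
  '</head><body>Please wait</body></html>'
)

# Same node lookup as the answer: the <meta> tag that carries http-equiv.
meta_content <- read_html(html) %>%
  html_node(xpath = ".//meta[@http-equiv]") %>%
  html_attr("content")

# Everything before the first ";" is the delay in seconds.
wait_secs <- as.numeric(sub(";.*$", "", meta_content))

# Everything after the first "=" is the relative path of the results page.
rel_path <- sub("^[^=]*=\\s*", "", meta_content)

target_url <- paste0("http://tempest.wellesley.edu/~btjaden/cgi-bin/", rel_path)

print(wait_secs)
print(target_url)
```

With the real response, `html` would be replaced by `content(res, as="parsed")`.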

The site says to wait 6 seconds:

Sys.sleep(6)
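A fixed pause may not be enough if the server takes longer than it announces. As a hedged alternative, the sketch below retries until a readiness check passes. `poll_until` is a hypothetical helper, not part of rvest or httr, and the readiness condition for the real page is an assumption you would need to adapt. The demo uses a fake fetch function so that it runs offline.

```r
# Hypothetical helper: call fetch() until is_ready(result) is TRUE,
# pausing `wait` seconds between attempts, and give up after max_tries.
poll_until <- function(fetch, is_ready, max_tries = 5, wait = 6) {
  for (i in seq_len(max_tries)) {
    result <- fetch()
    if (is_ready(result)) {
      return(result)
    }
    Sys.sleep(wait)
  }
  stop("results were not ready after ", max_tries, " attempts")
}

# Offline demo: a fake fetch that only becomes "ready" on the third call.
calls <- 0
fake_fetch <- function() {
  calls <<- calls + 1
  calls
}
demo_result <- poll_until(fake_fetch, function(x) x >= 3, max_tries = 5, wait = 0)
print(demo_result)
```

Against the live site the call might look like `poll_until(function() read_html(target_url), function(p) length(html_nodes(p, "table")) > 1)`, assuming the finished page is the one that contains more than one table.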

Then you can get the data:

pg <- read_html(target_url)

html_nodes(pg, "table")
## {xml_nodeset (89)}
##  [1] <table><tr>\n<td align="left"><code>GCGATCAGGAAGACCCTCGCGGAGAACCTGAAAGCACGAC< ...
##  [2] <table width="800">\n<tr>\n<th align="center">Rank</th>\n  <th align="center" ...
##  [3] <table width="355"><tr>\n<td align="left">1</td>\n      <td width="90%">\n    ...
##  [4] <table width="100%">\n<tr><td></td></tr>\n<tr><td width="100%" bgcolor="white ...
##  [5] <table width="355"><tr>\n<td width="32%"> </td>\n      <td bgcolor="1E90FF">  ...
##  [6] <table width="355"><tr>\n<td width="56%"> </td>\n      <td bgcolor="1E90FF">  ...
##  [7] <table width="355"><tr>\n<td width="49%"> </td>\n      <td bgcolor="1E90FF">  ...
##  [8] <table width="355"><tr>\n<td width="41%"> </td>\n      <td bgcolor="1E90FF">  ...
##  [9] <table width="355"><tr>\n<td width="37%"> </td>\n      <td bgcolor="1E90FF">  ...
## [10] <table width="355"><tr>\n<td width="38%"> </td>\n      <td bgcolor="1E90FF">  ...
## [11] <table width="355"><tr>\n<td width="44%"> </td>\n      <td bgcolor="1E90FF">  ...
## [12] <table width="355"><tr>\n<td width="41%"> </td>\n      <td bgcolor="1E90FF">  ...
## [13] <table width="355"><tr>\n<td width="57%"> </td>\n      <td bgcolor="1E90FF">  ...
## [14] <table width="355"><tr>\n<td width="47%"> </td>\n      <td bgcolor="1E90FF">  ...
## [15] <table width="355"><tr>\n<td width="54%"> </td>\n      <td bgcolor="1E90FF">  ...
## [16] <table width="355"><tr>\n<td width="52%"> </td>\n      <td bgcolor="1E90FF">  ...
## [17] <table width="355"><tr>\n<td width="54%"> </td>\n      <td bgcolor="1E90FF">  ...
## [18] <table width="355"><tr>\n<td width="37%"> </td>\n      <td bgcolor="1E90FF">  ...
## [19] <table width="355"><tr>\n<td width="33%"> </td>\n      <td bgcolor="1E90FF">  ...
## [20] <table width="355"><tr>\n<td width="56%"> </td>\n      <td bgcolor="1E90FF">  ...
## ...
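The question asks for a data.frame, and the second node in the output above is the one whose header starts with Rank. The sketch below shows how `html_table()` from rvest turns such a node into a data frame. It runs on a simplified stand-in table built from the first two rows in the question; the real table may contain nested tables (the later nodes look like per-row graphics), so the live result may need extra cleaning.

```r
library(rvest)

# Simplified stand-in for the results table (assumed markup; only five of the
# nine columns and two of the rows from the question are reproduced here).
html <- '<table width="800">
<tr><th align="center">Rank</th><th align="center">Gene</th>
    <th align="center">Synonym</th><th align="center">Energy</th>
    <th align="center">Pvalue</th></tr>
<tr><td>1</td><td>sdhD</td><td>b0722</td><td>-12.98</td><td>0.004</td></tr>
<tr><td>2</td><td>ascG</td><td>b2714</td><td>-12.65</td><td>0.005</td></tr>
</table>'

# html_table() uses the <th> row as column names and converts numeric columns.
results <- read_html(html) %>%
  html_node("table") %>%
  html_table()

# Newer rvest versions return a tibble; coerce to a plain data.frame.
results <- as.data.frame(results)

print(results)
```

Against the live page, the equivalent call would presumably be `html_table(html_nodes(pg, "table")[[2]])`.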
