Can't validate HTML forms using R and R packages RCurl and RHTMLForms

I am trying to extract energy/water data with R from this Brazilian website: http://www.ons.org.br/historico/energia_natural_afluente.aspx

Don't bother looking for the English version of the website, since this page doesn't exist there.

The page has HTML form fields for three selections: "Região ou Bacia" (Region or basin), "Unidade de medida" (Unit of measurement) and "Período" (Period). I don't need the last one.

Also, the fields must be selected in order from top to bottom; for example, you can only select a Unit of measurement after having selected a Region or basin.

After you have made the selections and pressed "consultar", a webpage should appear with a table and a graph. I'm only interested in extracting the table. The URL of the page that appears is http://www.ons.org.br/historico/energia_natural_afluente_out.aspx, so you have to submit those forms in order to get to it.

At first I tried using the XML package and RHTMLForms (available at omegahat) but that didn't work out as shown below.

library(XML)
library(RCurl)
library(RHTMLForms)
x <- getHTMLFormDescription("http://www.ons.org.br/historico/energia_natural_afluente.aspx", 
encoding = "utf-8")

Examining the contents of x I found out that my forms of interest were located in x[[4]]:

 > x[[4]]
HTML Form: http://www.ons.org.br/historico/energia_natural_afluente_out.aspx 
passo1: [ -1 ]  -1, SE, S, NE, N, Grande, Paranaiba, Tiete, Paranapanema, Parana, Iguacu, Uruguai, Jacui, Capivari, Paraguai, Paraiba_do_sul, Doce, Itabapoana, São_francisco, Parnaiba, Tocantins, Amazonas, Selecione, Paranaíba, Tietê, Paraná, Iguaçu, Paraguai (a partir de 2001), Paraíba do Sul, Itabapoana (a partir de 2001), São Francisco, Parnaíba, Amazonas (a partir de 2001)
passo2a: [ -1 ]  -1, Selecione
passo2b: [ -1 ]  -1, MWmed, MLT, Selecione, %MLT
passo3a: [ -1 ]  -1, Selecione
passo3b: [ -1 ]  -1, 2016, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, Selecione
comparar: 1
passo4a: [ -1 ]  -1, Selecione
passo4b: [ -1 ]  -1, 2015, 2014, 2013, 2012, 2011, 2010, 2009, 2008, 2007, 2006, 2005, 2004, 2003, 2002, 2001, 2000, Selecione

Using the createFunction from RHTMLForms:

fun1 <- createFunction(x[[4]])

And then I tried passing all of the arguments to the function:

X <- fun1(passo1 = "SE", passo2a = "-1", passo2b = "MWmed", passo3a = "-1", passo3b = "2014", comparar = "", passo4a = "-1", passo4b = "2009")

But examining the object X, I did not obtain the webpage with a table and a graph as expected. I also tried changing some of the arguments, but that didn't work either.

I also tried using getForm and postForm from the RCurl package:

test <- getForm("http://www.ons.org.br/historico/energia_natural_afluente.aspx", .params = list(passo1 = "NE", passo2b = "MWmed", passo3b = 2014))

teste2 <- postForm("http://www.ons.org.br/historico/energia_natural_afluente.aspx",.params = list(passo1 = "SE", passo2a = -1, passo2b = "MWmed", passo3a = -1, passo3b = 2014, comparar = 1, passo4a = -1, passo4b = 2009))

Unfortunately, that didn't work either. Examining the HTML code, I think POST is the more appropriate method, although I'm not very good at HTML...

Can someone please help me to webscrape the page?

If you poke around at the values in the select popup menus (using the Chrome/Safari/Firefox Developer Tools), you can issue the same POST request that the page makes and get the values directly:

library(xml2)
library(httr)
library(rvest)

# POST() already implies the HTTP verb, so no verb argument is needed
req <- POST(
     url = "http://www.ons.org.br/historico/energia_natural_afluente_out.aspx", 
     body = list(passo1 = "Jacui", 
                 passo2a = "-1", 
                 passo2b = "MLT", 
                 passo3a = "-1", 
                 passo3b = "2008", 
                 tipo = "bacia", 
                 passo2 = "MLT", 
                 passo3 = "2008", 
                 passo4 = "-1", 
                 passo1text = "Jacui", 
                 passo2text = "%MLT", 
                 passo3text = "2008",
                 passo4text = "-1"), 
     encode = "form") 

content(req, as="text") %>% 
  read_html() %>% 
  html_nodes("table.tabelaHistorico") %>% 
  html_table()

## [[1]]
##     X1     X2
## 1        2008
## 2  Jan  69,43
## 3  Fev  37,45
## 4  Mar  43,79
## 5  Abr  40,78
## 6  Mai  47,73
## 7  Jun  76,79
## 8  Jul  52,65
## 9  Ago  77,64
## 10 Set  54,15
## 11 Out 170,89
## 12 Nov 186,74
## 13 Dez  78,41
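
The table comes back as character columns, with the year in the first row and decimal commas in the values. As a rough sketch of one way to tidy it, the snippet below rebuilds the printed result by hand (no live request is made) and converts it to a numeric data frame. The helper name clean_ena and the output column names are made up for illustration, and dropping "." as a thousands separator is an assumption about how larger values would be formatted.

```r
# Stand-in for the first element of the html_table() result printed above;
# the values are copied from that output.
raw <- data.frame(
  X1 = c("", "Jan", "Fev", "Mar", "Abr", "Mai", "Jun",
         "Jul", "Ago", "Set", "Out", "Nov", "Dez"),
  X2 = c("2008", "69,43", "37,45", "43,79", "40,78", "47,73", "76,79",
         "52,65", "77,64", "54,15", "170,89", "186,74", "78,41"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: row 1 holds the year, the remaining rows hold months.
clean_ena <- function(tab) {
  year   <- as.integer(tab[[2]][1])
  body   <- tab[-1, ]
  values <- body[[2]]
  # Assumption: "." (if present) is a thousands separator, "," is the decimal mark
  values <- gsub(".", "", values, fixed = TRUE)
  values <- sub(",", ".", values, fixed = TRUE)
  data.frame(
    year  = year,
    month = body[[1]],
    value = as.numeric(values),
    stringsAsFactors = FALSE
  )
}

ena <- clean_ena(raw)
str(ena)
```

With the live result you would call clean_ena() on the first element of the list returned by html_table() instead of on raw.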
