
Web Scraping in R from Google Images

I have used the "rvest" package for web scraping for various purposes. Now I need to use it to get the source of an image (PNG) from Google Images. I tried the solution at this link: Web scraping of image. It does exactly what I want to do, so I came up with the code below, but my html_nodes call returns an empty object.

library("rvest")
page <- read_html("https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png")
node <- html_nodes(page,xpath='//*[@id="rg_s"]/div[1]/a/img')
src <-  html_attr(node,"src")

I also tried a CSS selector and the name of the image, as is done in the link I gave above. My node object is empty either way. I should also point out that I want to scrape the source of the very first image on the page, which has the XPath I wrote above. Thank you in advance.

I think it is working fine; the document that read_html returns probably just has no node matching the XPath selector you wrote. read_html fetches the raw HTML without running any JavaScript, so its structure can differ from what the browser's inspector shows.

Here for example I select all the <img> nodes and print them out:

library("rvest")
page <- read_html("https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png")
node <- html_nodes(page,xpath = '//img')
node

yielding:

{xml_nodeset (21)}
 [1] <img style="padding-top:2px" src="/textinputassistant/tia.png" onclick="(function(){var text_input_assistant_js='/textinputassistant/11/tr_tia.js';var s = document.createElement('s ...
 [2] <img height="113" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRg92_01ZbpYpV_agaHP4M3GoRoaCsZW5Sym8eqcXG8M1iJ8Nag1SXufq8" width="150" alt="manitou ile ilgili görsel s ...
 [3] <img height="98" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcSbJOecoEPbrJjZ-TjJMgMwlulXRMPLBWZX45vwUJNVXZk5MeY1chaZ07Y" width="143" alt="manitou ile ilgili görsel so ...
 [4] <img height="79" src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcStpgymO--9B7R3O3OZJFrDsuOUuP94HwwNw-av9tUyjziG3sCl6M9s7G4" width="141" alt="manitou ile ilgili görsel so ...
 [5] <img height="95" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTkibMqBWEifcyw_d-vrNob6UqYP-hDFPoQG2pkzVsP5bgmbReFWqyHjWA" width="143" alt="manitou ile ilgili görsel so ...
 [6] <img height="91" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRhqrV1f--7QrQwovNBUHIpDFHe8Zwwad3UIvnwppv74GRIrsI1XYNPkFOg" width="150" alt="manitou ile ilgili görsel s ...
 [7] <img height="112" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS1gpUEBucliP4WK2_22K4wElI2lIrDs2PZT7sRCLXK1Yxjg7DoQ2BtyLat" width="142" alt="manitou ile ilgili görsel  ...
 [8] <img height="69" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcSssiUhuZe_1YmQ9dwmYHdKoFXyQBj9IQPGX_LU8msjekOvRRHDG9FmoaD_" width="140" alt="manitou ile ilgili görsel s ...
 [9] <img height="113" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTCM9Mu6K63QpzNk20HFrHkybi--dw3JPu5JDd4LSEqz3UT5TBU5I0owLU" width="150" alt="manitou ile ilgili görsel s ...
[10] <img height="95" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcS8sI3fBSJmjftqC9Rx2bhXh_xgP3-nS2WuD2as9U_87SLxggQvmo2awDk" width="143" alt="manitou ile ilgili görsel so ...
[11] <img height="83" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcT-gf45JbC4Q4lD3hioj_CP6imrO5RUWBeW6IuygNaN8LM1qydX56l5gFx4" width="148" alt="manitou ile ilgili görsel s ...
[12] <img height="84" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcS6tnxPJYeS48IoNAlN0D52U5TNjmq7Ta-GcPNifM4_k40Y2D8LDj5-e-Wz" width="150" alt="manitou ile ilgili görsel s ...
[13] <img height="140" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcTwmI9PxfLBT2dCPnR04I9pXmK8V9whAI2yEv4dX5qQq8G_JxHUAOwQB1mSTg" width="140" alt="manitou ile ilgili görse ...
[14] <img height="71" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcQNx2Pe1AZtT-0XQ44HSurWO6O2syXrXG6YPfggtZsTHaf6YXuQlcmMOu0" width="150" alt="manitou ile ilgili görsel so ...
[15] <img height="130" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRGACLfeRm6U0xwSeYncSUDQtcd4noTewVF4aGnQcgz6TWYwwr917mjEtB6" width="113" alt="manitou ile ilgili görsel  ...
[16] <img height="107" src="https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcQ1RwAscQpzVXfquuAoPaLE9hFMuZSOpo6ckOzdpkTmg3KiswOIZIDTqrU" width="143" alt="manitou ile ilgili görsel s ...
[17] <img height="98" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcTE5sLf71TxAYla6nlfLRgXwL1IC-gXzXQRq1ZcnB21c5NXmQklJyNeqEs" width="148" alt="manitou ile ilgili görsel so ...
[18] <img height="91" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:ANd9GcRHQjJ-Hc0Muy6Vjw5OlQZocflSCqR3oz0GBRu3Bs7_JCoNyjr5vjNP7KZ4" width="137" alt="manitou ile ilgili görsel s ...
[19] <img height="68" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcR8R_39V3bxWJUDdNhrsAS6YOYEg6U-QpaLEV0MQ5GBnVkeZa9lSB5MaGU" width="149" alt="manitou ile ilgili görsel so ...
[20] <img height="99" src="https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTIrnwcUbo9WYT-gyvrLb5g4JFEc27odkzzU6SwzxrxvrsajRMD1OroUaY" width="116" alt="manitou ile ilgili görsel so ...
...
> 
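
Note that the alt text in the output above refers to "manitou", not "apple logo png". A likely reason is that everything after the # in the URL is a fragment, which the browser keeps to itself and never sends to the server. The sketch below uses only base R to split the question's URL and show which part is actually requested; the rebuilt URL at the end is an assumption for illustration, not a documented Google endpoint.

```r
# URL taken verbatim from the question
url <- "https://www.google.com.tr/search?q=manitou&espv=2&biw=1366&bih=662&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjCnJ6H2ITRAhWCQBoKHfQ5DUAQ_AUIBigB#tbm=isch&q=apple+logo+png"

# Split at the "#": the first piece is what an HTTP client requests,
# the second piece (the fragment) stays on the client side
pieces <- strsplit(url, "#", fixed = TRUE)[[1]]
request_url <- pieces[1]
fragment <- pieces[2]

request_url  # still contains q=manitou
fragment     # holds the "apple logo png" search

# Assumption: moving the search term into the query string itself
# makes the server see it; the exact parameters Google expects may differ
fixed_url <- paste0(
  "https://www.google.com.tr/search?tbm=isch&q=",
  utils::URLencode("apple logo png", reserved = TRUE)
)
```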

And here is the first node:

> node[[1]]
{xml_node} <img style="padding-top:2px"
src="/textinputassistant/tia.png" onclick="(function(){var
   text_input_assistant_js='/textinputassistant/11/tr_tia.js';var s =
  document.createElement('script');s.src =
  text_input_assistant_js;(document.getElementById('xjsc')||
  document.body).appendChild(s);})();" 
   alt="" height="23" width="27">
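
Since the first <img> node is a page widget rather than a search result, one way to get at the first actual image is to pull the src attribute from every <img> node and keep only the thumbnail URLs. The following is a minimal sketch, not a definitive solution: it runs on a small inline HTML string that stands in for the downloaded page (the tbn: tokens are made-up placeholders), and it assumes thumbnails can be recognised by the "https://encrypted-tbn" prefix seen in the output above.

```r
library("rvest")

# Hypothetical stand-in for the page returned by read_html(<search URL>);
# the structure mimics the output above, the tbn tokens are placeholders
html <- '<html><body>
  <img style="padding-top:2px" src="/textinputassistant/tia.png" alt="">
  <img height="113" width="150" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:PLACEHOLDER1" alt="first result">
  <img height="98" width="143" src="https://encrypted-tbn1.gstatic.com/images?q=tbn:PLACEHOLDER2" alt="second result">
</body></html>'
page <- read_html(html)

# Select every <img> node and extract its src attribute
node <- html_nodes(page, xpath = "//img")
src <- html_attr(node, "src")

# Assumption: result thumbnails are the ones served from encrypted-tbn*.gstatic.com
thumbs <- src[grepl("^https://encrypted-tbn", src)]

# Source of the first search-result thumbnail
first_thumb <- thumbs[1]
first_thumb
```

With the real page, replacing the inline string by the search URL should give the same kind of vector, although the markup Google serves can change at any time.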

