
How to get Google Trends top 10 search terms in R?

In R, I would like to get the top 10 search terms from Google Trends for a given category. For example, the top 10 search terms for the category automotive are shown at this URL:

url <- "https://www.google.com/trends/explore#cat=0-47&geo=US&cmpt=q&tz=Etc%2FGMT-1"

To retrieve the search terms I tried the following:

library("rvest")
top_searches <- url %>%
  read_html() %>%
  html_nodes(xpath='//*[@class="trends-bar-chart-name"]') %>%
  html_table()

This code, however, yields an empty list (note that I used SelectorGadget to find the XPath).

This is what you need:

library("rvest")

url <- 'http://www.google.com/trends/fetchComponent?hl=pl&cat=0-47&geo=US&cmpt=q&tz=Etc/GMT-1&tz=Etc/GMT-1&content=1&cid=TOP_ENTITIES_0_0&export=5&w=300&h=420'

top_searches <- url %>%
  read_html() %>% 
  html_nodes(xpath='//*[@class="trends-bar-chart-name"]') %>% 
  html_text(trim=TRUE)
# [1] "Car - Transportation mode"             "Sales - Industry"                     
# [3] "Chevrolet - Automobile Company"        "Ford - Automobile Make"               
# [5] "Tire - Industry"                       "Craigslist Inc. - Advertising company"
# [7] "Truck - Truck"                         "Engine - Literature Subject"          
# [9] "Kelley Blue Book - Company"            "Toyota - Automobile Make" 

Read on if you are interested in why your approach didn't work and how I managed to solve that issue.


The problem

The problem is that what you are looking for is not in the xml_document object. The data you want is loaded dynamically, and rvest cannot cope with that: it only fetches the page source and retrieves whatever is there, without any client-side processing. As the author of rvest stated, in cases like this you must "reverse engineer the communications protocol and request the raw data directly from the server" or "use a package like RSelenium to automate a web browser".
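To see why the original attempt returned an empty list, here is a minimal sketch that needs no network access. The HTML string is an assumed stand-in for a page whose chart is filled in by JavaScript; it is not the real Google Trends source.

```r
library("rvest")

# Assumed stand-in for the static page source: the chart container is empty
# because a script would populate it only inside a browser.
static_source <- '<html><body>
  <div id="chart"></div>
  <script>/* client-side code would insert the chart rows here */</script>
</body></html>'

nodes <- read_html(static_source) %>%
  html_nodes(xpath = '//*[@class="trends-bar-chart-name"]')

length(nodes)              # 0: the static source contains no matching nodes
length(html_table(nodes))  # 0: so html_table() has nothing to convert
```

The XPath itself is fine; there is simply nothing for it to match until a browser has run the page's scripts.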

Fortunately, the first solution proved to be relatively easy.

Reverse-engineering Google Trends

On the Google page that you linked to, right below the chart you are interested in, there is a small icon: </> . Clicking it gives you an HTML snippet that can be used to embed that chart on your own website.

This snippet basically executes JavaScript code that creates an <iframe> element displaying the content of http://www.google.com/trends/...&export=5&w=300&h=420 . As it turns out, that page contains the data you are after.
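Since the question asks about "a given category", the full URL from the code at the top of this answer could be rebuilt with the category as a parameter. The sketch below is only that: build_trends_url() is a hypothetical helper, the parameter names are copied from that URL rather than from any documented API, the duplicated tz parameter is written once, and the endpoint may not respond at all.

```r
# Hypothetical helper: reassembles the embed URL used at the top of this answer.
# The query parameter names and default values are copied from that URL;
# none of this is a documented Google API.
build_trends_url <- function(cat = "0-47", geo = "US", hl = "pl") {
  params <- c(
    hl = hl, cat = cat, geo = geo, cmpt = "q", tz = "Etc/GMT-1",
    content = "1", cid = "TOP_ENTITIES_0_0", export = "5", w = "300", h = "420"
  )
  # Join "name=value" pairs with "&" to form the query string.
  query <- paste0(names(params), "=", params, collapse = "&")
  paste0("http://www.google.com/trends/fetchComponent?", query)
}

build_trends_url()              # same parameters as the URL used above
build_trends_url(cat = "0-12")  # "0-12" is an arbitrary illustrative value
```

The result can be piped into read_html() exactly like the hard-coded url, subject to the caveats in the next section.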

However, you should realize that Google chose to publish only the first HTML snippet, not the URL behind it, and you should be fully aware of the consequences of that.

Why this is a bad idea

First, there are no promises further down the road. The HTML under the </> icon will keep working until Google decides to shut down Trends embedding, because they must support sites that added this snippet and then forgot about the whole thing. But the content of the script that is called, the URL of the embedded HTML page, or the HTML structure might change whenever Google feels like it. The code above might stop working tomorrow.

Second, Google decided that they don't want people to call this URL directly. You can do it, although common courtesy says you shouldn't. If you decide to do it anyway, you should not abuse it. It's anyone's guess what counts as "abuse".

Minor R code improvements

Back to the R code: I called the html_text() function instead of html_table(). That is because html_nodes() returns a list of <span> elements, not a <table> element.
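A small offline illustration of that point follows. The HTML string is an assumed imitation of the embedded page, reusing two of the terms from the output above; the real markup may differ.

```r
library("rvest")

# Assumed imitation of the embedded page: each term sits in its own <span>.
embedded_source <- '<html><body>
  <span class="trends-bar-chart-name"> Car - Transportation mode </span>
  <span class="trends-bar-chart-name"> Ford - Automobile Make </span>
</body></html>'

nodes <- read_html(embedded_source) %>%
  html_nodes(xpath = '//*[@class="trends-bar-chart-name"]')

html_name(nodes)                     # "span" "span": no <table> to parse
terms <- html_text(nodes, trim = TRUE)
terms
# [1] "Car - Transportation mode" "Ford - Automobile Make"
```

Because the matches are plain <span> elements, extracting their text (with trim = TRUE to drop surrounding whitespace) is the appropriate step; html_table() is meant for <table> nodes.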
