
Geocode IP addresses in R

I wrote this short piece of code to automate geocoding of IP addresses using freegeoip.net (15,000 queries per hour by default; excellent service!):

> library(RCurl)
Loading required package: bitops
> ip.lst = c("193.198.38.10","91.93.52.105","134.76.194.180","46.183.103.8")
> q = do.call(rbind, lapply(ip.lst, function(x){ 
  try( data.frame(t(strsplit(getURI(paste0("freegeoip.net/csv/", x)), ",")[[1]]), stringsAsFactors = FALSE) ) 
}))
> names(q) = c("ip","country_code","country_name","region_code","region_name","city","zip_code","time_zone","latitude","longitude","metro_code")
> str(q)
'data.frame':   4 obs. of  11 variables:
$ ip          : chr  "193.198.38.10" "91.93.52.105" "134.76.194.180" "46.183.103.8"
$ country_code: chr  "HR" "TR" "DE" "DE"
$ country_name: chr  "Croatia" "Turkey" "Germany" "Germany"
$ region_code : chr  "" "06" "NI" ""
$ region_name : chr  "" "Ankara" "Lower Saxony" ""
$ city        : chr  "" "Ankara" "Gottingen" ""
$ zip_code    : chr  "" "06450" "37079" ""
$ time_zone   : chr  "Europe/Zagreb" "Europe/Istanbul" "Europe/Berlin" ""
$ latitude    : chr  "45.1667" "39.9230" "51.5333" "51.2993"
$ longitude   : chr  "15.5000" "32.8378" "9.9333" "9.4910"
$ metro_code  : chr  "0\r\n" "0\r\n" "0\r\n" "0\r\n"

These three lines of code return coordinates for all IPs, along with city and country codes. Could this be parallelized so that it runs even faster? Geocoding more than 10,000 IPs can otherwise take hours.
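One detail in the `str(q)` output above is that `metro_code` still carries the trailing `"\r\n"` of each response, and `latitude`/`longitude` come back as character. A minimal sketch of a parsing step that cleans this up is shown below. `parse_geo_csv()` and `geo_fields` are hypothetical names introduced only for illustration, and the sample line is assembled from the field values printed above rather than fetched from the service. The plain comma split also assumes no field contains a quoted comma.

```r
# Field names taken from the names(q) assignment in the question.
geo_fields <- c("ip", "country_code", "country_name", "region_code",
                "region_name", "city", "zip_code", "time_zone",
                "latitude", "longitude", "metro_code")

# Hypothetical helper (not part of RCurl): turn one CSV response line
# into a one-row data frame with clean types.
parse_geo_csv <- function(line) {
  # Drop the trailing "\r\n" so it does not end up in the last field.
  line <- sub("[\r\n]+$", "", line)
  parts <- strsplit(line, ",", fixed = TRUE)[[1]]
  # strsplit() drops a trailing empty field, so pad back to 11 fields.
  length(parts) <- length(geo_fields)
  parts[is.na(parts)] <- ""
  out <- as.data.frame(as.list(setNames(parts, geo_fields)),
                       stringsAsFactors = FALSE)
  # Coordinates are more useful as numbers; zip_code stays character
  # so that leading zeros survive.
  out$latitude  <- as.numeric(out$latitude)
  out$longitude <- as.numeric(out$longitude)
  out
}

# Sample line built from the values shown in the output above.
sample_line <- paste0("91.93.52.105,TR,Turkey,06,Ankara,Ankara,06450,",
                      "Europe/Istanbul,39.9230,32.8378,0\r\n")
row <- parse_geo_csv(sample_line)
str(row)
```

Parsing each response separately like this also makes it easier to skip a failed lookup without disturbing the other rows.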

library(rgeolocate)

ip_lst = c("193.198.38.10", "91.93.52.105", "134.76.194.180", "46.183.103.8")

maxmind(ip_lst, "~/Data/GeoLite2-City.mmdb", 
        fields=c("country_code", "country_name", "region_name", "city_name", 
                 "timezone", "latitude", "longitude"))

##   country_code country_name            region_name  city_name        timezone latitude longitude
## 1           HR      Croatia                   <NA>       <NA>   Europe/Zagreb  45.1667   15.5000
## 2           TR       Turkey               Istanbul   Istanbul Europe/Istanbul  41.0186   28.9647
## 3           DE      Germany           Lower Saxony Bilshausen   Europe/Berlin  51.6167   10.1667
## 4           DE      Germany North Rhine-Westphalia     Aachen   Europe/Berlin  50.7787    6.1085

There are instructions in the package for obtaining the necessary data files. Some of the fields you're pulling are woefully inaccurate (more so than any geoip vendor would like to admit). If you do need ones that aren't available, file an issue and we'll add them.

I've found multidplyr to be a great package for making parallel server calls. This is the best guide I've found, and I highly recommend reading the whole thing to understand how the package works: http://www.business-science.io/code-tools/2016/12/18/multidplyr.html

library("devtools")
devtools::install_github("hadley/multidplyr")
library(parallel)
library(multidplyr)
library(RCurl)
library(tidyverse)

# Convert your example into a function
get_ip <- function(ip) {
  do.call(rbind, lapply(ip, function(x) {
    try(data.frame(t(strsplit(getURI(
      paste0("freegeoip.net/csv/", x)
    ), ",")[[1]]), stringsAsFactors = FALSE))
  })) %>% nest(X1:X11)
}

# Turn ip.lst into a tibble so that it works better with dplyr
ip.lst =
  tibble(
    ip = c(
      "193.198.38.10",
      "91.93.52.105",
      "134.76.194.180",
      "46.183.103.8",
      "193.198.38.10",
      "91.93.52.105",
      "134.76.194.180",
      "46.183.103.8"
    )
  )

# Create a cluster based on how many cores your machine has
cl <- detectCores()
cluster <- create_cluster(cores = cl)

# Create a partitioned tibble
by_group  <- partition(ip.lst, cluster = cluster)

# Load the libraries and define get_ip() on each worker in the cluster
by_group %>%
  cluster_library("tidyverse") %>%
  cluster_library("RCurl") %>%
  cluster_assign_value("get_ip", get_ip)

# Send parallel requests to the website and parse the results
q <- by_group %>%
  do(get_ip(.$ip)) %>% 
  collect() %>% 
  unnest() %>% 
  tbl_df() %>% 
  select(-PARTITION_ID)

# Set names of the results
names(q) = c(
  "ip",
  "country_code",
  "country_name",
  "region_code",
  "region_name",
  "city",
  "zip_code",
  "time_zone",
  "latitude",
  "longitude",
  "metro_code"
)
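
For comparison, the same split-the-work-and-collect idea can be sketched with only the base `parallel` package. This is a rough outline under stated assumptions, not the multidplyr approach above. `geocode_parallel()`, `safe_fetch()` and `fake_fetch()` are hypothetical names, and `fake_fetch()` is an offline stand-in that fabricates a placeholder line so the example runs without network access. In real use one would pass something like `function(ip) RCurl::getURI(paste0("freegeoip.net/csv/", ip))` as `fetch`.

```r
library(parallel)

# Hypothetical wrapper: a failed lookup becomes NA instead of
# aborting the whole run.
safe_fetch <- function(ip, fetch) {
  tryCatch(fetch(ip), error = function(e) NA_character_)
}

# Hypothetical helper: `fetch` is any function that takes one IP string
# and returns the raw CSV line for it.
geocode_parallel <- function(ips, fetch, workers = 2) {
  cl <- makeCluster(workers)
  # Always shut the workers down, even if a call fails.
  on.exit(stopCluster(cl), add = TRUE)
  # The IPs are divided among the workers; results come back in input order.
  raw <- parLapply(cl, ips, safe_fetch, fetch = fetch)
  data.frame(ip = ips, response = unlist(raw), stringsAsFactors = FALSE)
}

# Offline stand-in for the web service (placeholder values only).
fake_fetch <- function(ip) paste0(ip, ",XX,Nowhere,,,,,,0,0,0\r\n")

res <- geocode_parallel(c("193.198.38.10", "91.93.52.105"), fake_fetch)
print(res$ip)
```

Keep the service limit from the question in mind: 15,000 queries per hour is roughly four requests per second across all workers combined, so adding more workers only helps up to that rate.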
