I have written a function that compares consecutive IP addresses and lets the user choose how many octets to compare. For example, given the addresses 255.255.255.0 and 255.255.255.1, a user could specify that they only want to compare the first octet, the first and second, the first through third, and so on.
The function is below:
did.change.ip=function(vec, detail){
  counter=2
  result.vec=FALSE
  r.list=strsplit(vec, '.', fixed=TRUE)
  for(i in vec){
    if(counter>length(vec)){
      break
    }
    first=as.numeric(r.list[[counter-1]][1:detail])
    second=as.numeric(r.list[[counter]][1:detail])
    if(sum(first==second)==detail){
      result.vec=append(result.vec,FALSE)
    }
    else{
      result.vec=append(result.vec,TRUE)
    }
    counter=counter+1
  }
  return(result.vec)
}
It gets really slow as the data grows. For a dataset of 500,000 rows, the system.time() results are:
   user  system elapsed
 208.36    0.59  209.84
Are there any R power users who have insight on how to write this more efficiently? I know lapply() is the preferred method for looping over vectors and data frames, but I'm stumped as to how to access the previous element in a vector for this purpose. I tried to sketch something out quickly, but it returns a syntax error:
test=function(vec, detail){
rlist=strsplit(vec, '.', fixed=TRUE)
r.value=vapply(rlist, function(x,detail) ifelse(x[1:detail]==x[1:detail] TRUE, FALSE))
}
I've created some sample data for testing purposes below:
stack.data=structure(list(V1 = c("247.116.209.66", "195.121.47.105", "182.136.49.12",
"237.123.100.50", "120.30.174.18", "29.85.72.70", "18.186.76.177",
"33.248.142.26", "109.97.92.50", "217.138.155.145", "20.203.156.2",
"71.1.51.190", "31.225.208.60", "55.25.129.73", "211.204.249.244",
"198.137.15.53", "234.106.102.196", "244.3.87.9", "205.242.10.22",
"243.61.212.19", "32.165.79.86", "190.207.159.147", "157.153.136.100",
"36.151.152.15", "2.254.210.246", "3.42.1.208", "30.11.229.18",
"72.187.36.103", "98.114.189.34", "67.93.180.224")), .Names = "V1", class = "data.frame", row.names = c(NA,
-30L))
Here's another solution just using base R.
did.change.ip <- function(vec, detail=4){
  ipv <- scan(text=paste(vec, collapse="\n"),
              what=c(replicate(detail, integer()), replicate(4-detail,NULL)),
              sep=".", quiet=TRUE)
  c(FALSE, rowSums(vapply(ipv[!sapply(ipv, is.null)],
                          diff, integer(length(vec)-1))!=0)>0)
}
Here we use scan() to break each IP address into its numeric octets. Then we look down each octet for differences using diff. This seems faster than the original proposal, but slightly slower than @josilber's stringr solution (using microbenchmark with 3,000 IP addresses):
Unit: milliseconds
expr min lq median uq max neval
orig 35.251886 35.716921 36.019354 36.700550 90.159992 100
scan 2.062189 2.116391 2.170110 2.236658 3.563771 100
strngr 2.027232 2.075018 2.136114 2.200096 3.535227 100
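To make the diff step more concrete, here is a minimal sketch on three made-up addresses. It uses strsplit() and a matrix rather than scan(), purely for illustration; the addresses and the variable names are my own assumptions, not part of the benchmark above.

```r
# Three made-up addresses (illustrative only, not from the sample data)
ips <- c("10.0.0.1", "10.0.0.2", "10.1.0.2")
detail <- 3

# One row per address, one column per octet
octets <- matrix(as.integer(unlist(strsplit(ips, ".", fixed = TRUE))),
                 ncol = 4, byrow = TRUE)

# diff() on a matrix subtracts each row from the next, column by column
d <- diff(octets[, seq_len(detail), drop = FALSE])

# A row "changed" if any of its first `detail` octets differs from the
# previous row; the first address has no predecessor, so it is FALSE
changed <- c(FALSE, rowSums(d != 0) > 0)
changed
# [1] FALSE FALSE  TRUE
```

The third address differs from the second in its second octet, so only the last element is TRUE.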
The simplest way I can think of to do this is to build a transformed vector that only includes the parts of the IP you want. Then it's a one-liner to check if each element is equal to the one before it:
library(stringr)
did.change.josilber <- function(vec, detail) {
  s <- str_extract(vec, paste0("^(\\d+\\.){", detail, "}"))
  return(s != c(s[1], s[1:(length(s)-1)]))
}
This seems reasonably efficient for 500,000 rows:
set.seed(144)
big.vec <- sample(stack.data[,1], 500000, replace=T)
system.time(did.change.josilber(big.vec, 3))
# user system elapsed
# 0.527 0.030 0.554
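If you would rather avoid the stringr dependency, the same "transformed vector" idea can be sketched in base R. The function name did.change.prefix is hypothetical, and I have not benchmarked this version, so treat it as an illustration rather than a faster alternative. One difference worth noting: the regular expression above expects a dot after every octet, so as far as I can tell it would not match with detail = 4, whereas pasting the first `detail` pieces back together works for 1 through 4.

```r
# Hypothetical helper name; base R only
did.change.prefix <- function(vec, detail = 4) {
  # Split each address into its octets
  parts <- strsplit(vec, ".", fixed = TRUE)
  # Keep only the first `detail` octets of each address, re-joined with dots
  s <- vapply(parts,
              function(p) paste(p[seq_len(detail)], collapse = "."),
              character(1))
  # Compare each element with the one before it; the first element has no
  # predecessor, so it is reported as FALSE
  c(FALSE, s[-1] != s[-length(s)])
}

did.change.prefix(c("10.0.0.1", "10.0.0.2", "10.1.0.2"), 3)
# [1] FALSE FALSE  TRUE
```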
The biggest issue with your code is that you call append on each iteration, which requires reallocating your vector 500,000 times. You can read more about this in the second circle of The R Inferno.
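If you did want to keep the explicit loop, a minimal sketch of the preallocation fix might look like the following. The name did.change.prealloc is hypothetical, and I have not timed it; the point is only to show the result vector being allocated once and filled in place instead of grown with append.

```r
# Hypothetical name; same comparison as the question's loop, but the result
# vector is allocated once up front
did.change.prealloc <- function(vec, detail) {
  r.list <- strsplit(vec, ".", fixed = TRUE)
  n <- length(vec)
  # logical(n) is all FALSE; element 1 stays FALSE (it has no predecessor)
  result.vec <- logical(n)
  for (i in seq_len(n)[-1]) {
    first  <- as.numeric(r.list[[i - 1]][seq_len(detail)])
    second <- as.numeric(r.list[[i]][seq_len(detail)])
    # Assign in place instead of calling append()
    result.vec[i] <- !all(first == second)
  }
  result.vec
}

did.change.prealloc(c("10.0.0.1", "10.0.0.2", "10.1.0.2"), 3)
# [1] FALSE FALSE  TRUE
```

The vectorized version above should still be the better choice for large inputs, since it avoids the R-level loop entirely.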
Not sure if all you want is counts, but this is potentially a solution:
library(dplyr)
library(tidyr)
# split ip addresses into "octets"
octets <- stack.data %>%
  separate(V1,c("first","second","third","fourth"))
# how many shared both their first and second octets?
octets %>%
  group_by(first,second) %>%
  summarize(n = n())
first second n
1 109 97 1
2 120 30 1
3 157 153 1
4 18 186 1
5 182 136 1
6 190 207 1
7 195 121 1
8 198 137 1
9 2 254 1
10 20 203 1
11 205 242 1
12 211 204 1
13 217 138 1
14 234 106 1
15 237 123 1
16 243 61 1
17 244 3 1
18 247 116 1
19 29 85 1
20 3 42 1
21 30 11 1
22 31 225 1
23 32 165 1
24 33 248 1
25 36 151 1
26 55 25 1
27 67 93 1
28 71 1 1
29 72 187 1
30 98 114 1