Replacing nested loop in R
I am very new to R and searched the forums for this, but couldn't find a close enough solution. I am trying to map IP addresses to their corresponding geolocations. I have 2 data sets.
Set-a (160,000 rows):
ip(int) | ID(int)
Set-b (about 1,600,000 rows):
Ip1(int) | Ip2(int) | Code(str) | Country(str) | Area1(str) | Area2(str)
I am trying to do the following: if ip lies between Ip1 & Ip2, then add Country & Region to Set-a.
I am doing the following (obviously not a very good way to do this):
ip1<-as.numeric(b$Ip1)
ip2<-as.numeric(b$Ip2)
country<-b$Country
area1<-b$Area1
area2<-b$Area2
for(i in 1:160000){
  for(j in 1:1674303){
    # compare the ip column of a (not a[i], which would select a whole column)
    if(a$ip[i]>ip1[j] & a$ip[i]<ip2[j]) {
      a$country[i]<-country[j]
      a$area1[i]<-area1[j]
      a$area2[i]<-area2[j]
    }
  }
}
Can someone please tell me an efficient way to do this? This is taking a lot of time (i = 1 to 100 took some 10 minutes to run).
The sample data for set-b is:
Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"
The ranges are in continuously increasing order.
The dput(head(a)) & dput(head(b)) respectively are (refer to the sample data above):
structure(list(IP_Addr = c("38825563", "38921619", "42470287", "42471923","42473368","42473428"),
Desc_value = c("0", "1.2", "4.97", "1", "5.9", "22.06")), .Names = c("IP_Addr", "Desc_value"), row.names = c(NA, 6L), class = "data.frame")
structure(list(Ip1 = c("0", "16777216", "16777472", "16778240",
"16778496", "16778752"), Ip2 = c("16777215", "16777471", "16778239",
"16778495", "16778751", "16779263"), Code = c("-", "AU", "CN",
"AU", "AU", "AU"), Country = c("-", "AUSTRALIA", "CHINA", "AUSTRALIA",
"AUSTRALIA", "AUSTRALIA"), Area1 = c("-", "QUEENSLAND", "FUJIAN",
"VICTORIA", "NEW SOUTH WALES", "-"), Area2 = c("-", "SOUTH BRISBANE",
"FUZHOU", "MELBOURNE", "SYDNEY", "-")), .Names = c("Ip1", "Ip2",
"Code", "Country", "Area1", "Area2"), row.names = c(NA, 6L), class = "data.frame")
Here's a data.table solution:
# Let's take Blue Magister's example set:
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')
b$Ip1 <-as.numeric(b$Ip1)
# include library, convert to data.table
library(data.table)
a = data.table(a)
b = data.table(b, key = "Ip1")
# and now the actual computation
a = b[a, roll = Inf][, Ip2 := NULL] # yep, amazingly, it's *that* simple in data.table
setnames(a, "Ip1", "ip") # you can also include, exclude whatever columns you want
a
# ip Code Country Area1 Area2
# 1: 16777995 CN CHINA FUJIAN FUZHOU
# 2: 16777687 CN CHINA FUJIAN FUZHOU
# 3: 16777871 CN CHINA FUJIAN FUZHOU
# 4: 16778280 AU AUSTRALIA VICTORIA MELBOURNE
# 5: 16777346 AU AUSTRALIA QUEENSLAND SOUTH BRISBANE
# 6: 16777562 CN CHINA FUJIAN FUZHOU
# 7: 16777637 CN CHINA FUJIAN FUZHOU
# 8: 16777634 CN CHINA FUJIAN FUZHOU
# 9: 16778161 CN CHINA FUJIAN FUZHOU
#10: 16777875 CN CHINA FUJIAN FUZHOU
Had Ip1 been an exhaustive list of numbers that ip could match, the above would simply be a merge (of Ip1 in b with the first column of a, i.e. ip), but data.table also provides an option for what to do when there is no exact match. You can tell it to, e.g., roll the previous observation forward (which is what I did above), roll it back, or roll to the nearest observation. See ?data.table for a little more information.
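One thing a rolling join does not do by itself is check the upper bound Ip2: an ip that falls in a gap between two ranges still rolls to the previous range. The sketch below shows one way to guard against that. The toy tables `ranges` and `ips`, their values, and the deliberate gap between 20 and 30 are all made up for illustration, and `on = .(Ip1 = ip)` is used instead of relying on column order.

```r
library(data.table)

# Hypothetical toy ranges with a deliberate gap between 20 and 30
ranges <- data.table(Ip1 = c(0, 10, 30),
                     Ip2 = c(9, 20, 39),
                     Country = c("A", "B", "C"))
ips <- data.table(ip = c(5, 15, 25, 35))

# Rolling join: each ip is matched to the last Ip1 that is <= ip
res <- ranges[ips, on = .(Ip1 = ip), roll = Inf]
setnames(res, "Ip1", "ip")  # the join column holds the ip values

# ip = 25 rolled to the 10-20 range but lies beyond its Ip2, so blank it out
res[ip > Ip2, Country := NA_character_]
res$Country
```

If the ranges are known to be contiguous, as in the sample data, this extra check should be unnecessary.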
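Couldn't you remove the 2nd loop by using something like the following inside the loop over i (where x is the vector of ip values)?
j = intersect(which(ip1 < x[i]), which(ip2 > x[i]))
if (length(j)==1){
  a$country[i]<-country[j]
  a$area1[i]<-area1[j]
  a$area2[i]<-area2[j]
}else{
  # length(j) is 0 (no match) or greater than 1 (several matches)
  cat("No unique match found!\n")
}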
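As a rough, self-contained sketch of the same idea, the lookup for a single ip can be wrapped in a small function and applied over all ips. The helper name `match_range` and the toy vectors are hypothetical, and the comparison uses inclusive bounds (`<=`) on the assumption that Ip1 and Ip2 themselves belong to the range.

```r
# Hypothetical helper: index of the single range containing ip, or NA otherwise
match_range <- function(ip, ip1, ip2) {
  j <- which(ip1 <= ip & ip <= ip2)   # one vectorized scan instead of an inner loop
  if (length(j) == 1L) j else NA_integer_
}

# Toy ranges taken from the first rows of the sample data
ip1 <- c(0, 16777216, 16777472)
ip2 <- c(16777215, 16777471, 16778239)
country <- c("-", "AUSTRALIA", "CHINA")

# The last ip is outside every range
ips <- c(16777300, 16777500, 99999999)
idx <- vapply(ips, match_range, integer(1), ip1 = ip1, ip2 = ip2)
matched <- country[idx]   # NA where no unique range was found
```

This still scans all ranges once per ip, so it only removes the interpreted inner loop; it does not change the overall amount of comparison work.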
I would try findInterval:
#create example
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')
b$Ip1 <-as.numeric(b$Ip1)
indices <- findInterval(a$ip,b$Ip1,rightmost.closed=FALSE,all.inside=FALSE)
a <- data.frame(a,b[indices,c("Country","Area1","Area2")])
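Note that findInterval only looks at the lower bounds, and it returns 0 for values below the first one. A minimal sketch of how both cases might be handled follows; the toy vectors, with gaps between the ranges, are invented for illustration.

```r
# Hypothetical toy ranges with gaps (e.g. nothing covers 30-39)
ip1 <- c(10, 20, 40)
ip2 <- c(19, 29, 49)
country <- c("A", "B", "C")
ips <- c(5, 12, 29, 35, 45)

idx <- findInterval(ips, ip1)   # position of the last ip1 that is <= ip
idx[idx == 0] <- NA             # 0 means ip is below the first range

# Keep a match only if ip also respects the upper bound of its range
hit <- !is.na(idx) & ips <= ip2[idx]
result <- ifelse(hit, country[idx], NA_character_)
```

Because findInterval uses a binary search on the sorted lower bounds, this approach should scale far better than comparing every ip with every range.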