Replacing nested loop in R

I am very new to R and have searched the forums, but couldn't find a close enough solution. I am trying to map IP addresses to their corresponding geo locations. I have 2 data sets.

Set-a (160,000 rows):
ip(int) | ID(int)

Set-b (1,600,000 rows):
Ip1(int) | Ip2(int) | Code(str) | Country(str) | Area1(str) | Area2(str)

I am trying to do the following: if ip lies between Ip1 & Ip2, then add Country & Region to Set-a.

I am doing the following (obviously not a very good way to do this):

ip1<-as.numeric(b$Ip1)
ip2<-as.numeric(b$Ip2)
country<-b$Country
area1<-b$Area1
area2<-b$Area2

for(i in 1:160000){
  for(j in 1:1674303){
    # compare the ip column of a against each range in b
    if(a$ip[i] > ip1[j] & a$ip[i] < ip2[j]) {
      a$country[i] <- country[j]
      a$area1[i] <- area1[j]
      a$area2[i] <- area2[j]
    }
  }
}

Can someone please tell me an efficient way to do this? It is taking a lot of time (i = 1 to 100 took about 10 minutes to run).

The sample data for set-b is:

Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"

It is in continuously increasing order.

The output of dput(head(a)) and dput(head(b)), respectively (see the sample data above):

structure(list(IP_Addr = c("38825563", "38921619", "42470287", "42471923", "42473368", "42473428"), 
 Desc_value = c("0", "1.2", "4.97", "1", "5.9", "22.06")), .Names = c("IP_Addr", "Desc_value"), row.names = c(NA, 6L), class = "data.frame")

structure(list(Ip1 = c("0", "16777216", "16777472", "16778240", 
"16778496", "16778752"), Ip2 = c("16777215", "16777471", "16778239", 
"16778495", "16778751", "16779263"), Code = c("-", "AU", "CN", 
"AU", "AU", "AU"), Country = c("-", "AUSTRALIA", "CHINA", "AUSTRALIA", 
"AUSTRALIA", "AUSTRALIA"), Area1 = c("-", "QUEENSLAND", "FUJIAN", 
"VICTORIA", "NEW SOUTH WALES", "-"), Area2 = c("-", "SOUTH BRISBANE", 
"FUZHOU", "MELBOURNE", "SYDNEY", "-")), .Names = c("Ip1", "Ip2", 
"Code", "Country", "Area1", "Area2"), row.names = c(NA, 6L), class = "data.frame")

Here's a data.table solution:

# Let's take Blue Magister's example set:
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')

b$Ip1 <-as.numeric(b$Ip1)

# include library, convert to data.table
library(data.table)

a = data.table(a)
b = data.table(b, key = "Ip1")

# and now the actual computation
a = b[a, roll = Inf][, Ip2 := NULL] # yep, amazingly, it's *that* simple in data.table
setnames(a, "Ip1", "ip")            # you can also include, exclude whatever columns you want
a
#          ip Code   Country      Area1          Area2
# 1: 16777995   CN     CHINA     FUJIAN         FUZHOU
# 2: 16777687   CN     CHINA     FUJIAN         FUZHOU
# 3: 16777871   CN     CHINA     FUJIAN         FUZHOU
# 4: 16778280   AU AUSTRALIA   VICTORIA      MELBOURNE
# 5: 16777346   AU AUSTRALIA QUEENSLAND SOUTH BRISBANE
# 6: 16777562   CN     CHINA     FUJIAN         FUZHOU
# 7: 16777637   CN     CHINA     FUJIAN         FUZHOU
# 8: 16777634   CN     CHINA     FUJIAN         FUZHOU
# 9: 16778161   CN     CHINA     FUJIAN         FUZHOU
#10: 16777875   CN     CHINA     FUJIAN         FUZHOU

Had Ip1 been an exhaustive list of the numbers that ip could match, the above would simply be a merge (of Ip1 in b with the first column of a, i.e. ip). But data.table also lets you choose what to do when there is no exact match: you can roll the previous observation forward (which is what I did above), roll the next one back, or roll to the nearest observation. See ?data.table for a little more information.
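
To make the three rolling options concrete, here is a minimal sketch on made-up toy ranges (the numbers and the names fwd, bwd and near are illustrative assumptions, not part of the answer above). It uses the on= argument available in more recent data.table versions instead of relying on the key order.

```r
library(data.table)

# Toy ranges keyed by their starting value (made-up numbers for illustration)
b <- data.table(Ip1 = c(0, 100, 200), Country = c("-", "AU", "CN"), key = "Ip1")
a <- data.table(ip = c(120, 260))

# roll = Inf: use the last Ip1 that is <= ip (previous observation rolled forward)
fwd <- b[a, on = .(Ip1 = ip), roll = Inf]$Country

# roll = -Inf: use the next Ip1 that is >= ip (next observation rolled backward)
bwd <- b[a, on = .(Ip1 = ip), roll = -Inf]$Country

# roll = "nearest": use whichever Ip1 is closest to ip
near <- b[a, on = .(Ip1 = ip), roll = "nearest"]$Country
```

For range lookups like the one in the question, rolling forward on the range start is the variant that fits, because each ip belongs to the range whose start is the largest value not exceeding it.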

Couldn't you remove the second (inner) loop by using which()?

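# rows of b whose range contains the i-th ip
j <- intersect(which(ip1 < a$ip[i]), which(ip2 > a$ip[i]))
if (length(j) == 1) {
  a$country[i] <- country[j]
  a$area1[i] <- area1[j]
  a$area2[i] <- area2[j]
} else {
  # reached for zero matches as well as for several
  cat("Expected exactly one match for row", i, "but found", length(j), "\n")
}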
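
As a rough sketch, the same idea can be wrapped in a small helper and applied to every ip at once. The helper name match_range is hypothetical, the toy data is a cut-down version of the sample above, and the bounds are treated as inclusive here, which is an assumption about how the ranges are meant to be read.

```r
# Toy ranges taken from the sample data
ip1 <- c(0, 16777216, 16777472)
ip2 <- c(16777215, 16777471, 16778239)
country <- c("-", "AUSTRALIA", "CHINA")
a <- data.frame(ip = c(16777346, 16779000, 16777472))

# Hypothetical helper: index of the single range containing ip, else NA
match_range <- function(ip, ip1, ip2) {
  j <- which(ip1 <= ip & ip2 >= ip)   # inclusive bounds (assumption)
  if (length(j) == 1L) j else NA_integer_
}

# One lookup per ip; the inner loop is replaced by vectorised comparisons
idx <- vapply(a$ip, match_range, integer(1), ip1 = ip1, ip2 = ip2)
a$country <- country[idx]             # NA where no unique range was found
```

This still compares every ip against every range, so it only removes the interpreted inner loop and does not change the overall amount of work.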

I would try findInterval:

#create example
set.seed(10)
a <- data.frame(ip=sample(16777216:16778751,10,replace=TRUE))
b <- read.table(sep=",",header=TRUE,text='Ip1, Ip2, Code, Country, Area1, Area2
"0","16777215","-","-","-","-"
"16777216","16777471","AU","AUSTRALIA","QUEENSLAND","SOUTH BRISBANE"
"16777472","16778239","CN","CHINA","FUJIAN","FUZHOU"
"16778240","16778495","AU","AUSTRALIA","VICTORIA","MELBOURNE"
"16778496","16778751","AU","AUSTRALIA","NEW SOUTH WALES","SYDNEY"')

b$Ip1 <-as.numeric(b$Ip1)
indices <- findInterval(a$ip,b$Ip1,rightmost.closed=FALSE,all.inside=FALSE)
a <- data.frame(a,b[indices,c("Country","Area1","Area2")])
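
Note that findInterval only looks at Ip1, so an ip that falls past the end of a range (in a gap, or beyond the last range) would still be assigned to the preceding row. If that can happen in your data, a sketch like the following adds a check against Ip2. The helper name lookup_ranges and the toy data are assumptions for illustration.

```r
# Hypothetical helper: find the range for each ip, then blank out
# matches where ip lies beyond the end (Ip2) of that range.
lookup_ranges <- function(ip, ranges) {
  # ranges must be sorted by Ip1 in increasing order for findInterval()
  idx <- findInterval(ip, ranges$Ip1)
  idx[idx == 0] <- NA                          # ip below the first Ip1
  miss <- !is.na(idx) & ip > ranges$Ip2[idx]   # ip past the end of its range
  idx[miss] <- NA
  ranges[idx, c("Country", "Area1", "Area2")]
}

# Toy data based on the sample above
b <- data.frame(
  Ip1 = c(0, 16777216, 16777472),
  Ip2 = c(16777215, 16777471, 16778239),
  Country = c("-", "AUSTRALIA", "CHINA"),
  Area1 = c("-", "QUEENSLAND", "FUJIAN"),
  Area2 = c("-", "SOUTH BRISBANE", "FUZHOU"),
  stringsAsFactors = FALSE
)
a <- data.frame(ip = c(16777346, 16777995, 16779000))

# The last ip lies beyond every range, so its columns come back as NA
a <- data.frame(a, lookup_ranges(a$ip, b), row.names = NULL)
```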
