R Selecting column in a data frame by column in another data frame
I am having trouble subsetting my data; maybe you can help. I need to subset the first data frame by a column, keeping the rows where that column's value equals the value of a column in the second data frame.
These are the data frames I'm using:
> head(places)
Zona Poble lat lon alt
1 1 Zorita 40.7353 -0.165748 691.867
2 1 Morella 40.6287 -0.113284 955.719
3 1 Forcall 40.6621 -0.209759 753.882
4 2 Benasal 40.3943 -0.126111 848.171
5 2 Cati 40.4532 0.060409 667.610
6 2 Fredes 40.7079 0.167981 1194.730
> head(data)
date time stat_id lat lon tempc
1 20121122 000000 1 40.7353 -0.1657 7.98737
2 20121122 000000 2 40.6287 -0.1133 6.49903
3 20121122 000000 3 40.6621 -0.2098 7.72955
4 20121122 000000 4 40.3943 -0.1261 7.98837
5 20121122 000000 5 40.4532 0.0604 10.35480
6 20121122 000000 6 40.7079 0.1680 6.00769
As you can see, the first three places in the data frame "places" belong to Zona == 1 and share lat/lon with the first three rows of the data frame "data". I would like to select the rows in data that share lat/lon with Zona == i in places.dat.
The R script I am trying is:
datos=read.table("data.dat",header=T)
places=read.table("places.dat",header=T)
data=as.data.frame(datos)
place=as.data.frame(places)
data$time[data$time == 0] = "000000"
subset(data,data$lat == place$lat[place$Zona == 1])
So, subset should show three rows for each time in data.dat, but it is only selecting two of the three, as follows:
> subset(data,data$lat == place$lat[place$Zona == 1])
date time stat_id lat lon tempc
1 20121122 000000 1 40.7353 -0.1657 7.98737
2 20121122 000000 2 40.6287 -0.1133 6.49903
385 20121122 30000 1 40.7353 -0.1657 7.00632
386 20121122 30000 2 40.6287 -0.1133 4.83684
769 20121122 60000 1 40.7353 -0.1657 6.55283
770 20121122 60000 2 40.6287 -0.1133 4.85467
1153 20121122 90000 1 40.7353 -0.1657 6.35216
1154 20121122 90000 2 40.6287 -0.1133 5.66342
1537 20121122 120000 1 40.7353 -0.1657 11.47750
1538 20121122 120000 2 40.6287 -0.1133 10.30310
1921 20121122 150000 1 40.7353 -0.1657 13.87090
1922 20121122 150000 2 40.6287 -0.1133 11.90640
2305 20121122 180000 1 40.7353 -0.1657 10.30840
2306 20121122 180000 2 40.6287 -0.1133 7.61322
2689 20121122 210000 1 40.7353 -0.1657 6.29745
2690 20121122 210000 2 40.6287 -0.1133 6.63173
3073 20121123 000000 1 40.7353 -0.1657 4.78633
3074 20121123 000000 2 40.6287 -0.1133 5.31070
3457 20121123 30000 1 40.7353 -0.1657 6.84001
3458 20121123 30000 2 40.6287 -0.1133 6.88369
3841 20121123 60000 1 40.7353 -0.1657 5.71790
I'm surely missing something; could you help me? Any idea or hint will be appreciated.
Thanks
Data files are available here:
EDIT: Following the answer from @AR, I tried this code to select the data, but I'm not sure it is the right way.
for(i in 1:128) {
for(j in 1:2) {
a=sqrt((place$lat[i]-datos$lat[j])^2+(place$lon[i]-datos$lon[j])^2)
n=which.min(a)
while(n <= 9344) {
b=cbind(i,n,datos$tempc[n],place$Zona[i])
n=n+128
}
}
}
and get:
> b
i n
[1,] 128 9217 10.1198 30
It gives only the value for the last i; I would like to save all of them. I'm sure it is something basic, but I can't figure it out. Please be patient, as I'm not an experienced R user. Thanks again.
First, you need to round the lon values in places to 4 decimal places. This is probably the reason why you are having problems:
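datos=read.table("data.dat",header=T)
places=read.table("places.dat",header=T)
# round only the lon column; assigning the result to places itself would replace the whole data frame with a vector
places$lon=round(places$lon,digits=4)
# note: == compares element by element, so this relies on the rows of datos and places lining up
datos[which((datos$lat==places$lat & datos$lon==places$lon) & places$Zona==1),]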
The result for this condition is a total of 146 points.
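Because `==` compares two vectors element by element (recycling the shorter one), it does not test whether a coordinate appears anywhere in the other data frame. One possible alternative, sketched here rather than taken from the original answer, is to build a text key from coordinates formatted to 4 decimals and test membership with `%in%`. The toy data frames copy the first rows shown in the question; the helper `key` and the names `zona1` and `sel` are made up for illustration.

```r
# Toy data copied from the head() output in the question
places <- data.frame(
  Zona  = c(1, 1, 1, 2, 2, 2),
  Poble = c("Zorita", "Morella", "Forcall", "Benasal", "Cati", "Fredes"),
  lat   = c(40.7353, 40.6287, 40.6621, 40.3943, 40.4532, 40.7079),
  lon   = c(-0.165748, -0.113284, -0.209759, -0.126111, 0.060409, 0.167981)
)
datos <- data.frame(
  stat_id = 1:6,
  lat     = c(40.7353, 40.6287, 40.6621, 40.3943, 40.4532, 40.7079),
  lon     = c(-0.1657, -0.1133, -0.2098, -0.1261, 0.0604, 0.1680),
  tempc   = c(7.98737, 6.49903, 7.72955, 7.98837, 10.35480, 6.00769)
)

# Hypothetical helper: a text key from lat/lon formatted to 4 decimals
key <- function(lat, lon) sprintf("%.4f_%.4f", lat, lon)

# Keep the rows of datos whose key appears among the Zona == 1 places
zona1 <- places[places$Zona == 1, ]
sel   <- datos[key(datos$lat, datos$lon) %in% key(zona1$lat, zona1$lon), ]
print(sel)
```

With the real files, `datos` holds every time step, so each matching station would be returned once per time step. This still depends on both files rounding to the same 4th decimal, which is an assumption about the data.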
Edit 1 (following a comment by Sean)
I assumed in my answer that, in places, lat was already rounded and lon was not.
But as Sean pointed out, comparing floats for equality is not a good idea. It's better to calculate the distance between each places point and each datos point, and take the datos point with the smallest distance as the match, provided that distance is below a chosen threshold (e.g. half of the distance between neighbouring points in datos).
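To illustrate why exact equality on floats is fragile, here is a small sketch using only base R. The tolerance `tol` is an arbitrary value chosen for this example, not something taken from the data files.

```r
# Two values that look equal on paper are not identical as doubles
x <- 0.1 + 0.2
exact  <- (x == 0.3)                 # exact comparison
approx <- isTRUE(all.equal(x, 0.3))  # comparison up to a small tolerance
print(c(exact = exact, approx = approx))

# The same idea applied to the first lon values from the question:
# -0.165748 (places) versus -0.1657 (data)
tol   <- 1e-4                        # assumed tolerance, in degrees
close <- abs(-0.165748 - (-0.1657)) < tol
print(close)
```

Comparing with a tolerance (or with a distance threshold, as suggested above) avoids depending on how each file happened to round its coordinates.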
Edit 2
Try something like this:
# for each row of places, find the closest row of data and the distance to it
np=dim(places)[1]
data.p=integer(np)
n=numeric(np)
for(i in 1:np) {
  # Euclidean distance (in degrees) from place i to every row of data
  a=sqrt((places$lat[i]-data$lat)^2+(places$lon[i]-data$lon)^2)
  data.p[i]=which.min(a)
  n[i]=min(a)
}
# one row per place: place index, matching data row, distance, temperature, zone
b=cbind(places=1:np,data=data.p,distance=n,tempc=data$tempc[data.p],Zona=places$Zona)
Then run some queries:
b[which(b[,3]<1),]
b[which(b[,3]<0.00001),]
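To get every time step for one zone, one option is to match each place to its nearest station once and then subset by `stat_id`. The following is a minimal sketch under stated assumptions: the toy data repeat the six stations from the question over two time steps, `max_dist` is an arbitrary threshold, and the names `stations`, `match_tab`, `ids` and `zona1_rows` are hypothetical.

```r
# Toy data: the six places and six stations from the question, two time steps
places <- data.frame(
  Zona = c(1, 1, 1, 2, 2, 2),
  lat  = c(40.7353, 40.6287, 40.6621, 40.3943, 40.4532, 40.7079),
  lon  = c(-0.165748, -0.113284, -0.209759, -0.126111, 0.060409, 0.167981)
)
datos <- data.frame(
  time    = rep(c("000000", "30000"), each = 6),
  stat_id = rep(1:6, times = 2),
  lat     = rep(c(40.7353, 40.6287, 40.6621, 40.3943, 40.4532, 40.7079), times = 2),
  lon     = rep(c(-0.1657, -0.1133, -0.2098, -0.1261, 0.0604, 0.1680), times = 2)
)

# One row per station, so the distances are computed only once
stations <- unique(datos[, c("stat_id", "lat", "lon")])

# Distance matrix: rows are places, columns are stations (Euclidean, in degrees)
d <- sqrt(outer(places$lat, stations$lat, "-")^2 +
          outer(places$lon, stations$lon, "-")^2)

nearest  <- apply(d, 1, which.min)  # closest station for each place
distance <- apply(d, 1, min)        # distance to that station

max_dist  <- 1e-3                   # assumed threshold, in degrees
match_tab <- data.frame(place    = seq_len(nrow(places)),
                        Zona     = places$Zona,
                        stat_id  = stations$stat_id[nearest],
                        distance = distance)
match_tab <- match_tab[match_tab$distance < max_dist, ]

# All rows of datos (every time step) for the stations matched to Zona == 1
ids        <- match_tab$stat_id[match_tab$Zona == 1]
zona1_rows <- datos[datos$stat_id %in% ids, ]
print(zona1_rows)
```

Using `outer()` replaces the inner loop with one matrix operation. For very large inputs the full distance matrix may not fit in memory, in which case the per-place loop is the safer choice.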