How can I add additional columns to an existing data.frame, that are aligned on one specific column already in the data.frame?

I'm a new R user and I'm having trouble replicating a basic left join and update that I would normally do in SQL. I've checked several previously asked questions on Stack Overflow but still cannot quite get this code right.
I've been trying to build out a data.frame, starting with a single data.frame that contains all possible zip codes. I have several additional data.frames, each of which counts construction years over a certain range (say 1990-1999), grouped by zip code. Note that each of these data.frames covers only a subset of the zip codes in the first data.frame. Essentially, what I'm trying to do is build out a table, starting with the data.frame of all possible zip codes, and link each individual range data.frame to it so that my final table shows all ranges for each zip code. Each range data.frame needs to be aligned with the "ZIPS_ALL" variable. The 1990-1999, 2000-2009 and zip_codes_all data.frames are below:

    1990-1999           2000-2009         zip_codes_all
    ZIP     Count       ZIP     Count     ZIPS_ALL
    19145     1         19145     1       19145
    19146     2         19147     3       19146
    19147     2         19148     1       19147 
                                          19148

I've tried several different left_join calls from dplyr and merge from base R, but when I try to attach each range, it overwrites the previous range, so my final table has all zip codes and the final range only. I need to keep all ranges in my table so that the final table shows every zip code from "All Zip Codes", aligned with the ZIPS_ALL variable.

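    # R names cannot start with a digit, so the range tables are called
    # range_1990_1999 and range_2000_2009 here
    df_1990_1999 <- left_join(x = zip_codes_all, y = range_1990_1999,
                              by = c("ZIP_ALL" = "ZIP"))
    df_2000_2009 <- left_join(x = zip_codes_all, y = range_2000_2009,
                              by = c("ZIP_ALL" = "ZIP"))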

The expected result would have all range data.frames lined up with the data.frame of all possible zip codes, where missing entries would just be NA values. See below:

    1990-1999   2000-2009   zip_codes_all
    Count       Count       ZIPS_ALL
    1           1           19145
    2           NA          19146
    2           3           19147
    NA          1           19148

The dput output for my zip_codes_all variable is:

dput(droplevels(zip_codes_all[1:10,]))
structure(list(ZIP_ALL = c(23115L, 22960L, 22578L, 23936L, 23308L, 
23875L, 23518L, 23139L, 23917L, 22967L)), row.names = c(NA, -10L
), .internal.selfref = <pointer: 0x0000000000201ef0>, class = 
c("data.table", 
"data.frame"))

Here is my updated code with actual variable names. This code worked, but I am wondering if there is a more efficient way of doing this where I don't have to add each range manually, since I have numerous ranges I need to build out.

#create your range counts by group
nn_data_1939_range <- nn_data[yearbuilt <= 1939 ,.N, by = ZIP][order(ZIP)]
nn_data_1949_range <- nn_data[yearbuilt >= 1940 & yearbuilt <= 1949 ,.N, by = ZIP][order(ZIP)]
nn_data_1959_range <- nn_data[yearbuilt >= 1950 & yearbuilt <= 1959 ,.N, by = ZIP][order(ZIP)]
nn_data_1969_range <- nn_data[yearbuilt >= 1960 & yearbuilt <= 1969 ,.N, by = ZIP][order(ZIP)]
nn_data_1979_range <- nn_data[yearbuilt >= 1970 & yearbuilt <= 1979 ,.N, by = ZIP][order(ZIP)]
nn_data_1989_range <- nn_data[yearbuilt >= 1980 & yearbuilt <= 1989 ,.N, by = ZIP][order(ZIP)]
nn_data_1999_range <- nn_data[yearbuilt >= 1990 & yearbuilt <= 1999 ,.N, by = ZIP][order(ZIP)]
nn_data_2004_range <- nn_data[yearbuilt >= 2000 & yearbuilt <= 2004 ,.N, by = ZIP][order(ZIP)]
nn_data_2005_range <- nn_data[yearbuilt >= 2005,.N, by = ZIP][order(ZIP)]


#Build your table by each range; adding each range to the previously created data.frame; join zip_all to zip
tbl_LessThan_1939 <- left_join(x = zip_codes_all, y = nn_data_1939_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1949 <- left_join(x = tbl_LessThan_1939, nn_data_1949_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1959 <- left_join(x = tbl_0_1949, nn_data_1959_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1969 <- left_join(x = tbl_0_1959, nn_data_1969_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1979 <- left_join(x = tbl_0_1969, nn_data_1979_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1989 <- left_join(x = tbl_0_1979, nn_data_1989_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_1999 <- left_join(x = tbl_0_1989, nn_data_1999_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_2004 <- left_join(x = tbl_0_1999, nn_data_2004_range, by = c("ZIP_ALL" = "ZIP"))
tbl_0_present <- left_join(x = tbl_0_2004, nn_data_2005_range, by = c("ZIP_ALL" = "ZIP"))

All right, my best guess is your data looks something like this (though probably much bigger):

library(data.table)
set.seed(47)
nn_data_sample = data.table(
  yearbuilt = rep(c(1938, 1942, 1951, 1963), each = 4),
  ZIP = sample(c(90210, 19145, 19146, 19147, 19148, 19149), size = 16, replace = TRUE)
)
nn_data_sample
 #    yearbuilt   ZIP
 # 1:      1938 19149
 # 2:      1938 19146
 # 3:      1938 19148
 # 4:      1938 19148
 # 5:      1942 19147
 # 6:      1942 19148
 # 7:      1942 19146
 # 8:      1942 19146
 # 9:      1951 19147

This is nicely formatted data in long format, which is easy to work with. You seem to want to (a) count rows by zip code and by the decade they were built (more or less, with a little more granularity recently), and then (b) convert the long data (with one zip code column and one time column) into a wide format, where the times are spread across many columns.

For (a), we will use the cut function to divide the years into the decade-like intervals you want, and then aggregate the rows by zip code and decade.

decade_data = nn_data_sample[, decade_built := cut(
  yearbuilt,
  breaks = c(0, seq(1939, 1999, by = 10), 2004, Inf),
  dig.lab = 4)  # print four-digit years in the labels instead of 1.94e+03
][, .(n = .N), by = .(decade_built, ZIP)]

decade_data
 #    decade_built   ZIP n
 # 1:     (0,1939] 19149 1
 # 2:     (0,1939] 19146 1
 # 3:     (0,1939] 19148 2
 # 4:  (1939,1949] 19147 1
 # 5:  (1939,1949] 19148 1
 # 6:  (1939,1949] 19146 2
 # 7:  (1949,1959] 19147 1
 # 8:  (1949,1959] 19149 1
 # ...
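As a quick check on the interval boundaries, here is a minimal base R sketch using the same breaks. cut makes intervals closed on the right by default, so 1939 falls in the first bin and 1940 in the second, which lines up with the `yearbuilt <= 1939` and `yearbuilt >= 1940 & yearbuilt <= 1949` filters in the question. The `years` vector is made up purely for illustration.

```r
# Same breaks as in the cut() call above
breaks <- c(0, seq(1939, 1999, by = 10), 2004, Inf)

# Illustrative years sitting right on the interval edges (not real data)
years <- c(1939, 1940, 1949, 1950, 2004, 2005)

# as.integer() on the resulting factor gives the bin number (1 = first interval)
bin <- as.integer(cut(years, breaks = breaks))

data.frame(years, bin)
```

One thing to keep in mind: with a lowest break of 0, a yearbuilt of 0 or a missing year would come back as NA rather than landing in the first bin.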

For a lot of use cases, this is a great format to work with: data.table makes it easy to do things "by group", so if you have more operations you want to do to each decade, this should be your starting point. (Since we used :=, the decade_built column became part of the original data, so you can look at it to verify that it worked right.)
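To give a rough idea of what "by group" work on the long format can look like, here is a small sketch. The `long_counts` table and its decade labels are hypothetical stand-ins shaped like decade_data, and the two summaries (totals per decade, each zip code's share) are only examples of the kind of operation you might want.

```r
library(data.table)

# Hypothetical long-format counts, shaped like decade_data above
long_counts <- data.table(
  decade_built = c("pre-1940", "pre-1940", "1940s", "1940s"),
  ZIP = c(19146L, 19148L, 19146L, 19147L),
  n = c(1L, 2L, 2L, 1L)
)

# Total number of buildings per decade (returns a new, smaller table)
decade_totals <- long_counts[, .(total = sum(n)), by = decade_built]

# Each zip code's share of its decade's total; := adds the column by reference
long_counts[, share := n / sum(n), by = decade_built]

decade_totals
long_counts
```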

But if you want to change to wide format, dcast does that for us:

dcast(decade_data, ZIP ~ decade_built, value.var = "n")
#      ZIP (0,1939] (1939,1949] (1949,1959] (1959,1969]
# 1: 19146        1           2          NA          NA
# 2: 19147       NA           1           1           2
# 3: 19148        2           1           1          NA
# 4: 19149        1          NA           1           1
# 5: 90210       NA          NA           1           1
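Note that the wide table only contains zip codes that actually occur in the data. If you also need a row for every zip code in zip_codes_all, one option is a left join back onto that table. The sketch below uses base merge with small made-up stand-ins for both tables; the column names `built_pre_1940` and `built_1940s` are hypothetical.

```r
# Hypothetical stand-ins: the full zip list and a wide table like the dcast() result
zip_codes_all <- data.frame(ZIP_ALL = c(19145L, 19146L, 19147L, 19148L))
wide_counts <- data.frame(
  ZIP = c(19146L, 19147L),
  built_pre_1940 = c(1L, NA),
  built_1940s = c(2L, 1L)
)

# all.x = TRUE keeps every row of zip_codes_all (a left join);
# zip codes with no match get NA in the count columns
full_table <- merge(zip_codes_all, wide_counts,
                    by.x = "ZIP_ALL", by.y = "ZIP", all.x = TRUE)

full_table
```

merge also has a method for data.table objects that accepts the same by.x, by.y and all.x arguments, so the same call should work on the tables from the question.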

If you want to edit the column names, you can either specify what you want from the start, using the labels argument of the cut function, or simply rename the columns at the end. Or do it in the middle, modifying the values of the decade_built column after it's created. Do it wherever feels easiest.
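As a sketch of the first option, labels takes one name per interval, so it needs exactly one fewer entry than breaks. The label strings below are invented for illustration; pick whatever column names you want to end up with.

```r
breaks <- c(0, seq(1939, 1999, by = 10), 2004, Inf)

# Hypothetical labels, one per interval (one fewer than the number of breaks)
range_labels <- c("pre_1940", "1940_1949", "1950_1959", "1960_1969",
                  "1970_1979", "1980_1989", "1990_1999", "2000_2004",
                  "2005_on")

# The four build years used in the sample data
years <- c(1938, 1942, 1951, 1963)

decade <- cut(years, breaks = breaks, labels = range_labels)
as.character(decade)
```

Because dcast uses the values of decade_built as column names, labels chosen this way would carry straight through to the wide table.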
