
Match different length data.tables in loop

Building on a previous question (R function to failback in a left_join?), I have 24 different data tables that each use an industry classification system called NAICS, and I want to find the best industry match in each table for a given list of industries.

The industry codes get less detailed as they get shorter, so if there isn't an exact match, I want a slightly shorter version of the target. For example, using classification code 311111 as the target:

  1. One table may have an exact match: 311111
  2. One table may have a match one level less detailed: 31111
  3. One table may only have a much less detailed match: 31

Current approach (see below for code): Loop through all of the tables, then loop through each code length (311111, 31111, 3111, 311, 31, 3) and try to find a match in that table.
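The fallback idea can be sketched in base R for a single target code. This is only an illustration: the `available` vector is a made-up stand-in for the codes found in one table, not data from the question.

```r
# Candidate codes for one target, from most to least detailed.
target <- "311111"
candidates <- substring(target, 1, nchar(target):1)
print(candidates)

# Hypothetical set of codes present in one table (illustrative values only).
available <- c("31-33", "311", "3111")

# The first candidate that exists in the table is the most detailed match.
best <- candidates[candidates %in% available][1]
print(best)
```

The nested loop in Step 4 does the same thing for every target at once, shortening only the codes that have not matched yet.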

My problem:

How do I adjust the code so that multiple instances of a match don't create an error (as in "Supplied 261022 items to be assigned to 360 items of column 'match'")? Some of the data is time series data, so the same industry code is listed with 100 or more observations. Some of the data is cross-sectional, so each industry code appears only once.
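A minimal sketch of where the length mismatch comes from, assuming two tiny hypothetical tables (`targets` and `series` are made-up names and values). By default a data.table join returns every matching row, so a code that repeats in the lookup table yields more rows than there are targets.

```r
library(data.table)

# Hypothetical miniature tables (values are illustrative only).
targets <- data.table(NAICS17 = c("311", "312"))
series  <- data.table(NAICS17 = rep("311", 3), value = c(1, 2, 3))

# Default join (mult = "all"): one output row per match, plus one NA row
# for each target without a match.
joined_all <- series[targets, on = .(NAICS17)]
print(nrow(targets))     # number of target rows
print(nrow(joined_all))  # larger, because "311" matched three times
```

Assigning a column from `joined_all` back onto `targets` with `:=` would then fail, because the number of supplied items no longer equals the number of target rows.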

Full code for context, but the question refers to Step 4:

library(data.table)

# Step 1: Load Table Data -------------------------------------------------
v_tablenames <- c("t_naics17index", "t_naics17def", "t_naics17cross", "t_naics17tree", 
                  "t_naics17isic4cross", "t_ios_2012", "t_iou_2012", "t_regdata6dig_2017", 
                  "t_brdis_2015", "t_mrkcon_2012", "t_matkind_2012", "t_ppiprice", 
                  "t_eximprice", "t_oes", "t_ces", "t_cps", "t_fed", "t_asm", "t_vps", 
                  "t_cbp", "t_exports", "t_imports", "t_expartner", "t_impartner")

for(tablename in v_tablenames){
  assign(tablename, readRDS(paste0("DataStore/", tablename, ".rds")))
}

# Step 2: Turn all of the tibbles into data.tables ------------------------
# Data wrangling done in the tidyverse; tibbles converted to data.tables
l_tables <- list(t_naics17index, t_naics17def, t_naics17cross, t_naics17tree, 
                 t_naics17isic4cross, t_ios_2012, t_iou_2012, t_regdata6dig_2017,
                 t_brdis_2015, t_mrkcon_2012, t_matkind_2012, t_ppiprice, 
                 t_eximprice, t_oes, t_ces, t_cps, t_fed, t_asm, t_vps, 
                 t_cbp, t_exports, t_imports, t_expartner, t_impartner)

lapply(l_tables, setDT)

# Step 3: Build Master Lookup Table ---------------------------------------
# Subset of classification codes I care about falls between 3----- and 4-----; pulled from t_naics17index, which has a complete list of codes
t_match <- unique(t_naics17index[NAICS17 >= "300000" & NAICS17 < "400000", c(1)])

# Step 4: Connect Data Tables ---------------------------------------------
code_len_count <- rev(seq_len(max(nchar(t_match$NAICS17))))

for (tablename in v_tablenames){
  t_match[, match := NA_character_]
  for (i in code_len_count){
    t_match[is.na(match), target := substr(NAICS17, 1, i)]
    t_match[is.na(match), match := get(tablename)[.SD, on=.(NAICS17 = target), mget("NAICS17")][]]
  }
  setnames(t_match, "match", paste0("m_", tablename))
}

Data examples:

# Table of target industry codes
t_match <- structure(list(NAICS17 = c("311111", "311119", "311211", "311212", 
"311213", "311221", "311224", "311225", "311230", "311313")), row.names = c(NA, 
-10L), class = "data.frame")

# NAICS17 column is unique:
t_naics17tree <- structure(list(NAICS17 = c("31-33", "311", "3111", "31111", "311111", 
"311119", "3112", "31121", "311211", "311212"), NAICS17Title = c("Manufacturing", 
"Food Manufacturing", "Animal Food Manufacturing", "Animal Food Manufacturing", 
"Dog and Cat Food Manufacturing", "Other Animal Food Manufacturing", 
"Grain and Oilseed Milling", "Flour Milling and Malt Manufacturing", 
"Flour Milling", "Rice Milling")), row.names = c(NA, 10L), class = "data.frame")

# NAICS17 column is NOT unique:
t_ppiprice <- structure(list(NAICS17 = c("311---", "311---", "311---", "311---", 
"311---", "311---", "311---", "311---", "311---", "311---"), 
    seriesID = c("PCU311---311---", "PCU311---311---", "PCU311---311---", 
    "PCU311---311---", "PCU311---311---", "PCU311---311---", 
    "PCU311---311---", "PCU311---311---", "PCU311---311---", 
    "PCU311---311---"), date = structure(c(17956, 17928, 17897, 
    17866, 17836, 17805, 17775, 17744, 17713, 17683), class = "Date"), 
    value = c(199.2, 198.9, 198.3, 197.9, 197.2, 197.4, 197.1, 
    197.7, 198.8, 200.2)), class = "data.frame", row.names = c(NA, 
-10L))

For posterity, I figured it out...

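# Note: paste0("t_", tablename) assumes v_tablenames stores names without the
# "t_" prefix. With the vector from Step 1, which already includes the prefix,
# use get(tablename) instead.
for (tablename in v_tablenames){
  t_match[, match := NA_character_]
  for (i in code_len_count){
    # Shorten only the codes that have not matched yet
    t_match[is.na(match), target := substr(NAICS17, 1, i)]
    # mult = "first" keeps one row per target even when the code repeats in the table
    t_match[is.na(match), match := get(paste0("t_", tablename))[.SD, on=.(NAICS17 = target), mult = "first", mget("x.NAICS17")][]]
  }
  setnames(t_match, "match", paste0("m_", tablename))
}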

Wrapping tablename in get() lets the loop use the same string both as the table object and as part of the new column name.
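As a small base R illustration of that point, the sketch below looks up objects by name inside a loop. The objects `t_alpha` and `t_beta` are hypothetical stand-ins for the loaded tables.

```r
# Two hypothetical objects standing in for the loaded tables.
t_alpha <- data.frame(NAICS17 = c("311", "3111"))
t_beta  <- data.frame(NAICS17 = "31")

for (tablename in c("t_alpha", "t_beta")) {
  tbl <- get(tablename)  # the object whose name is stored in `tablename`
  cat(tablename, "has", nrow(tbl), "rows\n")
}

# mget() fetches several objects at once and returns a named list,
# which keeps each table and its name together.
l_named <- mget(c("t_alpha", "t_beta"))
print(names(l_named))
```

A named list like `l_named` is one possible alternative to calling get() inside the loop; whether it suits the 24-table workflow is a design choice, not something the original post settles.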

Adding mult = "first" makes the join keep only the first match for each target row.
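A minimal sketch of the effect of mult = "first", assuming two tiny hypothetical tables (`targets` and `series` are made-up names and values, not the question's data):

```r
library(data.table)

# Hypothetical miniature tables (values are illustrative only).
targets <- data.table(NAICS17 = c("311", "312"))
series  <- data.table(NAICS17 = rep("311", 3), value = c(1, 2, 3))

# With mult = "first" the join returns exactly one row per target row.
joined_first <- series[targets, on = .(NAICS17), mult = "first"]
print(nrow(joined_first))

# x.NAICS17 is the code as found in `series`; it is NA where nothing matched,
# so the result has the same length as `targets` and can be assigned directly.
first_codes <- series[targets, on = .(NAICS17), mult = "first", x.NAICS17]
targets[, match := first_codes]
print(targets)
```

Because unmatched targets stay NA, the inner loop can retry them with a shorter code on the next pass.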

Thanks for the help @Cole!
