How to combine multiple dataframes by selecting specific columns when the column names are different

I have seven data.frames in a list, my_data. Three of them have 16 columns and the other four have 22. Each data.frame has five columns that I need to bind into a single data.frame (all_data). The problem is that I can't simply select the columns I want to keep by name, because the names differ (though they are similar) from one data.frame to the next, and the columns are in different orders. For example, one data.frame has a column titled "X2012.NAICS.code" and another has a column titled "X2007.NAICS.codes.and.NAICS.based.rollup.code". These columns contain the same information (NAICS codes) and need to be bound together. The approach I am trying to use is this:

header_cols <- c( "Geographic.area.name", "Year", "**3rd column**", "**4th column**", "**5th column**" )

all_data <- map_dfr( my_data[grepl( "^ASM", names( my_data ))], ~ 
                               .x %>%
                               select( header_cols ))

Here the 3rd, 4th, and 5th columns are the three others I need (Year and Geographic.area.name are named the same in all 7 data.frames).

All data.frame names begin with "ASM", which is what the ^ASM is for.

UPDATE: My current strategy is this:

# Make object for raw column name strings (all columns of interest contain these strings in all dataframes)
name_pattern <- c( "Geographic.area.name", "Geographic Area Name")
VoS_pattern <- c( "Total.value.of.shipment", "value of shipments")
NAICS_pattern <- c( "NAICS.code", "NAICS code")
industry_pattern <- c("Meaning.of.", "Meaning of NAICS code")
relative_pattern <- c("Relative.standard.error", "Relative standard error")
header_cols <- c( "Year" )

# Part 3: binding the data into one dataframe based on the columns of interest, uniting columns that contain the same information category
# Bind the columns of interest into one dataframe
combined_data <- map_dfr( my_data[grepl( "^ASM", names( my_data ))], ~
                            .x %>%
                            select( header_cols, contains( paste0( name_pattern ) ),
                                    contains( paste0( VoS_pattern ) ),
                                    contains( paste0( NAICS_pattern ) ),
                                    contains( paste0( industry_pattern ) ),
                                    -contains ( paste0( relative_pattern) ) ))

which works perfectly. Unfortunately, I can't use the map_dfr function (or any function specific to purrr), so I am looking for a way to do this with rbind.

One option is to standardize the column names with rename_at after selecting the columns.

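library(dplyr)
library(stringr)
library(purrr)

# For each ASM data frame: keep the two shared columns plus the NAICS column,
# then strip the leading "X<year>." prefix from the NAICS column name.
# rename_at() needs its selection wrapped in vars().
map_dfr(my_data[grep('^ASM', names(my_data))], ~ 
     .x %>%
       select(all_of(header_cols[1:2]), 
            matches("NAICS\\.(code|based\\.rollup\\.code)")) %>%
       rename_at(vars(matches("NAICS")), ~ str_remove(., "^X\\d{4}\\.")))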
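Stripping only the year prefix can still leave the two NAICS columns with different names. A minimal sketch of one alternative, assuming each data frame has exactly one column matching the pattern: rename while selecting, so every match gets the same fixed name. The toy data frames and the helper pick_naics are made up for illustration and are not part of the original question.

```r
library(dplyr)

# Toy stand-ins for two of the ASM data frames (values are made up)
df_2012 <- data.frame(Year = 2012, X2012.NAICS.code = "311",
                      stringsAsFactors = FALSE)
df_2007 <- data.frame(Year = 2007,
                      X2007.NAICS.codes.and.NAICS.based.rollup.code = "312",
                      stringsAsFactors = FALSE)

# Hypothetical helper: keep Year plus the single NAICS column,
# renaming that column to NAICS_code as part of the selection
pick_naics <- function(df) {
  df %>% select(Year, NAICS_code = matches("NAICS\\.code"))
}

naics_all <- bind_rows(pick_naics(df_2012), pick_naics(df_2007))
naics_all
```

If a pattern matched more than one column, select() would number the new names (NAICS_code1, NAICS_code2), so the pattern has to be specific enough to hit one column per data frame.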

Or use base R's lapply and rbind in place of map_dfr (dplyr is still used for the column selection):

# Standardized output names, in the same order as the select() call below
v1 <- c("Year", "state_name", "VoS_thousUSD", "NAICS_code", "industry")

out <- lapply(my_data[grep('^ASM', names(my_data))],
       function(x) x %>%
           # convert factors to character so rbind does not mix factor levels
           mutate_if(is.factor, as.character) %>%
           select( all_of( header_cols ), contains( paste0( name_pattern ) ),
                   contains( paste0( VoS_pattern ) ),
                   contains( paste0( NAICS_pattern ) ),
                   contains( paste0( industry_pattern ) ),
                   -contains ( paste0( relative_pattern) ) ) %>% 
           # base R setNames() rather than purrr's set_names()
           setNames(v1))

combined_data <- do.call(rbind, out)
row.names(combined_data) <- NULL
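The same idea can be written without dplyr or purrr at all. The following is only a sketch on made-up toy data: the list contents, the regular expressions in patterns, and the helper pick_columns are assumptions for illustration, and it covers three of the five columns to stay short.

```r
# Toy stand-ins for the list of data frames (names and values are made up)
my_data <- list(
  ASM_2012 = data.frame(Year = 2012,
                        Geographic.area.name = "Ohio",
                        X2012.NAICS.code = "311",
                        stringsAsFactors = FALSE),
  ASM_2007 = data.frame(X2007.NAICS.codes.and.NAICS.based.rollup.code = "312",
                        Year = 2007,
                        Geographic.area.name = "Iowa",
                        stringsAsFactors = FALSE),
  other    = data.frame(Year = 1999)
)

# Output name -> regular expression that identifies the source column
patterns <- c(Year       = "^Year$",
              state_name = "Geographic\\.area\\.name",
              NAICS_code = "NAICS\\.code")

# Hypothetical helper: pull one column per pattern and give it the output name
pick_columns <- function(df, patterns) {
  idx <- vapply(patterns, function(p) {
    hit <- grep(p, names(df))
    stopifnot(length(hit) == 1)  # fail loudly on zero or ambiguous matches
    hit
  }, integer(1))
  setNames(df[idx], names(patterns))
}

# Keep only the ASM data frames, standardize each one, then stack them
asm <- my_data[grepl("^ASM", names(my_data))]
combined <- do.call(rbind, lapply(asm, pick_columns, patterns = patterns))
row.names(combined) <- NULL
combined
```

Because every data frame leaves pick_columns with identical column names in the same order, rbind can stack them regardless of how the source columns were named or ordered.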


# Make VoS numeric
combined_data_new <- combined_data %>%
        dplyr::mutate( VoS_thousUSD = as.numeric( VoS_thousUSD ) )
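The mutate_if(is.factor, as.character) step earlier also matters for this conversion: calling as.numeric directly on a factor returns the internal level codes rather than the values. A small illustration with made-up numbers:

```r
# Shipment values stored as a factor (made-up numbers)
vos <- factor(c("1500", "250", "1500"))

codes  <- as.numeric(vos)                # level codes: 1 2 1
values <- as.numeric(as.character(vos))  # actual values: 1500 250 1500

codes
values
```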
