
Create new data frame rows based on a column from another data frame

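I have two data frames: in one (df A) the first column is a list of items, and in the other (df B) the first column contains items from that list, but in some cases a row holds several items. What I want to do is go through df B and create a new row for each item from df A that occurs in the first column of df B.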

DF A

dfA
  Index  X
1  1    alpha
2  2    beta
3  3    gamma
4  4    delta

DF B

dfB
  list    X  
1  1 4    alpha
2  3 2 1  beta
3  4 1 2  gamma
4  3      delta

Desired output

dfC
  Index   x
1  1     alpha
2  4     alpha
3  3     beta
4  2     beta
5  1     beta
6  4     gamma
7  1     gamma
8  2     gamma
9  3     delta

The actual data I am working with:

DF A

dput(head(allwines))
structure(list(Wine = c("Albariño", "Aligoté", "Amarone", "Arneis", 
"Asti Spumante", "Auslese"), Description = c("Spanish white wine grape that makes crisp, refreshing, and light-bodied wines.", 
"White wine grape grown in Burgundy making medium-bodied, crisp, dry wines with spicy character.", 
"From Italy’s Veneto Region a strong, dry, long- lived red, made from a blend of partially dried red grapes.", 
"A light-bodied dry wine the Piedmont Region of Italy", "From the Piedmont Region of Italy, A semidry sparkling wine produced from the Moscato di Canelli grape in the village of Asti", 
"German white wine from grapes that are very ripe and thus high in sugar"
)), .Names = c("Wine", "Description"), row.names = c(NA, 6L), class = "data.frame")

DF B

> dput(head(cheesePairing))
structure(list(Wine = c("Cabernet Sauvignon\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Pinot Noir\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Sauvignon Blanc\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Zinfandel", 
"Chianti\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Pinot Noir\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Sangiovese", 
"Chardonnay", "Bardolino\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Malbec\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Riesling\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Rioja\r\n                                \r\n                            \r\n                        \r\n                            \r\n                                \r\n                                    Sauvignon Blanc", 
"Tempranillo", "Asti Spumante"), Cheese = c("Abbaye De Belloc Cheese", 
"Ardrahan cheese", "Asadero cheese", "Asiago cheese", "Azeitao", 
"Baby Swiss Cheese"), Suggestions = c("Pair with apples,  sliced pears OR  a sampling of olives and thin sliced salami.  Pass around slices of baguette.", 
"Serve with a substantial wheat cracker and apples or grapes.", 
"Rajas (blistered chile strips) fresh corn tortillas", "Table water crackers, raw nuts (almond, walnuts)", 
"Nutty brown bread, grapes", "Server with dried fruits, whole grain, nutty breads, nuts"
)), .Names = c("Wine", "Cheese", "Suggestions"), row.names = c(NA, 
6L), class = "data.frame")

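Building on Curt's answer, I think I found a more efficient solution, assuming I have interpreted your goal correctly.

My commented code is below. You should be able to run it as-is and get the desired dfC. One thing to note: I am assuming (based on your actual data) that the delimiter separating the values in dfB$Index is "\r\n".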

# set up fake data
dfA<-data.frame(Index=c('1','2','3','4'), X=c('alpha','beta','gamma','delta'))
dfB<-data.frame(Index=c('1 \r\n 4','3 \r\n 2 \r\n 1','4 \r\n 1 \r\n 2','3'), X=c('alpha','beta','gamma','delta'))

dfA$Index<-as.character(dfA$Index)
dfA$X<-as.character(dfA$X)
dfB$Index<-as.character(dfB$Index)
dfB$X<-as.character(dfB$X)


dfB_index_parsed<-strsplit(x=dfB$Index,"\r\n") # split Index of dfB by delimiter "\r\n" and store in a list
names(dfB_index_parsed)<-dfB$X
dfB_split_num<-lapply(dfB_index_parsed, length) # find the number of splits per row of dfB and store in a list
dfB_split_num_vec<-do.call('c', dfB_split_num) # convert number of splits above from list to vector

g<-do.call('c',dfB_index_parsed) # store all split values in a single vector
g<-trimws(g) # remove leading/trailing whitespace left over after the split (inner spaces, e.g. in "Pinot Noir", are kept)
names(g)<-rep(names(dfB_split_num_vec), dfB_split_num_vec ) # associate each split Index from dfB with X from dfB
g<-g[g %in% dfA$Index] # check which dfB$Index are in dfA$Index

dfC<-data.frame(Index=g, X=names(g)) # construct data.frame
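The same steps can be written more compactly in base R. The following is only a sketch under the same assumptions as above (toy data, "\r\n" as the delimiter); the helper name `pieces` is made up for illustration, and `lengths()` and `trimws()` require R 3.2.0 or later.

```r
# Toy data mirroring the example above; keep the columns as character
dfA <- data.frame(Index = c("1", "2", "3", "4"),
                  X = c("alpha", "beta", "gamma", "delta"),
                  stringsAsFactors = FALSE)
dfB <- data.frame(Index = c("1 \r\n 4", "3 \r\n 2 \r\n 1", "4 \r\n 1 \r\n 2", "3"),
                  X = c("alpha", "beta", "gamma", "delta"),
                  stringsAsFactors = FALSE)

# Split each row on the literal "\r\n", then trim whitespace around every piece
pieces <- lapply(strsplit(dfB$Index, "\r\n", fixed = TRUE), trimws)

# One output row per piece; repeat each X once for every piece in its row
dfC <- data.frame(Index = unlist(pieces),
                  X = rep(dfB$X, lengths(pieces)),
                  stringsAsFactors = FALSE)

# Keep only entries that also appear in dfA$Index, then renumber the rows
dfC <- dfC[dfC$Index %in% dfA$Index, ]
rownames(dfC) <- NULL
dfC
```

On strings shaped like the real `cheesePairing$Wine` values, the split also produces whitespace-only pieces. These become empty strings after `trimws()` and are dropped by the `%in%` filter, as long as the reference column contains no empty strings.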

First, let me give a practical answer to your question. I doubt my answer is very efficient, but it does work.

# construct toy data
dfA <- data.frame(index = 1:4, X = letters[1:4])

dfB <- data.frame(X = letters[1:4])
dfB$list_elements <- list(c(1, 4), c(3, 2, 1), c(4, 1, 2), c(3))

# define function that provides solution

unlist_merge_df <- function(listed_df, reference_df){
    # reference_df assumed to have columns "X" and "index"
    # listed_df assumed to have column "list_elements"
    df_out <- data.frame(index = c(), X = c())
    my_list <- listed_df$list_elements
    for(idx in seq_along(my_list)){
        df_out <- rbind(df_out, 
                        data.frame(index = my_list[[idx]], 
                                   X = listed_df[idx, 'X'])
                        )
    }
    return(df_out)
}

# call the function
dfC <- unlist_merge_df(dfB, dfA)

# show output in human and R-parseable formats
dfC

dput(dfC)

The output is:

index   X
1   1   a
2   4   a
3   3   b
4   2   b
5   1   b
6   4   c
7   1   c
8   2   c
9   3   d

structure(list(index = c(1, 4, 3, 2, 1, 4, 1, 2, 3), X = structure(c(1L, 
1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L), .Label = c("a", "b", "c", "d"
), class = "factor")), .Names = c("index", "X"), row.names = c(NA, 
9L), class = "data.frame")
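If the `rbind()` loop turns out to be slow on larger inputs, the same result can probably be built in one step by flattening the list column and repeating each `X` value to match. This is a sketch on the same toy data, not a drop-in replacement for the function above; unlike the output shown, it keeps `X` as character rather than factor.

```r
# Same toy data as above: a data frame with a list column
dfB <- data.frame(X = letters[1:4], stringsAsFactors = FALSE)
dfB$list_elements <- list(c(1, 4), c(3, 2, 1), c(4, 1, 2), c(3))

# unlist() flattens the list column into one vector;
# rep() repeats each X once per element of its list entry
dfC <- data.frame(index = unlist(dfB$list_elements),
                  X = rep(dfB$X, lengths(dfB$list_elements)),
                  stringsAsFactors = FALSE)
dfC
```

This avoids growing a data frame inside a loop, which copies the accumulated rows on every iteration.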

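Second, let me say that the situation you are in is not ideal. If you can avoid it, you probably should. Either don't use data frames at all and work only with lists, or (if you can) avoid constructing the list-column data frame in the first place and build the desired output directly.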
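As one possible illustration of the list-only route, the data could be held as a plain named list and flattened with base R's `stack()` when a data frame is finally needed. The name `index_by_x` is hypothetical, and the values are taken from the question's toy example.

```r
# Hypothetical list-only representation: names are the X values,
# elements are the index values that belong to each X
index_by_x <- list(alpha = c(1, 4),
                   beta  = c(3, 2, 1),
                   gamma = c(4, 1, 2),
                   delta = 3)

# stack() flattens a named list into a two-column data frame:
# "values" holds the elements, "ind" holds the name each one came from
dfC <- stack(index_by_x)
names(dfC) <- c("Index", "X")
dfC
```

Note that `stack()` returns the name column as a factor; wrap it in `as.character()` if plain strings are needed.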
