
Merging two data.frames with numbers and characters in the same column in R

I have two data frames. One is a library of words, each with a corresponding number. The other holds questions; there are 3 in this example. My original data has 2 million rows in the library and 1 million questions, which is why I'm using a for loop over the columns. My question is why the first two questions, which contain numbers, get sorted by the merge command, whereas the question with only words does not. Any idea why this could be? I have reproducible data below. It's probably a lot of code, but it will make more sense once you run it and look at the data frames. It should all work without any adjustment. The data frames are: df = questions, df2 = library, output = what I want the output to look like, and DF = the actual output.

words1<-c(1,2,3,"How","did","Quebec")
words2<-c(.24,.25,.66,"Why","does","volicty")
words3<-c("How","do","I","clean","a","car")
library<-c(1,3,.25,.66,"How","did","does","do","I","wash","a","Quebec","car","is")
embedding1<-c(.48,.68,.52,.39,.5,.6,.7,.8,.9,.3,.46,.48,.53,.42)
df <- data.frame(words1,words2,words3)
names(df)<-c("words1","words2","words3")


words1<-c(.48,NA,.68,.5,.6,.48)
words2<-c(NA,.52,.39,NA,.7,NA)
words3<-c(.5,.8,.9,NA,.46,.53)
output<-data.frame(words1,words2,words3)
#--------Upload 2nd dataset-------#
df2 <- data.frame(library,embedding1)
names(df2)<-c("library","embedding1")

#-----Find columns--------#
l=ncol(df)
l
mynames<-colnames(df)
head(mynames)


#------Combine and match library to training data------#
List = list()
for(name in mynames){
  df1<-df[,name]
  df1<-as.data.frame(df1)
  x_train2<-merge(x= df1, y = df2, 
                  by.x = "df1", by.y = 'library',all.x=T, sort=F)
  new_x_train2<-x_train2[duplicated(x_train2[,2]),]
  x_train2<-x_train2[,-1]
  x_train2<-as.data.frame(x_train2)
  names(x_train2) <- name
  List[[length(List)+1]] = x_train2
}
list<-List

DF  <-  as.data.frame(matrix(unlist(list), nrow=length(unlist(list[1]))))

You could do this with tidyverse. Doing it this way leaves more NAs in your columns, but preserves the order, and I think it essentially does what you're looking for:

library(tidyverse)
library(reshape2)

df %>% melt(id = NULL) %>%
  inner_join(df2, by = c("value" = "library")) %>%
  spread(variable, embedding1) %>%
  select(-value)

Resulting in:

   words1 words2 words3
1      NA   0.52     NA
2      NA   0.39     NA
3    0.48     NA     NA
4    0.68     NA     NA
5      NA     NA   0.46
6      NA     NA   0.53
7    0.60     NA     NA
8      NA     NA   0.80
9      NA   0.70     NA
10   0.50     NA   0.50
11     NA     NA   0.90
12   0.48     NA     NA

The main reason is that merge does not preserve the row order of its inputs, even with sort = FALSE. See ?merge:

The rows are by default lexicographically sorted on the common columns, but for sort = FALSE are in an unspecified order.

If you walk through your loop step by step, you'll see it in action. Use dplyr::left_join instead, which preserves row order.

df1 <- df[, "words1"]
df1 <- as.data.frame(df1)

> df1
     df1
1      1
2      2
3      3
4    How
5    did
6 Quebec

merge(x= df1, y = df2, 
      by.x = "df1", by.y = 'library', all.x=T, sort=F)

     df1 embedding1
1      1       0.48
2      3       0.68
3    How       0.50
4    did       0.60
5 Quebec       0.48
6      2         NA
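If merge has to stay in the loop, one possible workaround is to carry an explicit row index through the merge and sort on it afterwards. This is only a minimal sketch on the words1 column: the row_id column is a hypothetical helper name that is not part of the original code, and stringsAsFactors = FALSE is set so the result is the same on older R versions.

```r
# Rebuild the words1 column and the library from the question.
df1 <- data.frame(df1 = c(1, 2, 3, "How", "did", "Quebec"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(
  library = c(1, 3, .25, .66, "How", "did", "does", "do", "I",
              "wash", "a", "Quebec", "car", "is"),
  embedding1 = c(.48, .68, .52, .39, .5, .6, .7, .8, .9, .3,
                 .46, .48, .53, .42),
  stringsAsFactors = FALSE
)

# Remember each row's original position (row_id is a hypothetical helper).
df1$row_id <- seq_len(nrow(df1))

merged <- merge(x = df1, y = df2,
                by.x = "df1", by.y = "library",
                all.x = TRUE, sort = FALSE)

# Put the rows back in their original order and drop the helper column.
merged <- merged[order(merged$row_id), ]
merged$row_id <- NULL
rownames(merged) <- NULL
print(merged)
```

The unmatched "2" stays in second position with an NA embedding instead of being moved to the end.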

dplyr::left_join(x = df1, y = df2, by = c("df1" = "library"))

     df1 embedding1
1      1       0.48
2      2         NA
3      3       0.68
4    How       0.50
5    did       0.60
6 Quebec       0.48
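The same order-preserving lookup can be applied to every column at once. The sketch below swaps in base R's match in place of a join, so it is an alternative to the left_join call rather than the same method. It assumes each word appears at most once in the library, because match only returns the first hit.

```r
# Rebuild the question's data (strings kept as character, not factor).
df <- data.frame(
  words1 = c(1, 2, 3, "How", "did", "Quebec"),
  words2 = c(.24, .25, .66, "Why", "does", "volicty"),
  words3 = c("How", "do", "I", "clean", "a", "car"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  library = c(1, 3, .25, .66, "How", "did", "does", "do", "I",
              "wash", "a", "Quebec", "car", "is"),
  embedding1 = c(.48, .68, .52, .39, .5, .6, .7, .8, .9, .3,
                 .46, .48, .53, .42),
  stringsAsFactors = FALSE
)

# For each column, find the position of every word in the library
# (NA when the word is absent) and pull the embedding at that position.
# Assumption: library entries are unique, since match() uses the first hit.
lookup <- lapply(df, function(col) df2$embedding1[match(col, df2$library)])

# Columns keep the names and row order of df.
DF <- as.data.frame(lookup)
print(DF)
```

With this data, DF should have the same values as the output data frame defined in the question, with no loop and no reordering step.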
