Finding the correlation between two datasets in R

Question

Updated structure of datasets 2 and 1: apologies for the sudden update. I have two datasets. The structure of my first dataset (as shown by print(matr1) in R) is:

        month_year  income
 [1,]  "Jan 2000"  "30000"
 [2,]  "Feb 2000"  "12364"
 [3,]  "Mar 2000"  "37485"
 [4,]  "Apr 2000"  "2000"
 [5,]  "Jun 2000"  "7573"
          .     .      .
          .     .      .

So the first dataset has one income value for each month of each year.

The structure of my second dataset (as shown by print(matr2) in R) is:

     month_year     value
 [1,] "Jan 2000" "84737476"
 [2,] "Jan 2000" "39450334"
 [3,] "Jan 2000" "48384943"
 [4,] "Feb 2000" "12345678"
 [5,] "Feb 2000" "49595340"

          .     .      .
          .     .      .

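So in the second dataset I have n values for each month of each year (for example 100, although n is not always constant).

Both datasets have monthly values over many consecutive years (for example, for every month of 2000, 2001, and so on). I now want to find the correlation between the two datasets, but month by month rather than overall. When I run the R command cor(as.numeric(matr1[,"income"]),as.numeric(matr2[,"value"])) I get the overall correlation, whereas I want one correlation per month. I would like a result like this: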

                  Jan | Feb | Mar | Apr | May | .....
Correlation        x  |  y  |  z  |  p  |  q  | .....

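My question is:

How do I get a correlation value for each month instead of the overall correlation?

Note: I was not sure whether to post this question here or on Cross Validated. I have already posted one question about this dataset, which concerned only an error that occurred while computing the correlation, and it was migrated from there to here. So please forgive me if I have posted this in the wrong place.

UPDATE1: Following some suggestions, I have revised this post to show the correct dimensions. First, the datasets are currently in matrix format, hence the quotation marks. I could convert them to a data.frame as some comments suggested, but for now I have been computing the correlation by converting the columns with as.numeric.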
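
The conversion mentioned in the comments could look roughly like the sketch below. The three-row `matr1` here is a made-up stand-in for the real matrix, and `df1` is a hypothetical name; the point is only that a character matrix becomes a data frame whose `income` column is numeric once, rather than being converted on every call.

```r
# Hypothetical stand-in for the asker's character matrix `matr1`
matr1 <- cbind(month_year = c("Jan 2000", "Feb 2000", "Mar 2000"),
               income     = c("30000", "12364", "37485"))

# Convert to a data frame, keeping strings as plain character columns
df1 <- as.data.frame(matr1, stringsAsFactors = FALSE)

# Convert the measurement column to numeric once, up front
df1$income <- as.numeric(df1$income)

str(df1)
```

The same two steps would apply to `matr2` and its `value` column.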

Answer 1

Perhaps you could try:

dat1 <- structure(list(year = c(2000L, 2000L, 2000L, 2000L, 2000L, 2001L, 
2001L, 2001L, 2001L, 2001L), month = c(1L, 2L, 3L, 4L, 5L, 1L, 
2L, 3L, 4L, 5L), income = c(30000L, 12364L, 37485L, 2000L, 7573L, 
25000L, 14364L, 38485L, 4000L, 7873L)), .Names = c("year", "month", 
"income"), class = "data.frame", row.names = c(NA, -10L))

dat2 <- structure(list(month_year = c("Jan 2000", "Feb 2000", "Mar 2000", 
"Apr 2000", "May 2000", "Jan 2001", "Feb 2001", "Mar 2001", "Apr 2001", 
"May 2001"), value = c(84737476L, 39450334L, 48384943L, 12345678L, 
49595340L, 84337476L, 34450334L, 48984943L, 124545678L, 49525340L
)), .Names = c("month_year", "value"), class = "data.frame", row.names = c(NA, 
-10L))



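 # Build a "Jan 2000"-style key in dat1, then keep only the month abbreviation
 dat1$month_year <- paste(month.abb[dat1$month], dat1$year)
 dat1$month <- gsub(" \\d+", "", dat1$month_year)
 dat2$month <- gsub(" \\d+", "", dat2$month_year)
 # Number the observations within each month (1 = first occurrence, 2 = second, ...)
 dat1$indx <- with(dat1, ave(month, month, FUN = seq_along))
 dat2$indx <- with(dat2, ave(month, month, FUN = seq_along))
 # Keep month, the measurement, and indx; give both measurements the same name
 dat1 <- dat1[, c(2, 3, 5)]
 dat2 <- dat2[, c(3, 2, 4)]
 colnames(dat2)[2] <- "income"

 library(reshape2)

 # Reshape to wide format: one column per month, one row per indx
 dat2C <- dcast(dat2, indx ~ month, value.var = "income")
 dat1C <- dcast(dat1, indx ~ month, value.var = "income")
 m1 <- as.matrix(dat1C[, -1])
 m2 <- as.matrix(dat2C[, -1])
 cor(m1, m2)
 # The diagonal pairs each month in m1 with the same month in m2
 diag(cor(m1, m2))
 # Apr Feb Jan Mar May
 #   1  -1   1   1  -1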

Alternatively, if you can merge the two datasets, you can do this with data.table. Using the dput() data above:

 library(data.table)
 dat1$month_year <- paste(month.abb[dat1$month], dat1$year)
 dat1 <- dat1[,c(4,3)]
 setDT(dat1)
 setDT(dat2)
 setkey(dat2, month_year)

 dat2[dat1, income := i.income]
 dat2[,month:= gsub(" \\d+", "", month_year)][,cor(value, income), by=month] 
 #    month V1
 #1:   Apr  1
 #2:   Feb -1
 #3:   Jan  1
 #4:   Mar  1
 #5:   May -1
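
The values of exactly 1 and -1 above are a consequence of the example data rather than of the method: each month appears in only two years, so every correlation is computed from just two pairs, and any two distinct points lie exactly on a line. A minimal illustration with made-up numbers (`r_up` and `r_down` are hypothetical names):

```r
# Two pairs of observations always give a correlation of +1 or -1
r_up   <- cor(c(1, 2), c(10, 30))  # second variable rises with the first
r_down <- cor(c(1, 2), c(30, 10))  # second variable falls as the first rises

r_up
r_down
```

With real data that has many observations per month, the per-month correlations would generally fall strictly between these extremes.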

Update

dat1 <- structure(list(month_year = structure(c(5L, 3L, 8L, 1L, 7L, 6L, 
4L, 9L, 2L), .Label = c("Apr 2000", "Apr 2001", "Feb 2000", "Feb 2001", 
"Jan 2000", "Jan 2001", "Jun 2000", "Mar 2000", "Mar 2001"), class = "factor"), 
income = c(30000, 12364, 37485, 2000, 7573, 42000, 15764, 
38465, 5000)), .Names = c("month_year", "income"), row.names = c(NA, 
-9L), class = "data.frame")


 dat2 <-  structure(list(month_year = structure(c(5L, 5L, 5L, 3L, 3L, 7L, 
 7L, 7L, 1L, 1L, 6L, 6L, 4L, 4L, 8L, 8L, 2L, 2L, 2L, 2L), .Label = c("Apr 2000", 
 "Apr 2001", "Feb 2000", "Feb 2001", "Jan 2000", "Jan 2001", "Mar 2000", 
 "Mar 2001"), class = "factor"), value = c(84737476, 39450334, 
 48384973, 12345678, 49595340, 4534353, 43353325, 84333535, 35343232, 
 4334353, 3434353, 5355322, 5223345, 4523535, 345353, 32235, 423553, 
 233553, 423535, 884455)), .Names = c("month_year", "value"), row.names = c(NA, 
 -20L), class = "data.frame")


 datN <- merge(dat1, dat2, all = TRUE)
 library(data.table)
 DT <- data.table(datN)
 DT[, month:= gsub(" \\d+", "", month_year)][,cor(value, income),by=month]
 #   month         V1
 #1:   Apr -0.7136049
 #2:   Feb -0.7037676
 #3:   Jan -0.8637808
 #4:   Jun         NA
 #5:   Mar -0.6484684
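
For readers who prefer to avoid extra packages, the same merge-then-group idea can be sketched in base R. The data below is made up, in the same shape as the question's two datasets, and the names `income_df`, `value_df`, `merged`, and `by_month` are hypothetical. `merge()` repeats each month's single income next to every value recorded for that month, and `split()` plus `sapply()` then computes one correlation per calendar month.

```r
# Made-up data: one income per month_year, several values per month_year
income_df <- data.frame(
  month_year = c("Jan 2000", "Feb 2000", "Jan 2001", "Feb 2001"),
  income     = c(30000, 12364, 42000, 15764)
)
value_df <- data.frame(
  month_year = c("Jan 2000", "Jan 2000", "Feb 2000", "Feb 2000",
                 "Jan 2001", "Jan 2001", "Feb 2001", "Feb 2001"),
  value      = c(10, 12, 7, 9, 20, 22, 3, 5)
)

# Attach the monthly income to every value observed in that month
merged <- merge(income_df, value_df, by = "month_year")

# Drop the year so that all Januaries fall into one group, and so on
merged$month <- sub(" \\d+$", "", merged$month_year)

# One correlation per calendar month
by_month <- sapply(split(merged, merged$month),
                   function(d) cor(d$value, d$income))
by_month
```

Note that within a single month_year the income is constant, so the correlation only becomes meaningful once a month group spans several years.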

Answer 2

Put your data into a data frame with month, value, and income columns. For example:

d = data.frame(month=rep(1:12,5),value=runif(60,10000000,60000000), income=runif(60,5000,40000))

> head(d)
  month    value   income
1     1 58348424 34478.63
2     2 59512513 16179.46
3     3 21844994 20961.56
4     4 25843593 38502.16
5     5 24805863 12397.32
6     6 24200966 24110.27

Then it is as simple as grouping by month with dplyr and summarizing:

> library(dplyr)
> d %>% group_by(month) %>% summarize(cor = cor(value, income))
Source: local data frame [12 x 2]

   month         cor
1      1  0.17774478
2      2 -0.61693145
3      3 -0.05692027
4      4 -0.44966542
5      5 -0.30049386
6      6  0.09447414
7      7  0.67567298
8      8  0.14363810
9      9 -0.71899361
10    10  0.20807679
11    11 -0.42560100
12    12  0.23584150

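Getting the month number from a date string is covered in many other places, but here I will use the lubridate package. For the month/year strings in the second dataset, for example: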

require(lubridate)
month(dmy(paste("01",dat2$month_year)))

This returns the month number. Note the trick of prepending "01" to make it a valid date.
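
If adding a package just for this step is undesirable, a base R alternative is sketched below. It assumes every string starts with a three-letter English month abbreviation, as in the question's data, and looks that abbreviation up in R's built-in `month.abb` constant. The vector `month_year` and the names `month_num` and `month_lab` are made up for illustration.

```r
# Made-up strings in the same "Mon YYYY" format as the question's data
month_year <- c("Jan 2000", "Jan 2000", "Feb 2000", "Mar 2001")

# Look up the three-letter abbreviation in the built-in month.abb vector
month_num <- match(substr(month_year, 1, 3), month.abb)
month_num

# A factor with calendar-ordered levels keeps grouped results in
# Jan, Feb, Mar, ... order rather than alphabetical order
month_lab <- factor(month.abb[month_num], levels = month.abb)
month_lab
```

Grouping by `month_lab` instead of a plain character column would list the per-month correlations in calendar order.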

Answer 1 | 2 | 2014-08-10 08:44:07
Answer 2 | 0 | 2014-08-10 09:21:24
