
Spearman correlation and splitting 1 variable

   Year.Sales.Advertise.Employees
1                1985 1.05 162 32
2                1986 1.26 285 47
3                1987 1.47 540 23
4                1988 2.16 261 68
5                1989 1.95 360 32
6                 1990 2.4 690 17
7                1991 2.37 495 58
8                1992 3.15 948 75
9                1993 3.57 720 98
10              1994 4.41 1.14 43
11              1995 4.5 1.395 76
12              1996 5.61 1.56 89
13             1997 5.19 1.38 108
14              1998 5.67 1.26 76
15              1999 5.16 1.71 65
16              2000 6.84 1.86 93

I want to find the Spearman correlation between Sales and Advertise, and I've been stuck for 3 hours. Please help. I think I have to separate the single variable into separate variables, but I'm struggling.

We can use strsplit to split the data, i.e.

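new_df <- setNames(data.frame(do.call(rbind, strsplit(df2$Year.Sales.Advertise.Employees, ' ')),
                              stringsAsFactors = FALSE),
                   strsplit(names(df2), '.', fixed = TRUE)[[1]])
new_df[] <- lapply(new_df, as.numeric)  # the split pieces are character; cor needs numeric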

which gives,

   Year Sales Advertise Employees
1  1985  1.05       162        32
2  1986  1.26       285        47
3  1987  1.47       540        23
4  1988  2.16       261        68
5  1989  1.95       360        32
6  1990   2.4       690        17
7  1991  2.37       495        58
8  1992  3.15       948        75
9  1993  3.57       720        98
10 1994  4.41      1.14        43
11 1995   4.5     1.395        76
12 1996  5.61      1.56        89
13 1997  5.19      1.38       108
14 1998  5.67      1.26        76
15 1999  5.16      1.71        65
16 2000  6.84      1.86        93

You can then use cor, e.g. cor(new_df$Advertise, new_df$Employees), to find the correlation between any two columns you want.

NOTE1: Make sure that your initial column is a character vector (not a factor).
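If the column was read in as a factor (the default for data.frame() and read.csv() before R 4.0), strsplit will reject it as a non-character argument. A minimal sketch of the check and conversion, using a two-row stand-in for df2 (the object name toy is made up for illustration):

```r
# toy stand-in for df2, deliberately built with a factor column
toy <- data.frame(
  Year.Sales.Advertise.Employees = c("1985 1.05 162 32", "1986 1.26 285 47"),
  stringsAsFactors = TRUE
)
is.factor(toy$Year.Sales.Advertise.Employees)  # check the column type first

# strsplit() only accepts character input, so convert before splitting
toy$Year.Sales.Advertise.Employees <- as.character(toy$Year.Sales.Advertise.Employees)
parts <- strsplit(toy$Year.Sales.Advertise.Employees, " ")
lengths(parts)  # number of pieces per row
```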

NOTE2: By default, the cor function calculates the Pearson correlation. For Spearman, add the argument method = "spearman", i.e. cor(..., method = "spearman"), as mentioned by @Base_R_Best_R.
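To make the difference concrete, here is a small sketch with the Sales and Advertise values from the question typed in as plain numeric vectors (the vector names are only for illustration). It also shows that Spearman is the Pearson correlation computed on ranks:

```r
# Sales and Advertise values copied from the question
sales <- c(1.05, 1.26, 1.47, 2.16, 1.95, 2.4, 2.37, 3.15,
           3.57, 4.41, 4.5, 5.61, 5.19, 5.67, 5.16, 6.84)
advertise <- c(162, 285, 540, 261, 360, 690, 495, 948,
               720, 1.14, 1.395, 1.56, 1.38, 1.26, 1.71, 1.86)

pearson  <- cor(sales, advertise)                       # default method
spearman <- cor(sales, advertise, method = "spearman")  # rank-based

# Spearman is Pearson applied to the ranks of each variable
same <- all.equal(spearman, cor(rank(sales), rank(advertise)))
```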

DATA

dput(df2)
structure(list(Year.Sales.Advertise.Employees = c("1985 1.05 162 32", 
"1986 1.26 285 47", "1987 1.47 540 23", "1988 2.16 261 68", "1989 1.95 360 32", 
"1990 2.4 690 17", "1991 2.37 495 58", "1992 3.15 948 75", "1993 3.57 720 98", 
"1994 4.41 1.14 43", "1995 4.5 1.395 76", "1996 5.61 1.56 89", 
"1997 5.19 1.38 108", "1998 5.67 1.26 76", "1999 5.16 1.71 65", 
"2000 6.84 1.86 93")), class = "data.frame", row.names = c(NA, 
-16L))

Not sure whether this is what you are looking for, but you could try something like the following:

# split strings into separate columns
df <- `names<-`(data.frame(t(apply(df, 1, function(x) as.numeric(unlist(strsplit(x,split = " ")))))),
          unlist(strsplit(names(df),split = "\\.")))

# calculate the correlation coefficient (Pearson by default; pass method = "spearman" for Spearman)
r <- cor(df$Sales,df$Advertise)

such that

> r
[1] -0.5624524

DATA

df <- structure(list(Year.Sales.Advertise.Employees = c("1985 1.05 162 32", 
"1986 1.26 285 47", "1987 1.47 540 23", "1988 2.16 261 68", "1989 1.95 360 32", 
"1990 2.4 690 17", "1991 2.37 495 58", "1992 3.15 948 75", "1993 3.57 720 98", 
"1994 4.41 1.14 43", "1995 4.5 1.395 76", "1996 5.61 1.56 89", 
"1997 5.19 1.38 108", "1998 5.67 1.26 76", "1999 5.16 1.71 65", 
"2000 6.84 1.86 93")), class = "data.frame", row.names = c(NA, 
-16L))

> df
   Year.Sales.Advertise.Employees
1                1985 1.05 162 32
2                1986 1.26 285 47
3                1987 1.47 540 23
4                1988 2.16 261 68
5                1989 1.95 360 32
6                 1990 2.4 690 17
7                1991 2.37 495 58
8                1992 3.15 948 75
9                1993 3.57 720 98
10              1994 4.41 1.14 43
11              1995 4.5 1.395 76
12              1996 5.61 1.56 89
13             1997 5.19 1.38 108
14              1998 5.67 1.26 76
15              1999 5.16 1.71 65
16              2000 6.84 1.86 93

If you're asking for the data to be split into 4 discrete columns, this should do it.

The data in your question needed some cleaning, and it probably needs more (manual) cleaning: Advertise falls from 720 to 1.14 between 1993 and 1994. That is likely a change of units, from hundreds of thousands to millions, rather than a real drop.

x <- c("1985 1.05 162 32",
  "1986 1.26 285 47",
  "1987 1.47 540 23",
  "1988 2.16 261 68",
  "1989 1.95 360 32",
  "1990 2.4 690 17",
  "1991 2.37 495 58",
  "1992 3.15 948 75",
  "1993 3.57 720 98",
  "1994 4.41 1.14 43",
  "1995 4.5 1.395 76",
  "1996 5.61 1.56 89",
  "1997 5.19 1.38 108",
  "1998 5.67 1.26 76",
  "1999 5.16 1.71 65",
  "2000 6.84 1.86 93")

library(tidyverse)
# name the column explicitly rather than relying on as.data.frame() calling it '.'
clean_df <- tibble(raw = x) %>% 
  separate(raw,
           into = c('year','sales', 'advertise', 'empl'), 
           sep = ' ') %>%
  mutate(across(everything(), as.numeric))

cor(clean_df$sales, clean_df$advertise, method = 'spearman')
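If the unit change in Advertise is real, the correlation above is computed on mixed units and will be misleading. A rough sketch of one way to put the column on a single scale, under the assumption (not confirmed by the question) that values below 10 are in millions and the rest in thousands; the threshold of 10 and the names in_millions and advertise_k are made up for illustration:

```r
# Sales and Advertise values copied from the question
sales <- c(1.05, 1.26, 1.47, 2.16, 1.95, 2.4, 2.37, 3.15,
           3.57, 4.41, 4.5, 5.61, 5.19, 5.67, 5.16, 6.84)
advertise <- c(162, 285, 540, 261, 360, 690, 495, 948,
               720, 1.14, 1.395, 1.56, 1.38, 1.26, 1.71, 1.86)

# ASSUMPTION: values under 10 are in millions, everything else in thousands
in_millions <- advertise < 10
advertise_k <- ifelse(in_millions, advertise * 1000, advertise)  # all in thousands

rho <- cor(sales, advertise_k, method = "spearman")
```

Under that assumption the sign of the correlation flips from negative to positive, so it is worth confirming the units with whoever produced the data before reporting a result.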
