
Using multiple data frames and a lookup table to perform functions in R

I'm new to R and have a complicated set of data, so I hope my explanation is clear. I have multiple data frames that I need to use together to perform a series of operations. Here's one example. I have three data frames. One is a list of species names and corresponding codes:

>df.sp
    Species Code
    Picea   PI
    Pinus   CA

Another is a list of sites with species abundance data for different locations (dir). Unfortunately, the order of the species differs from row to row.

>df.site
Site  dir  total  t01  t02  t03  t04
2          Total  PI   CA   AB   T
2     N    9      1    5    na   na
2                 AB   ZI   PI   CA
2     S    5      2    2    1    4
3                 DD   EE   AB   YT
3     N    6      1    1    5    3
3                 AB   YT   EE   DD
3     S    5      4    3    1    1

I also have a data frame of traits corresponding to the species:

>df.trait
Species  leaft  rootl
Picea     0.01    1.2
Pinus     0.02    3.5

One example of what I want to do is get the average value of each trait (df.trait$leaft and df.trait$rootl) across all the species per site (df.site$Site) and per site location (df.site$dir, e.g. N, S). So the result for the first row would be:

Site dir leaft rootl
2    N   0.015  2.35

I hope that makes sense. It is very complicated for me to think through how to go about this. I've attempted working from this post and this one (and many others) but got lost. Thanks for the help, it's really appreciated.

UPDATE: Here is a sample of the actual df.site (reduced) using dput:

> dput(head(df.site))
structure(list(Site = c(2L, 2L, 2L, 2L, 2L, 2L), dir = c("rep17316", 
"N", "", "S", "", "SE"), total = c("Total", "9", "", 
"10", "", "9"), t01 = c("PI", "4", "CA", "1", "SILLAC", 
"3"), t02 = c("CXBLAN", "3", "ZIZAUR", "4", "OENPIL", "2"), 
    t03 = c("ZIZAPT", "1", "ECHPUR", "2", "ASCSYR", "2")), .Names = c("site", "dir", "total", "t01", "t02", "t03"), row.names = 2:7, class = "data.frame")

You're going to have to first wrangle your data into a much cleaner form. I'm assuming the structure that you dput above is consistent throughout your df.site dataframe; namely, that rows are paired: the first row of each pair specifies the species codes and the second holds a count (or some other collected data?).
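Since everything that follows depends on that pairing, it may be worth checking it before reshaping. The following is only a minimal sketch in base R: it rebuilds a few columns of the dput sample by hand and tests that odd rows hold codes and even rows hold whole-number counts. The names is.count, odd, and pairs.ok are made up for illustration.

```r
# A few columns of the dput sample, rebuilt by hand for illustration
df <- data.frame(
  site  = rep(2L, 6),
  dir   = c("rep17316", "N", "", "S", "", "SE"),
  total = c("Total", "9", "", "10", "", "9"),
  t01   = c("PI", "4", "CA", "1", "SILLAC", "3"),
  stringsAsFactors = FALSE
)

# TRUE where t01 looks like a whole-number count rather than a species code
is.count <- grepl("^[0-9]+$", df$t01)

# Odd rows (1, 3, 5, ...) should be codes, even rows should be counts
odd <- seq_len(nrow(df)) %% 2 == 1
pairs.ok <- all(!is.count[odd]) && all(is.count[!odd])
pairs.ok
```

If pairs.ok comes back FALSE on the full data, some rows break the code/count pattern (missing values written as "na" would be one cause) and need handling before the steps below.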

Starting with df as the dataframe that you dput() above, I'll first simulate some data for the other two dataframes:

df.sp <- data.frame(Species = paste0("species",1:8),
                    Code = c("ECHPUR", "CXBLAN", "ZIZAPT",
                             "CAMROT", "SILLAC", "OENPIL",
                             "ASCSYR", "ZIZAUR"))
df.sp
#>    Species   Code
#> 1 species1 ECHPUR
#> 2 species2 CXBLAN
#> 3 species3 ZIZAPT
#> 4 species4 CAMROT
#> 5 species5 SILLAC
#> 6 species6 OENPIL
#> 7 species7 ASCSYR
#> 8 species8 ZIZAUR

df.trait <- data.frame(Species = paste0("species",1:8),
                       leaft = round(runif(8, max=.2), 2),
                       rootl = round(runif(8, min=1, max=4),1))

df.trait
#>    Species leaft rootl
#> 1 species1  0.12   2.5
#> 2 species2  0.04   2.6
#> 3 species3  0.12   2.1
#> 4 species4  0.05   1.1
#> 5 species5  0.15   2.5
#> 6 species6  0.15   3.3
#> 7 species7  0.05   3.9
#> 8 species8  0.13   2.1

First, let's clean up df by removing the second row of each pair, which contains the collected data, and moving those values into a new set of columns:

library(dplyr)

df.clean <- df %>% 
  #for each row, copy the direction and total from the following row
  mutate(across(matches("dir|total"), lead)) %>% 
  #create new "_n" columns for observed data, filled from the following row
  mutate(across(matches("t\\d+$"), lead, .names = "{.col}_n")) %>% 
  #keep only the rows that hold a species code in t01
  filter(t01 %in% df.sp$Code) %>% 
  #drop "total" column (doesn't make sense after reshape)
  select(-total)

df.clean
#>   site dir    t01    t02    t03 t01_n t02_n t03_n
#> 1    2   N ECHPUR CXBLAN ZIZAPT     4     3     1
#> 2    2   S CAMROT ZIZAUR ECHPUR     1     4     2
#> 3    2  SE SILLAC OENPIL ASCSYR     3     2     2

We now have two sets of corresponding columns, holding species codes and values respectively. To reshape the dataframe into long form we'll use melt() from the data.table package. See the responses to this question for other examples of how to do this.

library(data.table)

df.clean <- df.clean %>% 
  setDT() %>% #convert to data.table to use data.table::melt
  melt(measure.vars = patterns("t\\d+$", "_n$"),
       value.name = c("Code", "Count") ) %>% 
  #drop "variable" column, which isn't needed
  select(-variable)
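To see what this reshape produces, here is a minimal sketch on a hand-made two-row table (the names wide and long and the values are illustrative only). Each pair of code/count columns becomes one row, and the counts are converted from character to integer afterwards, since they were read from the same columns as the species codes.

```r
library(data.table)

# Hypothetical wide table: two code columns and their matching "_n" count columns
wide <- data.table(
  site  = c(2L, 2L),
  dir   = c("N", "S"),
  t01   = c("ECHPUR", "CAMROT"),
  t02   = c("CXBLAN", "ZIZAUR"),
  t01_n = c("4", "1"),
  t02_n = c("3", "4")
)

# Melt both column groups at once: codes into "Code", counts into "Count"
long <- melt(wide,
             measure.vars = patterns("t\\d+$", "_n$"),
             value.name = c("Code", "Count"))

# Counts are still character here, so convert them before doing arithmetic
long[, Count := as.integer(Count)]
long
```

The result has one row per site, direction, and species column, with a "variable" column recording which column pair each row came from.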

Finally, join your three dataframes:

#merge tables together
df.summaries <- df.clean %>% 
  left_join(df.sp, by = "Code") %>% 
  left_join(df.trait, by = "Species")
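Because left_join() keeps every row of the left-hand table, any code that is missing from the lookup table ends up with NA for Species and for the traits. The dput sample contains codes such as "PI" and "CA" that are not in the simulated df.sp, so it may help to list unmatched codes first. A minimal sketch with anti_join(), using small hypothetical tables (codes and lookup are made-up names):

```r
library(dplyr)

# Hypothetical codes observed in the site data
codes <- data.frame(Code = c("ECHPUR", "PI", "CA"), stringsAsFactors = FALSE)

# Hypothetical lookup table with only one of those codes
lookup <- data.frame(Species = "species1", Code = "ECHPUR",
                     stringsAsFactors = FALSE)

# Rows of `codes` that have no match in `lookup`
unmatched <- anti_join(codes, lookup, by = "Code")
unmatched
```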

At this point you should be able to summarize your data by whatever groupings you are interested in, using group_by and summarise.
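For the per-site, per-direction trait averages asked about in the question, that last step might look like the sketch below. It uses a small hand-made stand-in for df.summaries (the values are illustrative, and the "PI" row mimics a code with no match in the lookup table), and the output names trait.means, mean_leaft, and mean_rootl are made up here. It assumes dplyr 1.0 or later for the .groups argument.

```r
library(dplyr)

# Hypothetical stand-in for the joined table: one row per site, direction, species
df.summaries <- data.frame(
  site  = c(2L, 2L, 2L, 2L, 2L),
  dir   = c("N", "N", "N", "S", "S"),
  Code  = c("ECHPUR", "CXBLAN", "PI", "CAMROT", "ZIZAUR"),
  leaft = c(0.12, 0.04, NA, 0.05, 0.13),
  rootl = c(2.5, 2.6, NA, 1.1, 2.1),
  stringsAsFactors = FALSE
)

# Unweighted mean of each trait across the species in each site/direction;
# na.rm = TRUE skips species that had no trait data after the join
trait.means <- df.summaries %>%
  group_by(site, dir) %>%
  summarise(
    mean_leaft = mean(leaft, na.rm = TRUE),
    mean_rootl = mean(rootl, na.rm = TRUE),
    .groups = "drop"
  )
trait.means
```

The summary columns are deliberately given new names: inside summarise(), a column that has just been redefined replaces the original for the rest of the call, which gets in the way if a later expression (a count-weighted mean, say) still needs the per-species values.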
