Need advice on using R to clean up data

I have multiple CSV files in the same format that I need to combine, but before that:

  1. The header is not the first row but the 4th row. Should I remove the first 3 rows with skip, or should I reassign the header?
  2. I need to add a column holding the ID of the file (same as the file name) before I combine.
  3. Then I need to extract only 4 columns from a total of 7.
  4. Sum up the numbers under a category.
  5. Combine all CSV files into one.

This is what I have so far. I do steps 1, 3 and 4, then step 2 to add the column, then step 5. I am not sure whether I should add the ID column first.

files = list.files(pattern = "*.csv", full.names = TRUE)

library("tidyverse")
library("dplyr")

data = data.frame()

for (file in files){
    temp <- read.csv(file, skip=3, header = TRUE)
    colnames(temp) <- c("Volume", "Unit", "Category", "Surpass Object", "Time", "ID")
    temp <- temp [, c("Volume", "Category", "Surpass Object")]
    temp <- subset(temp, Category =="Surface")
    mutate(id = file)
    aggregate(temp$Volume, by=list(Category=temp$Category), FUN=sum)
    
}

And I got an error:

Error in is.data.frame(.data) : 
  argument ".data" is missing, with no default

The code runs fine if I leave out the mutate line, so I think the main problem comes from there, but any advice will be appreciated.

I am quite new to R and really appreciate all the comments that I can get here.

Thanks in advance!

Since you appear to be trying to use dplyr, I'll stick with that theme.

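library(dplyr)
library(purrr)
# list.files() takes a regular expression, so match the ".csv" extension explicitly
files = list.files(pattern = "\\.csv$", full.names = TRUE)
# read every file (skipping the 3 rows above the header) and stack them, tagging each row with its file
results <- map_dfr(setNames(nm = files), ~ read.csv(.x, skip=3, header=TRUE), .id = "filename") %>%
  select(filename, Category, Volume) %>%   # add any other columns you need to keep
  group_by(filename, Category) %>%
  summarize(Volume = sum(Volume))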

Walk-through:演练:

  1. purrr::map_dfr iterates our function ( read.csv(...) ) over each of the inputs (each file in files ) and row-concatenates the results. Since we named the files with themselves ( setNames(nm = files) is akin to names(files) <- files ), we can use .id = "filename", which adds a "filename" column recording which file each row came from.

  2. select(...) keeps whatever four columns you said you needed. Frankly, since you're aggregating, we really only need c("filename", "Category", "Volume"); if you need anything else, you have likely left something out of your explanation.

  3. group_by(...) lets us get one row for each filename and each Category, where Volume is a sum (calculated in the next step, summarize).
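To make the walk-through concrete, here is a minimal, self-contained sketch of the same pipeline. The two demo files, their contents, and the helper write_demo() are made up for illustration; the real files will have different columns and values. It also adds basename() so the filename column holds just the file name rather than the full path, which is an optional extra and not part of the answer above.

```r
library(dplyr)
library(purrr)

# Hypothetical sample data: two small CSV files, each with three junk
# lines above the real header row (mimicking the layout in the question).
tmp_dir <- tempfile("csv_demo_")
dir.create(tmp_dir)

write_demo <- function(name, volumes) {
  lines <- c("junk line 1", "junk line 2", "junk line 3",
             "Volume,Unit,Category",
             paste(volumes, "um^3", c("Surface", "Surface", "Spot"), sep = ","))
  writeLines(lines, file.path(tmp_dir, name))
}
write_demo("a.csv", c(1, 2, 3))
write_demo("b.csv", c(10, 20, 30))

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file, stack the rows, and record the source file in "filename".
results <- map_dfr(setNames(nm = files),
                   ~ read.csv(.x, skip = 3, header = TRUE),
                   .id = "filename") %>%
  mutate(filename = basename(filename)) %>%   # optional: drop the directory part
  select(filename, Category, Volume) %>%
  group_by(filename, Category) %>%
  summarize(Volume = sum(Volume), .groups = "drop")

print(results)
```

The result has one row per file and category, so the per-file "Surface" total can be picked out with an ordinary filter().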

You can use read.csv(), but if there are many files, I suggest using fread() from the data.table package. It is significantly faster. I used fread() here, but it will still work if you switch it out for read.csv(). fread() is also more advanced: you will find that even arguments like skip can sometimes be left out and the file will still be read correctly.

library(tidyverse)
library(data.table)

add_filename <- function(flnm){
    fread(flnm, skip = 3) %>%   # read file
    mutate(id = basename(flnm)) # creates new col id w/ basename of the file 
}

# single data frame of all CSVs; id in first col
df <- list.files(pattern = "\\.csv$", full.names = TRUE) %>%
    map_df(add_filename) %>%   # pass the function itself; ~add_filename would return it instead of calling it
    select(id, Volume, Category, `Surpass Object`)
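As a side note on skip, fread() accepts either a line count or a string to search for, which can be handy if the number of junk lines differs between files. The sketch below uses made-up file contents passed through the text argument, so nothing is read from disk; the column names are assumptions based on the question.

```r
library(data.table)

# Hypothetical file contents: three junk lines, then the header row.
demo_lines <- c("Exported by an imaging tool",
                "Sample: demo",
                "Date: unknown",
                "Volume,Unit,Category",
                "1.5,um^3,Surface",
                "2.5,um^3,Surface")

# Skip a fixed number of lines before the header ...
by_count <- fread(text = demo_lines, skip = 3)

# ... or start reading at the first line that contains the given string.
by_match <- fread(text = demo_lines, skip = "Volume,Unit")

print(by_match)
```

Both calls should give the same table here; the string form simply avoids hard-coding the number of lines to skip.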

I get the impression that you wanted to aggregate but also keep the consolidated data frame. If that's the case, keep the aggregation separate from building the data frame.

df %>%       # not assigned to a new object, so only shown in console
    filter(Category == "Surface") %>%  # filter for the category desired
    {sum(.$Volume)}                    # sum the remaining values for volume

In case you are not aware, the period in that last call stands for the data carried forward, in this case the filtered data. The simplest way (perhaps not the best way) to explain the {} is that sum() is not designed to handle data frames and therefore isn't inherently friendly with dplyr piping.
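A small sketch of what the braces do, using a made-up data frame. Without the braces, %>% would still pass the whole data frame as the first argument of sum() in addition to .$Volume, which fails when the data frame has non-numeric columns. An alternative that avoids the braces is to extract the column as a vector with pull() first.

```r
library(dplyr)

# Hypothetical data standing in for the combined data frame.
demo <- data.frame(Category = c("Surface", "Surface", "Spot"),
                   Volume = c(1, 2, 10))

# Braces stop %>% from inserting the data frame as the first argument of sum();
# inside them, the dot is the filtered data frame.
total_braces <- demo %>%
  filter(Category == "Surface") %>%
  { sum(.$Volume) }

# Same result without braces: pull the column out as a vector, then sum it.
total_pull <- demo %>%
  filter(Category == "Surface") %>%
  pull(Volume) %>%
  sum()

print(c(total_braces, total_pull))
```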

If you wanted the sum of volume for every category instead of only the "Surface" category that you had coded in your question, then you would use this instead:

df %>% 
    group_by(Category) %>%
    summarise(sum(Volume))

Notice I used the British spelling of summarize here. The function summarize() is in a lot of packages. I have just found it easier to use the British spelling whenever I want to make sure it's the dplyr function that I've called. (tidyverse accepts the American and British spelling for nearly all functions, I think.)
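If the concern is another package masking the function, a more explicit option is to name the package in the call. This is a sketch with a made-up data frame; the dplyr:: prefix works for either spelling.

```r
library(dplyr)

# Hypothetical data standing in for the combined data frame.
demo <- data.frame(Category = c("Surface", "Surface", "Spot"),
                   Volume = c(1, 2, 10))

# The dplyr:: prefix guarantees the dplyr version is used, even if another
# attached package also defines summarise() or summarize().
totals <- demo %>%
  group_by(Category) %>%
  dplyr::summarise(Volume = sum(Volume), .groups = "drop")

print(totals)
```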
