
How to find common variables in a list of datasets & reshape them in R?

    setwd("C:\\Users\\DATA")
    temp = list.files(pattern="*.dta")
    for (i in 1:length(temp)) assign(temp[i], read.dta13(temp[i], nonint.factors = TRUE))
    grep(pattern="_m", temp, value=TRUE)

Here I create a list of my datasets and read them into R. I then attempt to use grep to find all variable names matching the pattern _m. Obviously this doesn't work, because it simply returns all file names matching _m. What I essentially want is for my code to loop through the list of datasets, find variables ending in _m, and return a list of the datasets that contain these variables.

I'm quite unsure how to do this, as I'm quite new to coding and R.

Apart from needing to know which datasets contain these variables, I also need to be able to make changes to (reshape) these variables.

Here is one way to figure out which files have variables with names ending in "_m":

    # setup
    library(readstata13)  # provides read.dta13
    setwd("C:\\Users\\DATA")
    temp <- list.files(pattern = "\\.dta$")
    # logical vector to be filled in
    inFileVec <- logical(length(temp))

    # loop through each file
    for (i in seq_along(temp)) {
      # read file
      fileTemp <- read.dta13(temp[i], nonint.factors = TRUE)

      # fill in vector with TRUE if any variable name ends in "_m"
      inFileVec[i] <- any(grepl("_m$", names(fileTemp)))
    }

In the final line of the loop, names returns the variable names, grepl returns a logical vector indicating whether each variable name matches the pattern, and any returns a logical vector of length 1 indicating whether grepl returned at least one TRUE.
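To see what each of the three calls returns, here is a minimal sketch on a made-up data frame. The column names `age`, `income_m` and `score_month` are invented for illustration and do not come from the original data; `score_month` is included only to show why the `$` anchor matters.

```r
# Toy data frame standing in for one imported file (column names are hypothetical)
fileTemp <- data.frame(age = c(30, 41), income_m = c(1, 0), score_month = c(0, 1))

varNames   <- names(fileTemp)          # character vector of column names
endsInM    <- grepl("_m$", varNames)   # TRUE only where the name ends in "_m"
containsM  <- grepl("_m", varNames)    # unanchored: also TRUE for "score_month"
hasM       <- any(endsInM)             # single TRUE/FALSE for the whole data frame

print(varNames)
print(endsInM)
print(containsM)
print(hasM)
```

Without the `$`, the pattern matches "_m" anywhere in the name, which is probably not what is wanted when the variables of interest are defined by their suffix.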

    # print out these file names
    temp[inFileVec]

First, assign may not work as you think: it expects the variable name as a string (or character, as strings are called in R), and if it is given a character vector with more than one element it uses only the first element as the variable name (see here for more info).

What you can do depends on the structure of your data. read.dta13 will load each file as a data.frame.
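One common alternative to assign is to keep all the data frames in a single named list, which can then be looped over directly. The sketch below is only an illustration: `readFake` is a hypothetical stand-in for read.dta13 so that the code runs without any files, and the file names and columns are made up.

```r
# Hypothetical file names, standing in for the result of list.files()
temp <- c("wave1.dta", "wave2.dta", "wave3.dta")

# Hypothetical stand-in for read.dta13(): returns a small made-up data frame
readFake <- function(f) {
  switch(f,
    "wave1.dta" = data.frame(id = 1:2, income_m = c(10, 20)),
    "wave2.dta" = data.frame(id = 1:2, age = c(30, 41)),
    "wave3.dta" = data.frame(id = 1:2, score_m = c(0, 1), month = c(1, 2))
  )
}

# Read everything into one list and name each element after its file
dataList <- lapply(temp, readFake)
names(dataList) <- temp

# For each data frame, collect the variable names ending in "_m"
mVars <- lapply(dataList, function(d) grep("_m$", names(d), value = TRUE))

# Keep only the files that contain at least one such variable
filesWithM <- names(mVars)[lengths(mVars) > 0]
print(filesWithM)
```

With real data, the call to `readFake(f)` would be replaced by `read.dta13(f, nonint.factors = TRUE)`. The list `mVars` then records which matching variables each file contains, and each data frame stays reachable as `dataList[["wave1.dta"]]`.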

If you are looking for column names, you can do something like this:

    library(readstata13)  # provides read.dta13

    myList <- character()
    for (i in seq_along(temp)) {

        # save the content of your file in a data frame
        df <- read.dta13(temp[i], nonint.factors = TRUE)

        # identify the names of the columns ending in "_m"
        varMatch <- grep(pattern = "_m$", colnames(df), value = TRUE)

        # check whether at least one column matches the pattern
        if (length(varMatch) > 0) {
            myList <- c(myList, temp[i]) # save the file name if there is a match
        }

    }
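The question also asks about reshaping the matched variables, which the loop above does not cover. As one possible sketch, base R's reshape can stack all columns ending in _m into long format. The data frame, its `id` column and the output column names `variable` and `value` are made up for illustration, and the right target shape depends on the real data.

```r
# Made-up wide data frame standing in for one imported file
wide <- data.frame(
  id       = 1:3,
  height_m = c(1.6, 1.7, 1.8),
  weight_m = c(60, 70, 80)
)

# Names of the columns ending in "_m"
mCols <- grep("_m$", names(wide), value = TRUE)

# Stack the "_m" columns: one row per id and original variable
long <- reshape(
  wide,
  direction = "long",
  varying   = mCols,       # columns to stack
  v.names   = "value",     # name of the stacked value column
  timevar   = "variable",  # column recording which variable a row came from
  times     = mCols,
  idvar     = "id"
)
rownames(long) <- NULL

print(long)
```

Each of the three ids now appears once per "_m" variable, so the result has six rows with the columns `id`, `variable` and `value`.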

If you are looking at the content of a column instead, have a look at the dplyr package, which is very useful for manipulating data frames.

A good introduction to dplyr is available in the package vignette here.

Note that in R, appending to a vector inside a loop can become very slow (see this SO question for more details).
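One way to avoid growing a vector is to compute a logical result of known length and index with it. The following is a minimal sketch, assuming the data frames are already held in a named list; the list contents and file names are made up for illustration.

```r
# Made-up data frames standing in for the imported files (names are hypothetical)
dataList <- list(
  "a.dta" = data.frame(x_m = 1:2),
  "b.dta" = data.frame(y = 1:2),
  "c.dta" = data.frame(z_m = 1:2, w = 3:4)
)

# vapply returns a logical vector of fixed length, one element per data frame,
# so nothing is appended inside a loop
hasM <- vapply(dataList, function(d) any(grepl("_m$", names(d))), logical(1))

# Logical indexing then gives the matching file names in one step
myList <- names(dataList)[hasM]
print(myList)
```

For a handful of files the difference is negligible, so this matters mainly when the number of files is large.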
