
String-transform vector-elements in data.frame

I have a huge data frame df, in which one column holds a 'year-month' value of the form "YYYYMM". Currently the data type is a number. Snapshot:

> df[[1]][1:10]
[1] 201001 201001 201001 201001 201001 201001 201001 201001 201001 201001
> str(df)
'data.frame':   2982393 obs. of  11 variables:
 $ YearMonth    : int  201001 201001 201001 201001 201001 201001 201001 201001 201001 201001 ...
 $ ...

What I want is to transform this value to a string (eventually to a factor) in the form "YYYY-MM", to be able to compare this with other data frames.

I'm struggling to find an easy way to transform the value.

I tried using as.Date and the format function. But as the values do not have any days, it didn't work for strings. With numerics (same with the data frame column) I even got other problems.

> as.Date("201001", format = "%Y%m")
 [1] NA

> as.Date(201001, format = "%Y%m")
 Error in as.Date.numeric(201001, format = "%Y%m") : 
    'origin' must be supplied
> as.Date(df[[1]], format = "%Y%m")
 Error in as.Date.numeric(df[[1]], format = "%Y%m") : 
    'origin' must be supplied

I'm able to transform just one value, using substrings and string concatenation. I wrote the function below to handle one element:

transformString <- function( x ) { # x = value
    return ( paste(cbind(substring(x, 1, 4),"-",substring(x,5,6)), collapse = '') )
}

Problem: I didn't find an easy way to apply that function to a whole column of a data.frame, other than just traversing all elements:

transformStringVector <- function( x ) { # x = vector
    for(i in 1:length(x)) {
       x[i]<-transformString(x[i])
    }
    return ( x )
}

This is far from elegant and bad for performance. I tried to use apply (see below) and similar functions, but was confronted with errors... (I admit I do not really get the apply function)

> temp <- apply(df[[1]], 1, transformString )
Error in apply(df[[1]], 1, transformString ) : 
  dim(X) must have a positive length

Does anybody have an alternative for this transformation within a huge data.frame? Or, more generally, an easy way to apply string-like transformations to elements within a data.frame?

The reason why

> as.Date("201001", format = "%Y%m")
 [1] NA

doesn't work is that an R date needs a day component. Since your date doesn't provide one, you get a missing value. To circumvent this, just add a day component:

R> x = c("201001","201102")
R> x = paste(x, "01", sep="")

So I've made all the dates the first of the month:

R> y = as.Date(x, "%Y%m%d")
[1] "2010-01-01" "2011-02-01"

You can then use format to get what you want:

R> format(y, "%Y-%m")
[1] "2010-01" "2011-02"
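The same steps can be applied directly to integer input like the column in the question. This is only a minimal sketch: the vector name ym and its values are made up for illustration, and paste0() is used as shorthand for paste(..., sep="").

```r
# Hypothetical integer input, mirroring the YearMonth column in the question
ym <- c(201001L, 201102L)

# Append a day so as.Date() has a complete date to parse
d <- as.Date(paste0(ym, "01"), format = "%Y%m%d")

# Format the dates back to year-month text
ym_text <- format(d, "%Y-%m")
ym_text
```

Going through Date is mostly useful if you also need real dates later on; for a pure text conversion it does more work than necessary.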

If you're just looking to transform the column values into a string in the specified format and don't care about having the date format, substr() and paste() both take vectors as arguments:

xx<-c(201011,201003,201002,201010,201009,201005,201001,201001,201001,201001)

paste(substr(xx,1,4),substr(xx,5,6),sep="-")
# [1] "2010-11" "2010-03" "2010-02" "2010-10" "2010-09" "2010-05" "2010-01"
# [8] "2010-01" "2010-01" "2010-01"

In this way, you don't have to use apply().
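For context on the error in the question: apply() works over the margins of a matrix or array, and a plain vector has no dimensions. If an element-wise loop is really wanted, vapply() is one option. The sketch below assumes a hypothetical one-element helper, hyphenate_one(), similar in spirit to transformString().

```r
xx <- c(201011, 201003, 201002)

# A plain vector has no dim attribute, so apply() has no margin to work over
is.null(dim(xx))

# Hypothetical one-element helper, similar in spirit to transformString()
hyphenate_one <- function(x) paste0(substr(x, 1, 4), "-", substr(x, 5, 6))

# vapply() calls the helper per element and checks each result is one string
looped <- vapply(xx, hyphenate_one, character(1), USE.NAMES = FALSE)

# The vectorized call produces the same values without an explicit loop
vectorized <- paste(substr(xx, 1, 4), substr(xx, 5, 6), sep = "-")

identical(looped, vectorized)
```

On a column with millions of rows the vectorized form is usually the better choice, since it avoids one function call per element.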

To answer your question about applying this to a data.frame specifically, you could access the column using the $ operator. So you could use either of the functions offered here (I would have gone with the substr variant) to do it. If you're planning to convert to a factor, I'd do that first.

> df <- data.frame(a=1:5,b=5:1,d=200101:200105)
> df
  a b      d
1 1 5 200101
2 2 4 200102
3 3 3 200103
4 4 2 200104
5 5 1 200105
> #Convert to a factor now for performance reasons.
> df$d <- as.factor(df$d)
> df$d <- paste(substr(df$d, 1, 4), "-", substr(df$d, 5,6), sep="")
> df
  a b       d
1 1 5 2001-01
2 2 4 2001-02
3 3 3 2001-03
4 4 2 2001-04
5 5 1 2001-05

> typeof(df$d)
[1] "character"
> df$d <- as.factor(df$d)
> df
  a b       d
1 1 5 2001-01
2 2 4 2001-02
3 3 3 2001-03
4 4 2 2001-04
5 5 1 2001-05
> typeof(df$d)
[1] "integer"
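The "integer" result may look surprising: a factor stores integer codes plus a set of character levels, and typeof() reports the storage type. A small sketch, with a made-up factor f, showing how the pieces relate:

```r
# Made-up factor with a repeated value
f <- factor(c("2001-01", "2001-02", "2001-01"))

typeof(f)        # storage type of the underlying codes
class(f)         # the class that makes it behave as a factor
levels(f)        # the distinct labels
as.integer(f)    # the codes, which index into levels(f)
as.character(f)  # the labels written out per element
```

Use class() or is.factor() rather than typeof() when you want to check whether a column is a factor.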

Note that depending on how "huge" your data.frame is, you might get better performance by converting to a factor first, then just converting the levels to hyphenated dates.

> df <- data.frame(a=rep(1:5,1000000),b=rep(5:1,1000000),d=rep(200101:200105, 1000000))
> nrow(df)
 [1] 5000000
> # Hyphenate first
> system.time(df$d <- paste(substr(df$d, 1, 4), "-", substr(df$d, 5,6), sep="")) + system.time(df$d <- as.factor(df$d))
  user  system elapsed 
  9.65    0.61   10.31 
>
> #Factor first
> system.time(df$d <- as.factor(df$d)) + system.time(levels(df$d) <- paste(substr(levels(df$d), 1, 4), "-", substr(levels(df$d), 5,6), sep=""))
 user  system elapsed 
 0.68    0.25    0.93 

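So, depending on the properties of your data.frame, you may be able to improve performance roughly 10x by converting to a factor first. The gain comes from hyphenating only the distinct levels instead of every row, so it is largest when the column has few distinct values.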
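The factor-first idea can be wrapped in a small function. This is a sketch only: the name hyphenate_levels() is hypothetical, and it assumes every value has the six-digit YYYYMM form.

```r
# Hypothetical helper: relabel only the distinct values, not every row
hyphenate_levels <- function(x) {
  f <- as.factor(x)
  lv <- levels(f)
  levels(f) <- paste(substr(lv, 1, 4), substr(lv, 5, 6), sep = "-")
  f
}

# Small made-up column with repeated year-month values
ym <- rep(200101:200105, 3)
fast <- hyphenate_levels(ym)

# Reference result: hyphenate every element, then convert to a factor
slow <- as.factor(paste(substr(ym, 1, 4), substr(ym, 5, 6), sep = "-"))

# Both routes should yield the same labels and the same levels
identical(as.character(fast), as.character(slow))
identical(levels(fast), levels(slow))
```

It could then be used as df$d <- hyphenate_levels(df$d) on the example data frame above.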

PS If you really care about performance, you might be able to speed up the factoring step (the slowest part of the fast solution) by using a hash-backed factor.
