
R strip split a column in dataframe

I have a data frame, 'data', with multiple columns. One of them, 'Runtime', holds values in two formats:

Runtime
1 h 10 min
67 min
1 h 0 min
86 min
97 min

I want to convert all of them into minutes. I have tried 'strsplit' and 'str_split_fixed'. Can anyone show me a way to achieve this, by splitting or by any other method?

Thank you in advance!

I think I saw this kind of solution somewhere. Don't hit me.

df = data.frame(Runtime = c('1 h 10 min', '67 min', '1 h 0 min', '86 min', '97 min'))

df$exp <- gsub("h", "* 60 +", df$Runtime)
df$exp <- gsub("min", "* 1", df$exp)

sapply(df$exp, FUN = function(x) eval(parse(text = x)))

1 * 60 + 10 * 1          67 * 1  1 * 60 + 0 * 1          86 * 1          97 * 1 
             70              67              60              86              97 

You can do it in one call using gsubfn and a regex:

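library(gsubfn)
# input vector; the values are taken from the question
x <- c('1 h 10 min', '67 min', '1 h 0 min', '86 min', '97 min')
gsubfn("^(?:(\\d+)\\s*h)?\\s*(\\d+)\\s*min.*$",
 ~ sum(as.numeric(x) * 60, as.numeric(y), as.numeric(z), na.rm=TRUE), x)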
#[1] "70" "67" "60" "86" "97"
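
If gsubfn is not available, the same pattern can be applied with base R's sub(). This is only a minimal sketch: to_minutes is a hypothetical helper name, not part of the answer above, and it returns numeric values rather than character strings.

```r
# Hypothetical helper: convert "x h y min" or "y min" strings to minutes.
to_minutes <- function(runtime) {
  runtime <- as.character(runtime)
  pattern <- "^(?:(\\d+)\\s*h)?\\s*(\\d+)\\s*min.*$"
  # Group 1 is the optional hour count; it comes back as "" when absent.
  hours <- as.numeric(sub(pattern, "\\1", runtime, perl = TRUE))
  # Group 2 is the minute count.
  mins <- as.numeric(sub(pattern, "\\2", runtime, perl = TRUE))
  # Treat a missing hour part as zero hours.
  hours[is.na(hours)] <- 0
  60 * hours + mins
}

to_minutes(c('1 h 10 min', '67 min', '1 h 0 min', '86 min', '97 min'))
```

Strings that do not match the pattern at all are left unchanged by sub() and therefore come back as NA with a coercion warning.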

Here's an example of how you can do it:

# setting up your data.frame of interest
df = data.frame(Runtime = c('1 h 10 min', '67 min', '1 h 0 min', '86 min', '97 min'))



# remove the "min" labels
df$Runtime = gsub(' min', '', df$Runtime)
# which values are in an "x h y min" format?
hrs = grepl('h', x = df$Runtime)
# convert the "x h y" entries into numeric values in minutes
runtime_sub = sapply(strsplit(df[hrs, 'Runtime'], ' h '), function(i) sum(as.numeric(i) * c(60, 1)))
# convert the vector to numeric; the "x h y" entries become NA with a coercion warning, which is expected
df$Runtime = as.numeric(df$Runtime)
# overwrite those NA entries with the converted values
df[hrs, 'Runtime'] = runtime_sub

This results in:

  Runtime
1      70
2      67
3      60
4      86
5      97

1) Read df[[1]] with read.table. If the third column is NA, the first column gives the minutes; otherwise, 60 times the first column plus the third column gives the minutes:

with(read.table(text = as.character(df[[1]]), fill = TRUE), 
        ifelse(is.na(V3), V1, 60*V1 + V3))
## [1] 70 67 60 86 97
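
One caveat with fill = TRUE: read.table decides the number of columns from the first five lines of input, so if none of those lines contains an hour part, V3 may not exist. A hedged sketch of a workaround is to pass col.names explicitly. The runtime vector below is made-up illustrative data, not from the question.

```r
# Made-up input: the only "x h y min" entry appears after the fifth line.
runtime <- c('67 min', '86 min', '97 min', '45 min', '30 min', '1 h 10 min')

# Force four columns so that V3 always exists, whatever the first lines look like.
parsed <- read.table(text = runtime, fill = TRUE,
                     col.names = paste0("V", 1:4))

# Same rule as above: no third field means V1 already holds the minutes.
minutes <- with(parsed, ifelse(is.na(V3), V1, 60 * V1 + V3))
minutes
```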

2) A variation is to paste "0 h" at the beginning of each component that does not contain an h, giving hm, and then read hm and compute 60 times the first column plus the third column.

hm <- paste(ifelse(grepl("h", df[[1]]), "", "0 h"), df[[1]])
with(read.table(text = hm), 60 * V1 + V3)
## [1] 70 67 60 86 97
