Split one column into two columns while retaining the separator

I have a very large data frame:

'data.frame':   40525992 obs. of  14 variables:    
 $ INSTNM     : Factor w/ 7050 levels "A   W Healthcare Educators"     
 $ Total      : Factor w/ 3212 levels "1","10","100",    
 $ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",    
 $ Count      : num  0 0 0 0 0 0 0 0 0 0 ...

The Crime_Type column contains the type of crime and the year, so "MURD11" is Murder in 2011. These are college campus crime statistics that my kid is analyzing for her school project; I help when she is stuck. I am currently stuck at creating a clean data file she can analyze.

Once I converted the wide file (all crime types '9' in columns) to a long file using 'gather', the file size went from 300 MB to 8 GB. The file I am working on is 8 GB. Do you think that is the problem? How do I convert it to a data.table for faster processing?

What I want to do is split this 'Crime_Type' column into two columns, 'Crime_Type' and 'Year'. The values contain both letters and digits. Some also contain special characters, such as the underscore in NEG_M, which is 'Negligent Manslaughter'.

We will replace the codes with the full names later, but can someone suggest how I can separate

MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)

etc...

I have tried using:

df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")

The first one separates the Crime as it looks for numbers. The second one does not work at all.

I also tried:

df$Crime_Type<- apply (strsplit(as.character(df$Crime_Type), split="[:digit:]"))

That does not work at all. I have gone through many posts on Stack Overflow, which is where I got these commands, but I am now truly stuck and would appreciate your help.

Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.

library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
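
As a quick check on a toy input, the sketch below runs the same extract call on a two-row data frame shaped like the question's column. The sample rows are made up for illustration, and convert = TRUE is an optional addition (not in the call above) that turns the captured year into an integer.

```r
library(tidyr)

# Toy data shaped like the question's Crime_Type column (illustrative values)
toy <- data.frame(Crime_Type = c("MURD11", "NEG_M10"), stringsAsFactors = FALSE)

# Group 1 captures the leading non-digits, group 2 the trailing digits.
# convert = TRUE applies type conversion to the new columns, so Year is integer.
res <- extract(toy, Crime_Type, into = c("Crime", "Year"),
               regex = "^([^0-9]+)([0-9]+)$", convert = TRUE)
print(res)
```

Rows that do not match the regex come back as NA in both new columns, which makes unexpected codes easy to spot.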

In base R, one option would be to create a unique delimiter between the non-numeric and numeric parts. We can capture the non-numeric ( [^0-9]+ ) and numeric ( [0-9]+ ) characters as groups by wrapping each pattern in parentheses ( (..) ). In the replacement, we use \\1 for the first capture group, followed by a comma ( , ) and the second group ( \\2 ). The result can be passed as the text argument of read.table with sep=',' to read it as two columns.

 df1 <- read.table(text=gsub('([^0-9]+)([0-9]+)', '\\1,\\2', 
                   totallong$Crime_Type),sep=",", col.names=c('Crime', 'Year'))
 df1
 #   Crime Year
 #1  MURD   11
 #2 NEG_M   11

If needed, we can cbind the result with the original dataset:

cbind(totallong, df1)
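
If the round trip through a temporary delimiter and read.table feels indirect, the same split can be sketched with two sub calls. This is an alternative to the approach above rather than a restatement of it, and it assumes every value is a run of non-digits followed by a run of digits. The input vector is illustrative.

```r
# Illustrative values in the same format as the question's Crime_Type column
x <- c("MURD11", "NEG_M10")

# Strip the trailing digits to keep the crime code
crime <- sub("[0-9]+$", "", x)

# Strip the leading non-digits to keep the year, then convert to integer
year <- as.integer(sub("^[^0-9]+", "", x))

res <- data.frame(Crime = crime, Year = year, stringsAsFactors = FALSE)
print(res)
```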

Or in base R, we can use strsplit with split specifying the boundary between a non-number ( (?<=[^0-9]) ) and a number ( (?=[0-9]) ). Here we use lookarounds to match the boundary. The output will be a list; we can rbind the list elements with do.call(rbind, ...) and convert the result to a data.frame.

as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type), 
                        split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
#     V1 V2
#1  MURD 11
#2 NEG_M 11
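
The result above has the default column names V1 and V2, and both columns hold text. A minimal sketch of the same lookaround split with named columns and an integer year follows. The input vector is illustrative, and the sketch assumes every value splits into exactly two pieces.

```r
# Illustrative values in the same format as the question's Crime_Type column
x <- c("MURD11", "NEG_M10")

# Split at the zero-width boundary between a non-digit and a digit
parts <- do.call(rbind, strsplit(x, "(?<=[^0-9])(?=[0-9])", perl = TRUE))

# Name the columns explicitly and convert the year to integer
res <- data.frame(Crime = parts[, 1], Year = as.integer(parts[, 2]),
                  stringsAsFactors = FALSE)
print(res)
```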

Another option is tstrsplit from the devel version of data.table, i.e. v1.9.5. Here we use the same regex. In addition, there is an option (type.convert) to convert the output columns to a different class.

library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type, 
    "(?<=[^0-9])(?=[0-9])",  perl=TRUE, type.convert=TRUE)]
#   Crime_Type Crime Year
#1:     MURD11  MURD   11
#2:    NEG_M11 NEG_M   11

If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL:

totallong[, Crime_Type:= NULL]

NOTE: Instructions to install the devel version are here


A faster option would be stri_extract_all from library(stringi), after collapsing the rows into a single string ('v2'). The alternating elements of 'v3' can then be extracted by indexing with seq to create a new data.frame.

library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])

Benchmarks

v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))

system.time({
   v2 <- paste(test$v1, collapse='')
   v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
   ind1 <- seq(1, length(v3), by=2)
   ind2 <- seq(2, length(v3), by=2)
   d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
 })
 #user  system elapsed 
 #56.019   1.709  57.838 
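
The question's str output shows Crime_Type as a factor with only 72 levels. One further possibility, which is not part of the answers above and was not timed here, is to split the levels once and look the results up through the factor's integer codes instead of running the regex on every row. A minimal sketch, assuming the column is a factor and every level is a run of non-digits followed by digits (the sample factor is illustrative):

```r
# Illustrative factor with repeated values, standing in for totallong$Crime_Type
f <- factor(c("MURD11", "NEG_M10", "MURD11", "NEG_M10", "MURD11"))

# Work on the distinct levels only
lv <- levels(f)
crime_lv <- sub("[0-9]+$", "", lv)               # crime code per level
year_lv <- as.integer(sub("^[^0-9]+", "", lv))   # year per level

# Each row's integer code indexes into the per-level results
codes <- as.integer(f)
res <- data.frame(Crime = crime_lv[codes], Year = year_lv[codes],
                  stringsAsFactors = FALSE)
print(res)
```

With this layout the regex work scales with the number of levels rather than the number of rows, and only the final indexing touches every row.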

data

totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))
