Split one column into two columns and retain the separator
I have a very large data frame:
'data.frame': 40525992 obs. of 14 variables:
$ INSTNM : Factor w/ 7050 levels "A W Healthcare Educators"
$ Total : Factor w/ 3212 levels "1","10","100",
$ Crime_Type : Factor w/ 72 levels "MURD11","NEG_M11",
$ Count : num 0 0 0 0 0 0 0 0 0 0 ...
The Crime_Type column contains the type of crime and the year, so "MURD11" is murder in 2011. These are college campus crime statistics my kid is analyzing for her school project; I am helping when she gets stuck.
I am currently stuck at creating a clean data file she can analyze.
Once I converted the wide file (all crime types '9' in columns) to a long file using 'gather', the file size went from 300 MB to 8 GB. The file I am working on is 8 GB. Do you think that is the problem? How do I convert it to a data.table for faster processing?
What I want to do is split this 'Crime_Type' column into two columns, 'Crime_Type' and 'Year'. The data contains both alphanumeric and numeric values. There are also some special characters, as in NEG_M, which is 'Negligent Manslaughter'.
We will replace the codes with full names later, but can someone suggest how I separate
MURD11 --> MURD and 11 (in two columns)
NEG_M10 --> NEG_M and 10 (in two columns)
etc.
I have tried using:
df <- separate(totallong, Crime_Type, into = c("Crime", "Year"), sep = "[:digit:]", extra = "merge")
df <- separate(totallong, Crime_Type, into = c("Year", "Temp"), sep = "[:alpha:]", extra = "merge")
The first one separates out the crime, as it looks for numbers. The second one does not work at all.
I also tried:
df$Crime_Type<- apply (strsplit(as.character(df$Crime_Type), split="[:digit:]"))
That does not work at all. I have gone through many posts on Stack Overflow, and that's where I got these commands, but I am now truly stuck and would appreciate your help.
Since you're using tidyr already (as evidenced by separate), try the extract function, which, given a regex, puts each captured group into a new column. The 'Crime_Type' is all the non-numeric stuff, and the 'Year' is the numeric stuff. Adjust the regex accordingly.
library(tidyr)
extract(df, 'Crime_Type', into=c('Crime', 'Year'), regex='^([^0-9]+)([0-9]+)$')
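Before running the pattern over roughly 40 million rows, it may help to check what the two capture groups pick up on a handful of values. The following is a minimal base R sketch; the vector codes and the names pattern, parts and check are made up for illustration and are not part of the original data.

```r
# Toy values standing in for a few levels of Crime_Type (assumed examples)
codes <- c("MURD11", "NEG_M10", "NEG_M11")

# Same pattern as in the extract() call:
# group 1 = leading non-digits, group 2 = trailing digits
pattern <- "^([^0-9]+)([0-9]+)$"

# regexec() + regmatches() return, per element, the full match
# followed by each capture group
m <- regmatches(codes, regexec(pattern, codes))
parts <- do.call(rbind, m)

# Column 1 is the full match, columns 2 and 3 are the groups
check <- data.frame(Crime = parts[, 2],
                    Year = as.integer(parts[, 3]),
                    stringsAsFactors = FALSE)
print(check)
```

If a code does not follow the letters-then-digits shape, regexec() returns an empty match for it, so it is worth comparing nrow(check) with length(codes) before trusting the pattern.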
In base R, one option would be to create a unique delimiter between the non-numeric and numeric parts. We can capture the non-numeric ( [^0-9]+ ) and numeric ( [0-9]+ ) characters as groups by wrapping each in parentheses ( (..) ), and in the replacement we use \\1 for the first capture group, followed by a comma ( , ) and the second group ( \\2 ). The result can be used as the input vector to read.table with sep=',' to read it as two columns.
df1 <- read.table(text=gsub('([^0-9]+)([0-9]+)', '\\1,\\2',
totallong$Crime_Type),sep=",", col.names=c('Crime', 'Year'))
df1
# Crime Year
#1 MURD 11
#2 NEG_M 11
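To make the intermediate step visible, here is a small sketch of the same idea split in two: first the delimited strings, then the parse. The names codes, delimited and parsed are illustrative only, and stringsAsFactors = FALSE is an added assumption so that Crime stays a character column on older R versions.

```r
# Toy input standing in for totallong$Crime_Type (assumed examples)
codes <- c("MURD11", "NEG_M11")

# Step 1: insert a comma between the non-digit part and the digit part
delimited <- gsub("([^0-9]+)([0-9]+)", "\\1,\\2", codes)
print(delimited)  # "MURD,11" "NEG_M,11"

# Step 2: parse the comma-delimited strings as a two-column table
parsed <- read.table(text = delimited, sep = ",",
                     col.names = c("Crime", "Year"),
                     stringsAsFactors = FALSE)
str(parsed)
```

This relies on the comma not occurring in any crime code; if it could, a different delimiter would be needed.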
If needed, we can cbind it with the original dataset:
cbind(totallong, df1)
Or in base R, we can use strsplit with split specifying the boundary between a non-number ( (?<=[^0-9]) ) and a number ( (?=[0-9]) ). Here we use lookarounds to match the boundary. The output will be a list; we can rbind the list elements with do.call(rbind, ...) and convert the result to a data.frame:
as.data.frame(do.call(rbind, strsplit(as.character(totallong$Crime_Type),
split="(?<=[^0-9])(?=[0-9])", perl=TRUE)))
# V1 V2
#1 MURD 11
#2 NEG_M 11
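The result above has the default column names V1 and V2 and keeps the year as text. A possible follow-up, sketched here with illustrative names (codes, pieces, mat, res), is to name the columns and convert the year to integer:

```r
# Toy input standing in for totallong$Crime_Type (assumed examples)
codes <- c("MURD11", "NEG_M11")

# Split at the zero-width boundary between a non-digit and a digit
pieces <- strsplit(codes, "(?<=[^0-9])(?=[0-9])", perl = TRUE)

# Stack the list elements into a two-column character matrix
mat <- do.call(rbind, pieces)

# Name the columns and store the year as an integer
res <- data.frame(Crime = mat[, 1],
                  Year = as.integer(mat[, 2]),
                  stringsAsFactors = FALSE)
print(res)
```

This assumes every code splits into exactly two pieces; rbind recycles shorter elements, so it would not flag codes that fail to split.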
Another option is tstrsplit from the devel version of data.table, i.e. v1.9.5. Here also, we use the same regex. In addition, there is an option to convert the output columns into a different class.
library(data.table)#v1.9.5+
setDT(totallong)[, c('Crime', 'Year') := tstrsplit(Crime_Type,
"(?<=[^0-9])(?=[0-9])", perl=TRUE, type.convert=TRUE)]
# Crime_Type Crime Year
#1: MURD11 MURD 11
#2: NEG_M11 NEG_M 11
If we don't need the 'Crime_Type' column in the output, it can be assigned to NULL:
totallong[, Crime_Type:= NULL]
NOTE: Instructions to install the devel version are here
Or a faster option would be stri_extract_all from library(stringi) after collapsing the rows into a single string ('v2'). The alternating elements of 'v3' can then be extracted by indexing with seq to create a new data.frame:
library(stringi)
v2 <- paste(totallong$Crime_Type, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
Benchmark, on simulated data with the same number of rows as the original dataset:
v1 <- do.call(paste, c(expand.grid(c('MURD', 'NEG_M'), 11:15), sep=''))
set.seed(24)
test <- data.frame(v1= sample(v1, 40525992, replace=TRUE ))
system.time({
v2 <- paste(test$v1, collapse='')
v3 <- stri_extract_all(v2, regex='\\d+|\\D+')[[1]]
ind1 <- seq(1, length(v3), by=2)
ind2 <- seq(2, length(v3), by=2)
d1 <- data.frame(Crime=v3[ind1], Year= v3[ind2])
})
#user system elapsed
#56.019 1.709 57.838
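The timing above is for processing every row as a string. The str() output in the question shows Crime_Type stored as a factor with 72 levels, so one alternative, not taken from the answers above and offered only as a sketch, is to split the levels once and map the results back through the integer codes. The four-row totallong below is a made-up stand-in, and crime_lev, year_lev and codes are illustrative names.

```r
# Made-up stand-in for the real data; Crime_Type is a factor as in the question
totallong <- data.frame(
  Crime_Type = factor(c("MURD11", "NEG_M11", "MURD11", "NEG_M10"))
)

# Work on the distinct levels only (72 in the real data)
lev <- levels(totallong$Crime_Type)
crime_lev <- sub("[0-9]+$", "", lev)              # strip trailing digits
year_lev  <- as.integer(sub("^[^0-9]+", "", lev)) # strip leading non-digits

# Map each row to its level via the factor's integer codes
codes <- as.integer(totallong$Crime_Type)
totallong$Crime <- crime_lev[codes]
totallong$Year  <- year_lev[codes]
print(totallong)
```

With this approach the regex work scales with the number of levels rather than the number of rows; only the final indexing step touches all rows.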
Data used in the examples:
totallong <- data.frame(Crime_Type= c('MURD11', 'NEG_M11'))