R data.table: reformat sub-headers into separate column
Revision of a previous question to include edge cases.
I am trying to clean up a dataset of crime data by giving it better categorical labels. A sample of the table looks like this:
d <- as.data.table(read.csv('[filepath]'))
print(d)
Classifications ucr_ncic_code
SOVEREIGNTY NA
Treason 101
Treason Misprison 102
Espionage 103
Sovereignty 199
MILITARY (restricted to agencies) NA
Military Desertion 201
Military 299
IMMIGRATION NA
Illegal Entry 301
False Citizenship 302
Smuggling Aliens 303
Immigration 399
CRIMES AGAINST PERSON 7099
HOMICIDE NA
Homicide Family-Gun 901
Homicide Family-Weapon 902
Homicide Nonfam-Gun 903
PROPERTY CRIMES 7199
<TRUNCATED>
As you can see, in the original dataset the broader categories of crime classifications are formatted as all-caps headers, and most have an NA code (e.g. SOVEREIGNTY NA). However, some headers include non-caps characters (e.g. MILITARY (restricted to agencies)), and some headers don't have any sub-categories and therefore have a valid code (e.g. CRIMES AGAINST PERSON 7099). What I would like to do is reformat the data so that these headers are their own categorical column in the table.
Here is my initial solution, which I am almost sure is not the best approach, but it produces the desired result:
d[,row.num := .I,]
d.categs <- d[toupper(substr(Classifications,1,3))==substr(Classifications,1,3)]
#the substring is for some edge cases that I don't show here
setnames(d.categs, "Classifications", "Category")
d <- merge(d,d.categs[,row.num,list(Category)],'row.num', all.x=TRUE)
d <- d[order(row.num)]
prev.row <- NA
for (i in seq(1,d[,.N])) {
current.row <- d$Category[i]
if (is.na(current.row) & !(is.na(prev.row))){
d$Category[i] <- prev.row
}
prev.row <- d$Category[i]
}
#clean up
d <- d[!(is.na(ucr_ncic_code))]
d[,row.num := NULL,]
print(d)
Classifications ucr_ncic_code Category
Treason 101 SOVEREIGNTY
Treason Misprison 102 SOVEREIGNTY
Espionage 103 SOVEREIGNTY
Sovereignty 199 SOVEREIGNTY
Military Desertion 201 MILITARY (restricted to agencies)
Military 299 MILITARY (restricted to agencies)
Illegal Entry 301 IMMIGRATION
False Citizenship 302 IMMIGRATION
Smuggling Aliens 303 IMMIGRATION
Immigration 399 IMMIGRATION
CRIMES AGAINST PERSON 7099 CRIMES AGAINST PERSON
Homicide Family-Gun 901 HOMICIDE
Homicide Family-Weapon 902 HOMICIDE
Homicide Nonfam-Gun 903 HOMICIDE
PROPERTY CRIMES 7199 PROPERTY CRIMES
<TRUNCATED>
What would be a better way to use the data.table package to make this formatting change? I'm guessing there's a better way to copy cells down than the for-loop that I designed, but many simpler solutions are hindered by the inconsistent character formatting of the headers and by their codes, or lack thereof (see previous question).
It should only take one line:
dt[,Category := Classifications[(x=grepl("^[A-Z]{2,}", Classifications))][cumsum(x)]][]
# Classifications ucr_ncic_code Category
# 1: SOVEREIGNTY NA SOVEREIGNTY
# 2: Treason 101 SOVEREIGNTY
# 3: Treason Misprison 102 SOVEREIGNTY
# 4: Espionage 103 SOVEREIGNTY
# 5: Sovereignty 199 SOVEREIGNTY
# 6: MILITARY (restricted to agencies) NA MILITARY (restricted to agencies)
# 7: Military Desertion 201 MILITARY (restricted to agencies)
# 8: Military 299 MILITARY (restricted to agencies)
# 9: IMMIGRATION NA IMMIGRATION
# 10: Illegal Entry 301 IMMIGRATION
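For anyone who wants to run this end to end, here is a minimal sketch. The sample rows are copied from the question; building the table inline with data.table(), the helper column name is_header, and the final filter on ucr_ncic_code (borrowed from the question's own clean-up step) are my assumptions about how you would get from the one-liner to the output the question asks for.

```r
library(data.table)

# Sample rows copied from the question (the real data comes from a CSV).
dt <- data.table(
  Classifications = c("SOVEREIGNTY", "Treason", "Treason Misprison",
                      "MILITARY (restricted to agencies)", "Military Desertion",
                      "CRIMES AGAINST PERSON", "HOMICIDE", "Homicide Family-Gun"),
  ucr_ncic_code = c(NA, 101L, 102L, NA, 201L, 7099L, NA, 901L)
)

# Flag header rows: two or more capital letters at the start of the label.
# (is_header is a hypothetical helper column, not part of the original data.)
dt[, is_header := grepl("^[A-Z]{2,}", Classifications)]

# cumsum() turns the flags into a running group number; indexing the header
# labels by that number copies each header down to the rows beneath it.
dt[, Category := Classifications[is_header][cumsum(is_header)]]

# Drop header rows that carry no code, as in the question's clean-up step,
# and keep only the three columns shown in the desired output.
res <- dt[!is.na(ucr_ncic_code), .(Classifications, ucr_ncic_code, Category)]
print(res)
```

One caveat: this relies on the first row being a header. If it is not, cumsum() starts at 0, R silently drops zero indices, and the lengths no longer line up.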
Explanation
Try creating an index that marks the changing categories. We need a pattern that can identify each change, like "^[A-Z]{2,}". This is a simple regular expression that matches two or more capital letters at the start of Classifications. After identifying the heading rows, we can take the cumulative sum of that index. It sounds odd at first, but what's happening under the hood is a conversion from logical to numeric: each TRUE becomes 1 and each FALSE becomes 0. When added together, it becomes a subsettable index (i.e. 1 1 1 2 2 3 3 3...).
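To make the index concrete, here is a small base R sketch of the same steps on a plain character vector. The vector and the names labels, is_header, group_id and category are made up for illustration; only the regular expression and the cumsum() idea come from the answer.

```r
# A shortened, hypothetical version of the Classifications column.
labels <- c("SOVEREIGNTY", "Treason", "Espionage",
            "IMMIGRATION", "Illegal Entry",
            "HOMICIDE", "Homicide Family-Gun", "Homicide Nonfam-Gun")

# TRUE where the label starts with two or more capital letters.
is_header <- grepl("^[A-Z]{2,}", labels)

# TRUE counts as 1 and FALSE as 0, so the running total goes up by one
# at every header and stays flat on the rows beneath it.
group_id <- cumsum(is_header)
print(group_id)  # 1 1 1 2 2 3 3 3

# labels[is_header] holds only the headers; indexing it by group_id
# repeats each header once per row in its group.
category <- labels[is_header][group_id]
print(category)
```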
I should also mention an R trick in there: I created a new variable and used it in the same line. For example, in R you are allowed to do (x=1+1) + x to get 4.
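A quick sketch of that trick on its own, and then in the same shape the one-liner uses. The vector v and the names y, h and out are hypothetical. It works because R evaluates the parenthesised assignment on the left before it looks up the variable on the right.

```r
# The parenthesised assignment creates x and returns its value,
# so x is already defined when the right-hand operand is evaluated.
y <- (x = 1 + 1) + x
print(y)  # 4

# Same idea as in the one-liner: assign the logical vector h inside the
# first subset, then reuse it in the second subset.
v <- c("AA", "b", "CC", "d")
out <- v[(h = grepl("^[A-Z]{2,}", v))][cumsum(h)]
print(out)  # "AA" "AA" "CC" "CC"
```

It is compact, but assigning the flag to a named variable on its own line first is easier to read and does not depend on evaluation order.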