R data.table: reformatting subheaders into a separate column

Question

Revision of a previous question to include edge cases.

I am trying to clean up a dataset of crime data by giving it better categorical labels. A sample of the table looks like this:

d <- as.data.table(read.csv('[filepath]'))
print(d)

Classifications                    ucr_ncic_code
SOVEREIGNTY                        NA
Treason                            101
Treason Misprison                  102
Espionage                          103
Sovereignty                        199
MILITARY (restricted to agencies)  NA
Military Desertion                 201
Military                           299 
IMMIGRATION                        NA
Illegal Entry                      301
False Citizenship                  302
Smuggling Aliens                   303
Immigration                        399
CRIMES AGAINST PERSON              7099
HOMICIDE                           NA
Homicide Family-Gun                901
Homicide Family-Weapon             902
Homicide Nonfam-Gun                903
PROPERTY CRIMES                    7199
<TRUNCATED>

As you can see, in the original dataset the broader categories of crime classifications are formatted as all-caps headers, and most have an NA code (e.g. SOVEREIGNTY NA). However, some headers include non-caps characters (e.g. MILITARY (restricted to agencies)), and some headers don't have any sub-categories and therefore have a valid code (e.g. CRIMES AGAINST PERSON 7099). What I would like to do is reformat the data so that these headers become their own categorical column in the table.

Here is my initial solution, which I am almost sure is not the best approach, but which produces the desired result:

d[,row.num := .I,]
d.categs <- d[toupper(substr(Classifications,1,3))==substr(Classifications,1,3)] 
#the substring is for some edge cases that I don't show here

setnames(d.categs, "Classifications", "Category")
d <- merge(d,d.categs[,row.num,list(Category)],'row.num', all.x=TRUE)
d <- d[order(row.num)]

prev.row <- NA
for (i in seq(1,d[,.N])) {
  current.row <- d$Category[i]  
  if (is.na(current.row) & !(is.na(prev.row))){
    d$Category[i] <- prev.row
  } 
  prev.row <- d$Category[i]
}

#clean up
d <- d[!(is.na(ucr_ncic_code))]
d[,row.num := NULL,]

print(d)

Classifications   ucr_ncic_code   Category
Treason                 101       SOVEREIGNTY
Treason Misprison       102       SOVEREIGNTY
Espionage               103       SOVEREIGNTY
Sovereignty             199       SOVEREIGNTY
Military Desertion      201       MILITARY (restricted to agencies)
Military                299       MILITARY (restricted to agencies)
Illegal Entry           301       IMMIGRATION
False Citizenship       302       IMMIGRATION
Smuggling Aliens        303       IMMIGRATION
Immigration             399       IMMIGRATION
CRIMES AGAINST PERSON   7099      CRIMES AGAINST PERSON
Homicide Family-Gun     901       HOMICIDE
Homicide Family-Weapon  902       HOMICIDE
Homicide Nonfam-Gun     903       HOMICIDE
PROPERTY CRIMES         7199      PROPERTY CRIMES
<TRUNCATED>

What would be a better way to use the data.table package to make this formatting change? I'm guessing there's a better way to copy cells down than the for-loop I wrote, but many simpler solutions are hindered by the inconsistent character formatting of the headers and by their codes, or lack thereof (see the previous question).

Answer 1

It should only take one line (the table is called dt here):

dt[,Category := Classifications[(x=grepl("^[A-Z]{2,}", Classifications))][cumsum(x)]][]
#                       Classifications ucr_ncic_code                          Category
#  1:                       SOVEREIGNTY            NA                       SOVEREIGNTY
#  2:                           Treason           101                       SOVEREIGNTY
#  3:                 Treason Misprison           102                       SOVEREIGNTY
#  4:                         Espionage           103                       SOVEREIGNTY
#  5:                       Sovereignty           199                       SOVEREIGNTY
#  6: MILITARY (restricted to agencies)            NA MILITARY (restricted to agencies)
#  7:                Military Desertion           201 MILITARY (restricted to agencies)
#  8:                          Military           299 MILITARY (restricted to agencies)
#  9:                       IMMIGRATION            NA                       IMMIGRATION
# 10:                     Illegal Entry           301                       IMMIGRATION
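
To get from this result to the table requested in the question, the header rows that carry only an NA code can be dropped afterwards. The following is a minimal sketch, assuming a handful of sample rows rebuilt by hand from the question (the real file path is not shown) and assuming that a header with a valid code, such as CRIMES AGAINST PERSON, should be kept. The names is_header and result are illustrative only.

```r
library(data.table)

# Sample rows copied from the question; the real data comes from a CSV file.
dt <- data.table(
  Classifications = c("SOVEREIGNTY", "Treason", "Treason Misprison",
                      "MILITARY (restricted to agencies)", "Military Desertion",
                      "CRIMES AGAINST PERSON", "HOMICIDE", "Homicide Family-Gun"),
  ucr_ncic_code = c(NA, 101L, 102L, NA, 201L, 7099L, NA, 901L)
)

# Flag header rows: two or more capital letters at the start of the label.
is_header <- grepl("^[A-Z]{2,}", dt$Classifications)

# Each row takes the label of the most recent header above it.
dt[, Category := Classifications[is_header][cumsum(is_header)]]

# Drop headers without a code; headers that have a valid code are kept.
result <- dt[!is.na(ucr_ncic_code)]
print(result)
```

Storing the logical vector in its own variable first is a readability choice; it performs the same steps as the one-liner.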

Explanation

Try creating an index that marks the changing categories. We need a pattern that can identify each change, such as "^[A-Z]{2,}". This is a simple regular expression that matches two or more capital letters at the start of Classifications. After identifying the heading rows, we can take the cumulative sum of that index. It sounds odd at first, but what's happening under the hood is a conversion from logical to numeric: each TRUE becomes 1 and each FALSE becomes 0. When added together, the result is a subsettable index (i.e. 1 1 1 2 2 3 3 3...).
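
The intermediate vectors can be inspected in base R. This is a small sketch using a few labels taken from the question's sample; the variable names labels, is_header, group_id and category are made up for illustration.

```r
# A few labels from the question's sample, in their original order.
labels <- c("SOVEREIGNTY", "Treason", "Espionage",
            "MILITARY (restricted to agencies)", "Military Desertion",
            "IMMIGRATION", "Illegal Entry")

is_header <- grepl("^[A-Z]{2,}", labels)  # TRUE only on header rows
group_id  <- cumsum(is_header)            # TRUE counts as 1, FALSE as 0

print(is_header)  # TRUE FALSE FALSE TRUE FALSE TRUE FALSE
print(group_id)   # 1 1 1 2 2 3 3

# Indexing the header labels by group_id repeats each header down its block.
category <- labels[is_header][group_id]
print(category)
```

Note that this relies on the first row being a header. If the data started with a non-header row, group_id would begin with 0, and R silently drops zero indices, so the result would be shorter than the input.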

I should also mention an R trick in there: I created a new variable and used it in the same line. For example, R lets you write (x=1+1) + x to get 4.
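
A short sketch of that inline-assignment trick on its own, using toy values that are not part of the original data (values, f, total and filled are illustrative names):

```r
# The parenthesised assignment creates x and also returns its value,
# so x is already available when the right-hand operand is evaluated.
total <- (x = 1 + 1) + x
print(total)  # 4

# The same pattern with a logical vector, mirroring the one-liner:
# f is assigned inside the first subset and reused in the second.
values <- c("A", "b", "c", "D", "e")
filled <- values[(f = c(TRUE, FALSE, FALSE, TRUE, FALSE))][cumsum(f)]
print(filled)  # "A" "A" "A" "D" "D"
```

The trick depends on the left-hand expression being evaluated before the right-hand one. Assigning the variable on a separate line gives the same result and is arguably easier to read.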

Solution 1
3 2015-11-14 20:44:49
