How to count words in a data frame based on a list of dictionary terms in R

Question

I'm trying to perform a dictionary count of terms in a factor of strings.

I have a factor called Names. An example is below:

[1] GP - Hyperion Planning Upgrade
[2] Application Support Renewal
[3] Oracle EBS upgrade 11.5 to 12.1
[4] Bluemix Services
[5] Cognos 11 Upgrade

I also have a list of dictionary terms called terms:

[1] "IBM"     "Oracle"     "SQL Server"     "Cognos"     "Azure"

What I need from R is a data frame built from the list of terms, with a total count of each term across the Names factor. Example:

         term       count
1        IBM         3
2        Oracle      6
3        SQL Server  0
4        Cognos      2
5        Azure       9

Of note: a term can appear multiple times in a single name. In that case it counts only once.

Does anyone have an example of this that I could work from? Thanks.

Answer 1

You could try this (changing the vector Names a little and assuming that you want case-insensitive matches):

# input
Names <- as.character(Names)
Names
#[1] "IBM GP - Hyperion IBM Planning Upgrade IBM"
#[2] "Application Support Renewal"
#[3] "Oracle EBS upgrade 11.5 to 12.1"
#[4] "Bluemix Services IBM"
#[5] "Cognos 11 Upgrade"

terms <- c("IBM",     "Oracle",     "SQL Server",     "Cognos",     "Azure")
vgrepl <- Vectorize(grepl, 'pattern', SIMPLIFY = TRUE)
df <- +(vgrepl(tolower(terms), tolower(Names))) # case insensitive

df
#     ibm oracle sql server cognos azure
#[1,]   1      0          0      0     0
#[2,]   0      0          0      0     0
#[3,]   0      1          0      0     0
#[4,]   1      0          0      0     0
#[5,]   0      0          0      1     0

colSums(df)
#    ibm     oracle sql server     cognos      azure 
#     2          1          0          1          0 

data.frame(count=colSums(df))
#           count
#ibm            2
#oracle         1
#sql server     0
#cognos         1
#azure          0

[EDIT2]

df <- data.frame(count=colSums(df))
df <- cbind.data.frame(terms=rownames(df), df)
df
#                terms count
#ibm               ibm     2
#oracle         oracle     1
#sql server sql server     0
#cognos         cognos     1
#azure           azure     0
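
If the terms are literal strings rather than regular expressions, the same count can be sketched with fixed = TRUE, which also keeps the original spelling of each term in the result. This is only a minimal sketch and not part of the answer above: the helper name count_terms is hypothetical, and it assumes that a name counts at most once per term and that substring matches are acceptable (so "IBM" would also match inside a longer word).

```r
# Hypothetical helper (not from the original answer): for each term,
# count how many names contain it, at most once per name.
count_terms <- function(names, terms) {
  lower_names <- tolower(as.character(names))
  counts <- vapply(
    terms,
    # grepl() gives one TRUE/FALSE per name; sum() counts the TRUEs.
    # fixed = TRUE treats the term as a literal string, not a regex.
    function(term) sum(grepl(tolower(term), lower_names, fixed = TRUE)),
    integer(1)
  )
  data.frame(term = terms, count = unname(counts), stringsAsFactors = FALSE)
}

# Same modified example data as in the answer above
Names <- factor(c(
  "IBM GP - Hyperion IBM Planning Upgrade IBM",
  "Application Support Renewal",
  "Oracle EBS upgrade 11.5 to 12.1",
  "Bluemix Services IBM",
  "Cognos 11 Upgrade"
))
terms <- c("IBM", "Oracle", "SQL Server", "Cognos", "Azure")

result <- count_terms(Names, terms)
result
```

Because grepl() only reports whether a name matches at all, the three occurrences of IBM in the first name contribute a single count.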

Answer 2

Here's an example that avoids regex in favor of match:

names <- c(
"GP - Hyperion Planning Upgrade",
"Application Support Renewal",
"Oracle EBS upgrade 11.5 to 12.1",
"Bluemix Services",
"Cognos 11 Upgrade")

terms <- tolower(c("IBM", "Oracle", "SQL Server", "Cognos", "Azure"))

## Split your names on whitespace and match each token to the terms
counts <- lapply(strsplit(tolower(names), "\\s+"), match, terms, 0)

## index the terms using the match indices and table it
table(terms[unlist(counts)])

cognos oracle 
     1      1
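
Note that the table above leaves out terms with a zero count, and a term repeated within one name would be counted once per occurrence. A possible variation, sketched here under the assumption that the question's "counts as once" rule and the zero rows are wanted, drops duplicate matches per name and fixes the factor levels to the full term list. Multi-word terms such as "sql server" still cannot match, because the names are split on whitespace.

```r
names <- c(
  "GP - Hyperion Planning Upgrade",
  "Application Support Renewal",
  "Oracle EBS upgrade 11.5 to 12.1",
  "Bluemix Services",
  "Cognos 11 Upgrade")

terms <- tolower(c("IBM", "Oracle", "SQL Server", "Cognos", "Azure"))

## Match tokens per name; unique() makes a repeated term count once per name
hits <- lapply(
  strsplit(tolower(names), "\\s+"),
  function(tokens) unique(match(tokens, terms, nomatch = 0))
)

## Index 0 (no match) is dropped when subsetting; fixing the factor
## levels keeps terms that never matched, with a count of zero
counts <- table(factor(terms[unlist(hits)], levels = terms))

result <- data.frame(term = terms, count = as.vector(counts))
result
```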
