Automated tagging/text mining in Excel
I have a monthly Excel spreadsheet with the following:
| Category | Description |
|---|---|
| A | free text in paragraph form |
| B | free text in paragraph form |
| C | free text in paragraph form |
| B | free text in paragraph form |
| B | free text in paragraph form |
| A | free text in paragraph form |
I would like to add a third column of tags or keywords drawn from a predetermined list: the free text is searched for each term, and the column is pre-populated with whichever terms are found there.
For example, the list of tags could be price, distance, availability, location, and so on, with the Keywords or Tags column populated from the free text in the second column, as below:
| Category | Description | Keywords or Tags |
|---|---|---|
| A | Really doesn't like the price and location is too far | price, location |
| B | The distance is an issue and not too much availability | Distance, availability |
| C | Location is close so I like the convenience | location, convenience |
| B | The distance is near and there is a lot of availability | availability, distance |
As shown above, the tags would be separated by commas.
The issue is that the list of predetermined keywords is large (around 20 to 30 tags).
My Questions:
What would be the most efficient way to create this list without removing any tags?
Also, is there a way to do this in RStudio?
We can use regular expressions here to extract the keywords from the strings. If we put the keywords in a vector `keywords`, we can use `str_extract_all` from the `stringr` package to extract all matching words in the string. I've made it into a simple function which we apply to the `Description` column of your data.frame, inserting the results into a new variable `Keys`:
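    library(stringr)

    # Example keyword list (an assumption; substitute your own 20 to 30 tags)
    keywords <- c("price", "distance", "availability", "location", "convenience")

    get_tags <- function(str, tags) {
      res <- str_extract_all(str,
                             regex(tags, ignore_case = TRUE),  # Search case insensitive
                             simplify = TRUE)                  # Matrix with one row per keyword
      if (ncol(res) == 0) return(character(0))  # Guard: no keyword matched at all
      res <- res[, 1]                           # First match per keyword, as a vector
      res[nchar(res) > 0]                       # Drop empty strings from non-matched keywords
    }

    df$Keys <- sapply(df$Description,
                      function(x) paste0(get_tags(x, keywords),
                                         collapse = ", "))  # Collapse matches w/ commas
    df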
      Category Description                                             Keys
    1 A        Really doesn't like the price and location is too far   price, location
    2 B        The distance is an issue and not too much availability  distance, availability
    3 C        Location is close so I like the convenience             Location, convenience
    4 D        The distance is near and there is a lot of availability distance, availability
Since you want the matches to be case insensitive, wrapping the regex pattern (`tags`) in the `regex` function allows us to specify that it should ignore case.
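For readers without `stringr`, here is a minimal base-R sketch of the same idea. It is a variation, not the answer's method: the sample `df`, the `keywords` vector, and the helper name `tag_text` are illustrative assumptions, and tags are assumed to be plain words with no regex metacharacters. It differs in two ways: it matches whole words only (so "priceless" would not be tagged `price`), and it returns the tag as spelled in the list rather than the matched text.

```r
# Hypothetical sample data mirroring the question's example table
df <- data.frame(
  Category = c("A", "B", "C", "B"),
  Description = c(
    "Really doesn't like the price and location is too far",
    "The distance is an issue and not too much availability",
    "Location is close so I like the convenience",
    "The distance is near and there is a lot of availability"
  ),
  stringsAsFactors = FALSE
)

# Assumed tag list; replace with the real 20 to 30 tags
keywords <- c("price", "distance", "availability", "location", "convenience")

# tag_text is a hypothetical helper: returns the tags found in one text,
# in list order, joined by ", " (empty string if none match)
tag_text <- function(text, tags) {
  hit <- vapply(
    tags,
    # \\b anchors the tag at word boundaries; matching ignores case
    function(tag) grepl(paste0("\\b", tag, "\\b"), text,
                        ignore.case = TRUE, perl = TRUE),
    logical(1)
  )
  paste(tags[hit], collapse = ", ")
}

df$Keys <- vapply(df$Description, tag_text, character(1),
                  tags = keywords, USE.NAMES = FALSE)
df
```

Because the tags come back in the order of `keywords`, the output is consistent from row to row regardless of where each word appears in the description.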
A simple solution, which uses Excel formulas and avoids any external dependencies:
- Use the `=SEARCH()` function in Excel to find tags, populating a column for each keyword or tag
- Use `=TEXTJOIN()` to aggregate all tags

Example:
| | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 1 | Category | Description | Tags (all) | price | location | Distance | availability |
| 2 | A | Really doesn't like the price and location is too far | `=TEXTJOIN(", ", TRUE, D2:XX2)` | `=IF(ISERROR(SEARCH(D$1,$B2)), "", D$1)` | ------ | ------ | ----> |

The arrows indicate that the formula in D2 is filled across the remaining keyword columns; `XX` in the `TEXTJOIN` range stands for the last keyword column.
Output:
| | A | B | C | D | E | F | G |
|---|---|---|---|---|---|---|---|
| 1 | Category | Description | Tags (all) | price | location | Distance | availability |
| 2 | A | Really doesn't like the price and location is too far | price, location | price | location | | |
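Since the question also asks about R, the helper-column layout above can be mirrored there as a rough sketch. The names `descriptions`, `tags`, `helper`, and `tags_all` are illustrative, and the second description is borrowed from the question's example. Like `SEARCH`, the match is a case-insensitive substring test; unlike `SEARCH`, this sketch treats the tag literally and does not support wildcards.

```r
descriptions <- c(
  "Really doesn't like the price and location is too far",
  "Location is close so I like the convenience"
)
tags <- c("price", "location", "Distance", "availability")

# One helper column per tag: the tag itself if found, "" otherwise,
# analogous to IF(ISERROR(SEARCH(D$1, $B2)), "", D$1)
helper <- vapply(tags, function(tag) {
  found <- grepl(tolower(tag), tolower(descriptions), fixed = TRUE)
  ifelse(found, tag, "")
}, character(length(descriptions)))

# Keep a matrix shape even when there is only one description
helper <- matrix(helper, nrow = length(descriptions),
                 dimnames = list(NULL, tags))

# Join the non-empty cells of each row, analogous to TEXTJOIN(", ", TRUE, ...)
tags_all <- apply(helper, 1, function(row) paste(row[row != ""], collapse = ", "))
tags_all
```

Keeping the intermediate `helper` matrix makes it easy to check which tag fired for which row, which is the main appeal of the one-column-per-tag layout.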