
Automated tagging/text mining in Excel

I have a monthly Excel spreadsheet with the following layout:

Category  Description
A         free text in paragraph form
B         free text in paragraph form
C         free text in paragraph form
B         free text in paragraph form
B         free text in paragraph form
A         free text in paragraph form

I would like to add a third column of tags or keywords. It should search the free text against a predetermined list and be pre-populated with whichever of those terms are found there.

For example, the list of tags could be price, distance, availability, location, and so on, with the Keywords or Tags column populated from the free text in the second column, as below:

Category  Description                                              Keywords or Tags
A         Really doesn't like the price and location is too far    price, location
B         The distance is an issue and not too much availability   Distance, availability
C         Location is close so I like the convenience              location, convenience
B         The distance is near and there is a lot of availability  availability, distance

As shown above, the tags would be separated by commas.

The issue is that the list of predetermined keywords is large (around 20 to 30 tags).

My questions:

What would be the most efficient way to create this list without removing any tags?

Also, is there a way to do this in RStudio?

We can use regular expressions here to extract the keywords from the strings.

If we put the keywords in a vector keywords, we can use str_extract_all from the stringr package to extract all matching words in the string. I've made it into a simple function, which we apply to the Description column of your data.frame, inserting the results into a new variable Keys:

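library(stringr)

# Example tags taken from the question; extend to the full list of 20 to 30
keywords <- c("price", "distance", "availability", "location", "convenience")

get_tags <- function(str, tags) {
    res <- str_extract_all(str,
                           regex(tags, ignore_case = TRUE), # Search case insensitive
                           simplify = TRUE)                 # Matrix, one row per tag
    if (ncol(res) == 0) return(character(0))  # No tag matched at all
    res <- res[, 1]                           # First match per tag, as a vector
    res[nchar(res) > 0]  # Drop empty strings from non-matched keywords
}

# df is the data.frame holding the spreadsheet's Category and Description columns
df$Keys <- sapply(df$Description,
                  function(x) paste0(get_tags(x, keywords),
                                     collapse = ", ")) # Collapse matches w/ commas

df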

  Category                                             Description                   Keys
1        A   Really doesn't like the price and location is too far        price, location
2        B  The distance is an issue and not too much availability distance, availability
3        C             Location is close so I like the convenience  Location, convenience
4        D The distance is near and there is a lot of availability distance, availability

Since you want the matches to be case-insensitive, wrapping the regex pattern (tags) in the regex function lets us specify that it should ignore case.
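If installing stringr is not an option, roughly the same tagging can be sketched in base R. This is only an illustration under a few assumptions: the data frame below is reconstructed from the question's example, tag_text() is a hypothetical helper name, and the tags are assumed to be plain words with no regex metacharacters. Unlike the function above, it returns each tag as spelled in the keyword list rather than as it appears in the text.

```r
# Example data reconstructed from the question (assumed layout)
df <- data.frame(
  Category = c("A", "B", "C", "B"),
  Description = c(
    "Really doesn't like the price and location is too far",
    "The distance is an issue and not too much availability",
    "Location is close so I like the convenience",
    "The distance is near and there is a lot of availability"
  ),
  stringsAsFactors = FALSE
)

# Example tags; the real list would hold all 20 to 30 of them
keywords <- c("price", "distance", "availability", "location", "convenience")

# tag_text() is a hypothetical helper: it tests every tag against one text
# and returns the matching tags joined by commas ("" when nothing matches)
tag_text <- function(text, tags) {
  hits <- vapply(tags, function(tag) {
    # \\b restricts the match to whole words; ignore.case handles "Location"
    grepl(paste0("\\b", tag, "\\b"), text, ignore.case = TRUE, perl = TRUE)
  }, logical(1))
  paste(tags[hits], collapse = ", ")
}

df$Keys <- vapply(df$Description, tag_text, character(1),
                  tags = keywords, USE.NAMES = FALSE)

print(df$Keys)
```

The word boundaries mean "price" will not tag "prices" or "overpriced"; drop the \\b anchors if substring matches are wanted instead.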

A simple solution, which uses Excel formulas and avoids any external dependencies:

  • Use the =SEARCH() function in Excel to find tags, populating a column for each keyword or tag
  • Use =TEXTJOIN() to aggregate all tags

Example:

   A         B                                                      C                               D                                         E         F         G
1  Category  Description                                            Tags (all)                      price                                     location  Distance  availability
2  A         Really doesn't like the price and location is too far  '=TEXTJOIN(", ", TRUE, D2:XX2)  '=IF(ISERROR(SEARCH(D$1,$B2)), "", D$1)  ------    ------    ---->

(The arrow indicates that the formula in D2 is filled right across the remaining tag columns.)

Output:

   A         B                                                      C                D      E         F         G
1  Category  Description                                            Tags (all)       price  location  Distance  availability
2  A         Really doesn't like the price and location is too far  price, location  price  location
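For anyone taking the RStudio route from the question, the same one-column-per-tag layout can be mimicked in R. The sketch below is an illustration, not a port of the formulas: the descriptions and tags are copied from the example above, the variable names hits and tags_all are made up, and lower-casing both sides with fixed = TRUE only approximates SEARCH(), which is case-insensitive but also accepts wildcards.

```r
descriptions <- c(
  "Really doesn't like the price and location is too far",
  "The distance is an issue and not too much availability"
)
tags <- c("price", "location", "Distance", "availability")

# One logical column per tag, like columns D:G in the sheet.
# Lower-casing both sides makes the substring test case-insensitive.
hits <- vapply(tags, function(tag) {
  grepl(tolower(tag), tolower(descriptions), fixed = TRUE)
}, logical(length(descriptions)))

# Force a matrix even when there is only one description
hits <- matrix(hits, nrow = length(descriptions), dimnames = list(NULL, tags))

# Join the matched tag names per row, like TEXTJOIN(", ", TRUE, ...)
tags_all <- apply(hits, 1, function(row) paste(tags[row], collapse = ", "))

print(tags_all)
```

Keeping the intermediate hits matrix around is handy for counting how often each tag occurs in a month, for example with colSums(hits).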
