简体   繁体   中英

Automated tagging/text mining in excel

I have a monthly excel spreadsheet with the following:

Category Description
A free text in paragraph form
B free text in paragraph form
C free text in paragraph form
B free text in paragraph form
B free text in paragraph form
A free text in paragraph form

I would like to add a third column that adds tags or keywords from a predetermined list that searches the free text and then pre-populates it based on whether one or more of the terms is found there or not.

So for example a list of tags could be price, distance, availability, location, and so on with the Keywords or Tags column populated based on the free text in the second column as below

Category Description Keywords or Tags
A Really doesn't like the price and location is too far price, location
B The distance is an issue and not too much availability Distance, availability
C Location is close so I like the convenience location, convenience
B The distance is near and there is a lot of availability availability, distance

As shown above, the tags would be separated by commas.

The issue is that the list of predetermined keywords is large (around 20 to 30 tags).

My Questions:

What would be the most efficient way to create this list without removing any tags?

Also, is there a way to do this in RStudio?

We can use regular expressions here to extract the keywords from the strings.

If we put the keywords in a vector keywords , we can use the str_extract_all from the stringr package to extract all matching words in the string. I've made it into a simple function which we apply to the Description column of your data.frame, inserting the results into a new variable Keys

library(stringr)

get_tags <- function(str, tags) {
    res = str_extract_all(str,
                          regex(tags, ignore_case = T), # Search case insensitive
                          simplify = T)[,1] # Get result as vector, not matrix
    return(res[nchar(res) > 0])  # Drop empty strings from non-matched keywords
}

df$Keys <- sapply(df$Description,
                  function(x) paste0(get_tags(x, keywords),
                                     collapse=', ')) # Collapse matches w/ commas

df

  Category                                             Description                   Keys
1        A   Really doesn't like the price and location is too far        price, location
2        B  The distance is an issue and not too much availability distance, availability
3        C             Location is close so I like the convenience  Location, convenience
4        D The distance is near and there is a lot of availability distance, availability

Since you want the matches to be case insensitive, putting the regex pattern ( tags ) in the regex function allows us to specify that it should ignore case.

A simple solution, which uses Excel formulas and avoids any external dependencies:

  • use the =SEARCH() function in excel to find tags, populating a column for each keyword or tag
  • Use =TEXTJOIN() to aggregate all tags

Example:

A B C D E F G
1 Category Description Tags (all) price location Distance availability
2 A Really doesn't like the price and location is too far '=TEXTJOIN(", ", TRUE, D2:XX2) '=IF(ISERROR(SEARCH(D$1,$B2)), "", D$1) ------ ------ ---->

Output:

A B C D E F G
1 Category Description Tags (all) price location Distance availability
2 A Really doesn't like the price and location is too far price, location price location

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM