Filter out substrings in a string vector

Question

I have a string vector like this:

"I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples." , "Mangoes and Apples and Honey"

I want a new string vector that drops every element that is a complete substring of any other element of the input vector. That is, the result would be:

"Apples are good for health" , "I live in America" , "I love Mangoes and Apples and Strawberries." , "Mangoes and Apples and Honey"

The order doesn't matter. Here, the first two entries were removed because they are substrings of the third-to-last entry. The second-to-last entry was removed because it is also a substring of other entries.

Any help would be appreciated. This is part of the phrase detection I am doing on a corpus.

Answer 1

You can use grepl with word boundaries to match each element as an exact phrase against every element of the vector. The elements with more than one match (one = themselves) are the ones to drop, i.e.

R - Solution

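# x is the input character vector from the question
# each column holds the matches of one element (used as the pattern) against all of x
v1 = colSums(sapply(x, function(i) grepl(paste0('\\b', i, '\\b'), x))) <= 1
names(v1)[v1]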
#[1] "Apples are good for health"  "I live in America" "I love Mangoes and Apples and Strawberries."
#[4] "Mangoes and Apples and Honey"
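
To illustrate why the word boundaries matter, here is a minimal Python sketch of a whole-phrase containment test. The helper name `contains_phrase` is hypothetical and not part of the answer above. It assumes the phrase should be matched literally, so it escapes the phrase with `re.escape` and uses lookarounds instead of `\b`, which also behaves sensibly when the phrase ends in punctuation.

```python
import re

def contains_phrase(needle, haystack):
    """Return True if `needle` occurs in `haystack` as a whole phrase."""
    # re.escape makes characters such as "." match literally.
    # The lookarounds require that no word character touches either
    # end of the match, so "Apple" does not match inside "Apples".
    pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
    return re.search(pattern, haystack) is not None

print(contains_phrase("Mangoes and Apples", "I love Mangoes and Apples and Honey"))  # True
print(contains_phrase("Mangoes and Apple", "I love Mangoes and Apples and Honey"))   # False
print(contains_phrase("I love Mangoes.", "I love Mangoes and Apples."))              # False
```

The last call shows that with literal matching the trailing period counts: "I love Mangoes." is not literally part of the longer sentence.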

Python - Solution

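import re
from itertools import compress

# x is the input list from the question
x = ["I love Mangoes.", "I love Mangoes and Apples.", "Apples are good for health",
     "I live in America", "I love Mangoes and Apples and Strawberries.",
     "Mangoes and Apples.", "Mangoes and Apples and Honey"]

# Keep an element only if it matches exactly one element: itself.
# Note: each element is used as a regex, so "." matches any character.
v2 = []
for i in x:
    i1 = sum(re.search(i, a) is not None for a in x) == 1
    v2.append(i1)

list(compress(x, v2))
#['Apples are good for health', 'I live in America', 'I love Mangoes and Apples and Strawberries.', 'Mangoes and Apples and Honey']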
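
If the strings may contain other regex metacharacters, a literal comparison is safer. The following is a rough sketch under one explicit assumption: trailing periods are ignored when comparing, because otherwise "I love Mangoes." is not literally contained in any other element. The function name `drop_contained` and its `strip_chars` parameter are made up for this example.

```python
def drop_contained(strings, strip_chars="."):
    """Drop every string that occurs inside another one (literal matching)."""
    # Assumption: trailing periods are ignored for the comparison, so that
    # "I love Mangoes." counts as part of "I love Mangoes and Apples."
    keys = [s.rstrip(strip_chars) for s in strings]
    kept = []
    for i, key in enumerate(keys):
        contained = any(
            # A longer string containing the key always counts; for exact
            # duplicates only an earlier copy counts, so the first one is kept.
            key in other and (len(other) > len(key) or j < i)
            for j, other in enumerate(keys)
            if j != i
        )
        if not contained:
            kept.append(strings[i])
    return kept

x = ["I love Mangoes.", "I love Mangoes and Apples.", "Apples are good for health",
     "I live in America", "I love Mangoes and Apples and Strawberries.",
     "Mangoes and Apples.", "Mangoes and Apples and Honey"]
print(drop_contained(x))
```

Unlike the regex version, this keeps the input order and treats every character in the strings literally.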

Answer 2

You could sort the vector by string length and then check whether each string is contained in any of the longer ones:

vec <- c("I love Mangoes." , "I love Mangoes and Apples." , "Apples are good for health" , 
         "I live in America" , "I love Mangoes and Apples and Strawberries." , 
         "Mangoes and Apples." , "Mangoes and Apples and Honey")

vec <- vec[order(nchar(vec))] #sort by string length

vec[!c(sapply(2:length(vec), #iterate from shortest to longest
              function(i) any(grepl(vec[i-1], vec[i:length(vec)]))), #check whether shorter is included in any longer
       FALSE)] #add value for final (longest) entry

[1] "I live in America"                           "Apples are good for health"                 
[3] "Mangoes and Apples and Honey"                "I love Mangoes and Apples and Strawberries."
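
The same sort-by-length idea can be sketched in Python. This is not a translation of the R code above: it sorts longest first, uses the literal `in` operator instead of a regex, and compares each string only with the strings already kept. That is enough because containment is transitive. The function name `filter_by_length` is hypothetical, and the data is the period-free variant so that literal matching gives the expected result.

```python
def filter_by_length(strings):
    """Keep only strings that are not contained in a longer (or equal) kept string."""
    kept = []
    # Longest first: a string can only be contained in one that is at least
    # as long, and all of those have been processed before it.
    for s in sorted(strings, key=len, reverse=True):
        if not any(s in longer for longer in kept):
            kept.append(s)
    return kept

s = ["I love Mangoes", "I love Mangoes and Apples", "Apples are good for health",
     "I live in America", "I love Mangoes and Apples and Strawberries",
     "Mangoes and Apples", "Mangoes and Apples and Honey"]
print(filter_by_length(s))
```

The result comes back ordered from longest to shortest, which is fine here since the question states that the order doesn't matter.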

Answer 3

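We can also use combn to enumerate all pairs of strings, and then use grepl on each pair to remove strings that are matched in other strings. Note that combn lists each pair only once, in input order, so this checks whether the earlier string of a pair is contained in the later one, not the other way round.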

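# s is the character vector defined under "Sample data" below
df <- as.data.frame(combn(s, 2))  # one column per pair of strings
# collect the first string of every pair where it is found in the second
rmv <- unique(unname(unlist(df[1, sapply(df, function(x) grepl(x[1], x[2]))])))
s[!(s %in% rmv)]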
#[1] "Apples are good for health"
#[2] "I live in America"
#[3] "I love Mangoes and Apples and Strawberries"
#[4] "Mangoes and Apples and Honey"

Sample data

s <- c(
    "I love Mangoes" ,
    "I love Mangoes and Apples" ,
    "Apples are good for health" ,
    "I live in America" ,
    "I love Mangoes and Apples and Strawberries" ,
    "Mangoes and Apples" ,
    "Mangoes and Apples and Honey")

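For comparison, a pairwise check that does not depend on the input order could look like the following Python sketch. It uses `itertools.permutations` so that both orders of every pair are tested, and the literal `in` operator rather than a regex. The function name `filter_pairwise` is an assumption for this example, and exact duplicates are collapsed to one copy first.

```python
from itertools import permutations

def filter_pairwise(strings):
    """Remove every string that is contained in a different string."""
    # Collapse exact duplicates while keeping the first-seen order.
    unique = list(dict.fromkeys(strings))
    # permutations yields (a, b) and (b, a), so the input order is irrelevant.
    contained = {a for a, b in permutations(unique, 2) if a in b}
    return [s for s in unique if s not in contained]

# Same sample data as above, deliberately listed with the longer strings first.
shuffled = ["Mangoes and Apples and Honey", "Mangoes and Apples",
            "I love Mangoes and Apples and Strawberries", "I live in America",
            "Apples are good for health", "I love Mangoes and Apples",
            "I love Mangoes"]
print(filter_pairwise(shuffled))
```

This does roughly twice as many comparisons as enumerating each pair once, which should not matter for short vectors.
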
3 answers

solution1
2 2018-04-18 13:09:37

solution2
1 2018-04-18 13:10:12

solution3
1 2018-04-18 13:15:26
