Remove list of words from string

I have a list of words that I want to remove from a string. I use the following method:

string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam";

string[] BAD_WORDS = {
    "720p", "web-dl", "hevc", "x265", "Rmteam", "."
};

var cleaned = string.Join(" ", stringToClean.Split(' ').Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));

But it is not working, and the following text is output:

The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam

For this it would be a good idea to create a reusable method that splits a string into words. I'll do this as an extension method of string. If you are not familiar with extension methods, read Extension Methods Demystified.

public static IEnumerable<string> ToWords(this string text)
{
    // TODO implement
}

Usage will be as follows:

string text = "This is some wild text!";
List<string> words = text.ToWords().ToList();
var first3Words = text.ToWords().Take(3);
var lastWord = text.ToWords().LastOrDefault();

Once you've got this method, the solution to your problem will be easy:

IEnumerable<string> badWords = ...
string inputText = ...
IEnumerable<string> validWords = inputText.ToWords().Except(badWords);

Or maybe you want to use Except(badWords, StringComparer.OrdinalIgnoreCase);
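One thing to keep in mind: Except is a set operation, so besides removing the bad words it also drops repeated words from the input. If repeated words should survive, a Where filter against a HashSet keeps them. A minimal sketch, assuming the words come from a plain Split on dots; the sample text and variable names are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

string[] badWords = { "720p", "x265" };
string[] words = "Flash.720p.Flash.x265".Split('.');

// Except is a set difference: "Flash" appears only once in the result.
string viaExcept = string.Join(" ", words.Except(badWords, StringComparer.OrdinalIgnoreCase));

// Where + HashSet keeps every occurrence of the words that survive.
var badSet = new HashSet<string>(badWords, StringComparer.OrdinalIgnoreCase);
string viaWhere = string.Join(" ", words.Where(w => !badSet.Contains(w)));

Console.WriteLine(viaExcept); // Flash
Console.WriteLine(viaWhere);  // Flash Flash
```

Which of the two is right depends on whether the cleaned text is meant to be a set of words or the original text minus the bad words.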

The implementation of ToWords depends on what you would call a word: everything delimited by a dot? Or do you want to support whitespace, or maybe even newlines?

The implementation for your problem: A word is any sequence of characters delimited by a dot.

public static IEnumerable<string> ToWords(this string text)
{
    // find the next dot:
    const char dot = '.';
    int startIndex = 0;
    int dotIndex = text.IndexOf(dot, startIndex);
    while (dotIndex != -1)
    {
        // found a Dot, return the substring until the dot:
        int wordLength = dotIndex - startIndex;
        yield return text.Substring(startIndex, wordLength);

        // find the next dot      
        startIndex = dotIndex + 1;
        dotIndex = text.IndexOf(dot, startIndex);
    }

    // read until the end of the text. Return everything after the last dot:
    yield return text.Substring(startIndex);
}

TODO:

  • Decide what you want to return if text starts with a dot ".ABC.DEF".
  • Decide what you want to return if the text ends with a dot: "ABC.DEF."
  • Check if the return value is what you want if text is empty.
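One possible answer to the three points above, if you decide that empty words are never wanted: let Split drop them. This is only a sketch under that assumption, written as a plain local function rather than an extension method to keep it short; other choices (for example returning an empty string for a leading dot) are just as valid.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical variant of ToWords, written as a local function for brevity.
// Assumption: empty words (from leading, trailing, or doubled dots) are dropped.
static IEnumerable<string> ToWords(string text)
{
    if (string.IsNullOrEmpty(text))
        return Enumerable.Empty<string>();

    return text.Split(new[] { '.' }, StringSplitOptions.RemoveEmptyEntries);
}

Console.WriteLine(string.Join("|", ToWords(".ABC.DEF"))); // ABC|DEF
Console.WriteLine(string.Join("|", ToWords("ABC.DEF."))); // ABC|DEF
Console.WriteLine(ToWords("").Count());                   // 0
```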

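Your Split/Join don't match up with your input: the string is delimited by dots, but you split on a space, so the whole string comes back as a single "word" that matches nothing in BAD_WORDS.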
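For reference, this is roughly what the original code looks like once it splits on the delimiter the input actually uses. A minimal sketch, assuming you still want the result joined with spaces; I dropped "." from BAD_WORDS because the dot is now the separator and never shows up as a word:

```csharp
using System;
using System.Linq;

string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam";
string[] BAD_WORDS = { "720p", "web-dl", "hevc", "x265", "Rmteam" };

// Split on the dot, drop the bad words (ignoring case), and re-join with spaces.
string cleaned = string.Join(" ",
    stringToClean.Split('.')
                 .Where(w => !BAD_WORDS.Contains(w, StringComparer.OrdinalIgnoreCase)));

Console.WriteLine(cleaned); // The Flash 2014 S07E06
```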

That said, here's a quick one-liner:

string clean = BAD_WORDS.Aggregate(stringToClean, (acc, word) => acc.Replace(word, string.Empty));

This is basically a "reduce". It is not fantastically performant, but over strings that are known to be decently small I'd consider it acceptable. If you have to use a really large string or a really large number of "words" you might look at another option. Note that string.Replace is case-sensitive, so for the example you've given us the entries in BAD_WORDS have to match the casing of the input ("WEB-DL" rather than "web-dl").

Edit: The downside of this approach is that you'll get partial matches. For example, your token array contains "720p", but the code I suggested here will also match inside "720px". There are ways around that: instead of using string's implementation of Replace you could use a regex that also matches your delimiters, something like Regex.Replace(acc, $"[. ]{word}([. ])", "$1") (regex not confirmed, but it should be close; I added a capture for the trailing delimiter in order to put it back for the next pass).
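A variation on that regex idea, as a sketch rather than a tested drop-in: instead of one pass per word, build a single alternation from all the words, escape them, ignore case, and only match when the word is surrounded by delimiters or the ends of the string. The ".720px" suffix on the input and the final Trim are my additions for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string stringToClean = "The.Flash.2014.S07E06.720p.WEB-DL.HEVC.x265.RMTeam.720px";
string[] badWords = { "720p", "web-dl", "hevc", "x265", "Rmteam" };

// Escape each word so regex metacharacters in it are matched literally.
string alternation = string.Join("|", badWords.Select(Regex.Escape));

// Match a bad word together with the delimiter in front of it (or the start
// of the string), but only when a delimiter or the end of the string follows.
string pattern = $@"(?:^|[. ])(?:{alternation})(?=[. ]|$)";

string cleaned = Regex.Replace(stringToClean, pattern, string.Empty, RegexOptions.IgnoreCase)
                      .Trim('.', ' '); // tidy up if the first word was removed

Console.WriteLine(cleaned); // The.Flash.2014.S07E06.720px
```

The lookahead leaves the trailing delimiter in place, so consecutive bad words are still removed one after another, and "720px" survives because nothing in the list matches it as a whole word.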


 