简体   繁体   中英

Eliminate redundant letters in string? (e.g. gooooooooood -> good)

I'm trying to set up some sample data for a Naive Bayesian Classifier for Twitter.

One of the post-processing of the tweets I'd like to do is to remove unnecessary repeat characters.

For example, one of the tweets reads: Twizzlers. mmmmm goooooooooooood! Twizzlers. mmmmm goooooooooooood!

I'd like to reduce the number of w's down to just two. Why two? That's what the article I'm following did. Any individual word that is less than 2 characters is discarded (see mmmmm above). And as far as gooooooood, I would imagine double letters are the most common to be uber repeated.

So, that said, what's the fastest way (in terms of execution time) to reduce words such as gooooooooood to simply good?

[Edit] I'll be processing 800,000 tweets in this app, hence the requirement for fastest execution [/Edit]

[Edit2] I just ran some simple benchmarking based on elapsed time to iterate through 1000 records & save to a text file. I repeated this iteration 100 times on each method. The average results are here:

Method 1: 386 ms [LINQ - answer was deleted] Method 2: 407 ms [Regex] Method 3: 303 ms [StringBuilder] Method 4: 301 ms [StringBuilder part 2]

Method 1: LINQ (answer was apparently deleted)

static string doIt(string a)
    {
        var l = a.Select((p, i) => new { ch = p, index = i }).
            Where(p => (p.index < a.Length - 2) && (a[p.index + 1] == p.ch) && (a[p.index + 2] == p.ch))
            .Select(p => p.index).ToList();
        l.Sort();
        l.Reverse();

        l.ForEach(i => a = a.Remove(i, 1));

        return a;
    }

METHOD 2:

Regex.Replace(tweet,@"(\S)\1{2,}","$1$1");

Method 3:

static string StringB(string s)
    {
        string input = s;
        StringBuilder sb = new StringBuilder();

        for (int i = 0; i < input.Length; i++)
        {
            if (i < 2 || input[i] != input[i - 1] || input[i] != input[i - 2])
                sb.Append(input[i]);
        }
        string output = sb.ToString();
        return output;

    }

Method 4:

static string sb2(string s)
    {
        string input = s;

        var sb = new StringBuilder(input);

        char p2 = '\0';
        char p1 = '\0';

        int pos = 0, len = sb.Length;
        while (pos < len)
        {
            if (p2 == p1) for (; pos < len && (sb[pos] == p2); len--)
                    sb.Remove(pos, 1);

            if (pos < len)
            {
                p2 = p1;
                p1 = sb[pos];
                pos++;
            }
        }
        return sb.ToString();

    }

This is also (easily) doable via a regular expression:

var re = @"((.)\2)\2*";
Regex.Replace("god", re, "$1")    // god
Regex.Replace("good", re, "$1")   // good
Regex.Replace("gooood", re, "$1") // good

Is it faster than the other approaches? Well, that's for the benchmarks ;-) Regular expressions can be quite efficient in non-degenerate backtracking situations. The above may need to be altered (this will also match spaces for instance), but it's a small example.

Happy coding.

Regexen look to be the simplest. Simple proof of concept in the REPL:

using System.Text.RegularExpressions;  
var regex = new Regex(@"(\S)\1{2,}"); // or @"([aeiouy])\1{2,}" etc?
regex.Replace("mmmmm gooood griieeeeefff", "$1$1");

-->

"mm good griieeff"

For raw performance, use something more like this: see it live on https://ideone.com/uWG68

using System;
using System.Text;

class Program
{
    public static void Main(string[] args)
    {
        string input = "mmmm gooood griiiiiiiiiieeeeeeefffff";

        var sb = new StringBuilder(input);

        char p2 = '\0';
        char p1 = '\0';

        int pos = 0, len=sb.Length;
        while (pos < len)
        {
            if (p2==p1) for (; pos<len && (sb[pos]==p2); len--)
                sb.Remove(pos, 1);

            if (pos<len)
            {
                p2=p1;
                p1=sb[pos];
                pos++;
            }
        }

        Console.WriteLine(sb);
    }
}

I agree with the comments that this will not work in the general case, especially in "Twitter speak". Having said that the rules you mentioned are simple - eliminate every character that is the same as the previous two characters:

string input = "goooooooooooood";
StringBuilder sb = new StringBuilder(input.Length);
sb.Append(input.Substring(0, 2));

for (int i = 2; i < input.Length; i++)
{
    if (input[i] != input[i - 1] || input[i] != input[i - 2])
        sb.Append(input[i]);
}
string output = sb.ToString();

I would recommend looking into NLP solutions rather than C#/regex. In that world, python is preferred. See NLTK . I would recommend Nodebox Linguistics which gives you spelling corrections. You can even stem words and even go down to the infinitive.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM