
More efficient way to reduce a string to just its words

I am making an application where I will be fetching tweets and storing them in a database. I will have a column for the complete text of the tweet and another where only the words of the tweet will remain (I need the words to calculate which words were most used later).

I currently do this with six different .replaceAll() calls, some of which may run more than once. For example, I use a for loop to remove every hashtag with replaceAll().

The problem is that I will be processing thousands of tweets fetched every few minutes, and I don't think this approach will be efficient enough.

My requirements, in this order (also written in the comments below):

  1. Delete all usernames mentioned
  2. Delete all RT (retweets flags)
  3. Delete all hashtags mentioned
  4. Replace all break lines with spaces
  5. Replace all double spaces with single spaces
  6. Delete all special characters except spaces

Here is a Short and Compilable Example:

public class StringTest {

    public static void main(String args[]) {

        String text = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
                + " at iHeart Awards\n"
                + "\n"
                + "RT!!\n"
                + "\n"
                + "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards"
                + " htt…";

        String[] hashtags = {"#FanArmy", "#LittleMonsters", "#iHeartAwards"};
        System.out.println("Before: " + text + "\n");

        // Delete all usernames mentioned (may run multiple times)
        text = text.replaceAll("@AshStewart09", "");
        System.out.println("First Phase: " + text + "\n");

        // Delete all RT (retweets flags)
        text = text.replaceAll("RT", "");
        System.out.println("Second Phase: " + text + "\n");

        // Delete all hashtags mentioned
        for (String hashtag : hashtags) {
            text = text.replaceAll(hashtag, "");
        }
        System.out.println("Third Phase: " + text + "\n");

        // Replace all break lines with spaces
        text = text.replaceAll("\n", " ");
        System.out.println("Fourth Phase: " + text + "\n");

        // Replace all double spaces with single spaces
        text = text.replaceAll(" +", " ");
        System.out.println("Fifth Phase: " + text + "\n");

        // Delete all special characters except spaces 
        text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();
        System.out.println("Finally: " + text);
    }
}

Relying on replaceAll is probably the biggest performance killer as it compiles the regex again and again. The use of regexes for everything is probably the second most significant problem.

Assuming all usernames start with @, I'd replace

// Delete all usernames mentioned (may run multiple times)
text = text.replaceAll("@AshStewart09", "");

by a loop that copies everything until it finds an @, then checks whether the following characters match any of the listed usernames and, if so, skips them. For this lookup you could use a trie. A simpler method would be a replaceAll-like loop for the regex @\\w+ together with a hash-based lookup (HashSet or HashMap).
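A minimal sketch of the simpler method, assuming mentions always look like @ followed by word characters. The class name MentionStripper, the method stripKnownMentions, and the knownUsers set are hypothetical names chosen for illustration; only mentions found in the set are removed, everything else is copied through unchanged.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MentionStripper {
    // Compiled once and reused for every tweet.
    private static final Pattern MENTION = Pattern.compile("@\\w+");

    // Removes only those @mentions whose name is in knownUsers (hypothetical set).
    static String stripKnownMentions(String text, Set<String> knownUsers) {
        Matcher m = MENTION.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group().substring(1); // drop the leading '@'
            String replacement = knownUsers.contains(name) ? "" : m.group();
            // quoteReplacement keeps '$' and '\' in the text from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }
}
```

If every mention should go regardless of who is mentioned, the set lookup is unnecessary and a single precompiled @\\w+ replacement is enough.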

// Delete all RT (retweets flags)
text = text.replaceAll("RT", "");

Here,

private static final Pattern RT_PATTERN = Pattern.compile("RT");

is a sure win. All the following parts could be handled similarly. Instead of

// Delete all special characters except spaces 
text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();

you could use Guava's CharMatcher. The method removeFrom does exactly what you did, but collapseFrom or trimAndCollapseFrom might be better.
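As a rough illustration of handling the remaining steps "similarly", here is a sketch that precompiles each regex from the question once and reuses it. It uses only the standard library (no Guava), keeps the question's step order, and the class and field names other than RT_PATTERN are made up for this example.

```java
import java.util.regex.Pattern;

final class PrecompiledSteps {
    // Each pattern is compiled once, instead of on every replaceAll call.
    private static final Pattern RT_PATTERN = Pattern.compile("RT");
    private static final Pattern NEWLINE = Pattern.compile("\n");
    private static final Pattern SPACES = Pattern.compile(" +");
    private static final Pattern SPECIAL = Pattern.compile("[^a-zA-Z0-9 ]+");

    // Applies steps 2, 4, 5 and 6 of the question in the original order.
    static String clean(String text) {
        text = RT_PATTERN.matcher(text).replaceAll("");
        text = NEWLINE.matcher(text).replaceAll(" ");
        text = SPACES.matcher(text).replaceAll(" ");
        return SPECIAL.matcher(text).replaceAll("").trim();
    }
}
```

Note that the plain "RT" pattern, as in the question, also matches those two letters inside longer uppercase words.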

According to the now closed question, it all boils down to

tweet = tweet.replaceAll("@\\w+|#\\w+|\\bRT\\b", "")
                .replaceAll("\n", " ")
                .replaceAll("[^\\p{L}\\p{N} ]+", " ")
                .replaceAll(" +", " ")
                .trim();

The second line seems to be redundant, as the third one removes \n too. Changing the first line's replacement to " " doesn't change the outcome and allows the replacements to be aggregated:

tweet = tweet.replaceAll("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+", " ")
                .replaceAll(" +", " ")
                .trim();

I've changed the usernames and hashtags parts to also consume a lone # or @, so that the special-characters part (which now excludes both) doesn't need to consume them. This is necessary for correct processing of strings like !@AshStewart09.

For maximum performance, you need a precompiled pattern. I'd again suggest Guava's CharMatcher for the second part. Guava is large (about 2 MB, I guess), but you will surely find more useful things there. So in the end you get
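To see why this matters, the sketch below compares the combined pattern with a naive merge that leaves @ and # inside the special-characters class. The NAIVE pattern and the class name LoneSymbolDemo are hypothetical, constructed only for this comparison.

```java
import java.util.regex.Pattern;

final class LoneSymbolDemo {
    // Naive merge (hypothetical): the special-characters class still matches '@' and '#'.
    static final Pattern NAIVE =
        Pattern.compile("@\\w+|#\\w+|\\bRT\\b|[^\\p{L}\\p{N} ]+");
    // Combined pattern from the answer: '@' and '#' are excluded from the class.
    static final Pattern FIXED =
        Pattern.compile("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+");
    private static final Pattern SPACES = Pattern.compile(" +");

    // Replaces every match with a space, then collapses runs of spaces and trims.
    static String clean(Pattern p, String tweet) {
        String spaced = p.matcher(tweet).replaceAll(" ");
        return SPACES.matcher(spaced).replaceAll(" ").trim();
    }
}
```

With the input "!@AshStewart09 hi", the naive pattern matches "!@" as one run of special characters, so the username survives and the result is "AshStewart09 hi". The fixed pattern matches "!" and "@AshStewart09" separately and yields "hi".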

private static final Pattern PATTERN =
    Pattern.compile("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+");
// CharMatcher.is and trimAndCollapseFrom take a char, not a String
private static final CharMatcher CHAR_MATCHER = CharMatcher.is(' ');

tweet = PATTERN.matcher(tweet).replaceAll(" ");
tweet = CHAR_MATCHER.trimAndCollapseFrom(tweet, ' ');
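If adding Guava is not an option, the same two steps can be sketched with the standard library alone. This is an assumption-laden stand-in rather than the code above: a second precompiled pattern plus trim() replaces the CharMatcher step, and the class name TweetCleaner and the SPACES field are invented for the example. The sample input is the tweet from the question.

```java
import java.util.regex.Pattern;

final class TweetCleaner {
    // Same combined pattern as above, compiled once.
    private static final Pattern PATTERN =
        Pattern.compile("@\\w*|#\\w*|\\bRT\\b|[^@#\\p{L}\\p{N} ]+");
    // Standard-library stand-in for the CharMatcher step (an assumption, not Guava).
    private static final Pattern SPACES = Pattern.compile(" +");

    static String clean(String tweet) {
        // Step 1: turn mentions, hashtags, RT flags and special characters into spaces.
        String spaced = PATTERN.matcher(tweet).replaceAll(" ");
        // Step 2: collapse runs of spaces and trim both ends.
        return SPACES.matcher(spaced).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String tweet = "RT @AshStewart09: Vote for Lady Gaga for \"Best Fans\""
                + " at iHeart Awards\n\nRT!!\n\n"
                + "My vote for #FanArmy goes to #LittleMonsters #iHeartAwards htt\u2026";
        System.out.println(clean(tweet));
    }
}
```

For the sample tweet this prints "Vote for Lady Gaga for Best Fans at iHeart Awards My vote for goes to htt".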

You can fold everything that is replaced with nothing into one replaceAll call, and everything that is replaced with a space into another, like so (also using a regex to find the hashtags and usernames, as this seems easier):

text = text.replaceAll("@\\w+|#\\w+|RT", "");
text = text.replaceAll("\n| +", " ");
text = text.replaceAll("[^a-zA-Z0-9 ]+", "").trim();

