Regular Expression to extract words, names, hashtags, and phrases from tweets

I'm working with Twitter feeds and want to pick out the words, names, hashtags, and phrases in each tweet.

I'm assuming names are several words together that start with capital letters, hashtags are # followed by everything but spaces, phrases are things within quotes, and words are words.

It would be nice to pull out any links too, but that is not necessary.

I would like to use Regex, but if there is a better solution, I would like to know.

An example Twitter post:

You know you watch a lot of Wes Anderson films when you see his new trailer and think, "Wait, where's the Futura font?" #MoviesILike http://bit.ly/HklUk

should be split into: Wes Anderson; Wait, where's the Futura font?; #MoviesILike; and each of the individual words.

The Regex I'm playing with right now is:

Regex _wordRegex = new Regex(@"(?:\""(?<Item>.*?)\"")|(?<Item>(?:[A-Z][a-z]*?[.\s])+)|(?<Item>#\S+)|(?<Item>\w+)");

I've dealt with my fair share of Twitter data. I've found that the best approach is to tokenize the message string by whitespace, then analyze each token. This works pretty well. Let's look at an example:

@bobjones let's go watch the game at @hooters #nfl #broncos #tebow

For the @ and the # tokens, you just have to check the first character. For URLs, you might want to do something with regex there. So basically:

if token[0] == '@' then mention
else if token[0] == '#' then hashtag
else if token looks like a url then url
else word

No need to complicate things with regex in this case, in my opinion. Especially since you are looking to extract different types of things from the same string.
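As a rough C# sketch of the token check described above (not the answerer's actual code): the names TweetTokenizer, TokenKind, Classify, and LooksLikeUrl are made up for illustration, and "looks like a url" is assumed here to mean that Uri.TryCreate accepts the token as an absolute http or https URI. Real tweets may need a looser check.

```csharp
using System;
using System.Collections.Generic;

public enum TokenKind { Mention, Hashtag, Url, Word }

public static class TweetTokenizer
{
    // Assumption: a token "looks like a URL" if it parses as an absolute
    // http or https URI. Bare domains such as "bit.ly/x" are NOT caught.
    private static bool LooksLikeUrl(string token)
    {
        return Uri.TryCreate(token, UriKind.Absolute, out Uri uri)
            && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps);
    }

    // Mirrors the if/else chain: check the first character, then the URL test.
    // Expects a non-empty token.
    public static TokenKind Classify(string token)
    {
        if (token[0] == '@') return TokenKind.Mention;
        if (token[0] == '#') return TokenKind.Hashtag;
        if (LooksLikeUrl(token)) return TokenKind.Url;
        return TokenKind.Word;
    }

    public static List<(string Text, TokenKind Kind)> Tokenize(string tweet)
    {
        var result = new List<(string Text, TokenKind Kind)>();
        // A null separator array splits on any whitespace; RemoveEmptyEntries
        // guarantees that every token has at least one character.
        foreach (string token in tweet.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
        {
            result.Add((token, Classify(token)));
        }
        return result;
    }
}
```

Calling TweetTokenizer.Tokenize on the example tweet above should tag @bobjones and @hooters as mentions, the three # tokens as hashtags, and everything else as words. Punctuation stuck to a token is left in place by this sketch.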

You mention things within quotes... you might want to handle that as a corner case in the tokenization.
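One possible way to handle the quote corner case, sketched in C#: split on whitespace as before, but keep everything between a pair of straight double quotes together as a single token. QuoteAwareTokenizer and Flush are hypothetical names, and the sketch assumes plain " characters only (curly quotes are not handled) and does not mark which tokens came from quotes.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class QuoteAwareTokenizer
{
    // Splits on whitespace, except that text between a pair of straight double
    // quotes is kept together as one token (returned without the quotes).
    public static List<string> Tokenize(string tweet)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        foreach (char c in tweet)
        {
            if (c == '"')
            {
                // A quote always ends the token in progress, then toggles the mode.
                Flush(tokens, current);
                inQuotes = !inQuotes;
            }
            else if (char.IsWhiteSpace(c) && !inQuotes)
            {
                Flush(tokens, current);
            }
            else
            {
                current.Append(c);
            }
        }

        Flush(tokens, current); // Last token, or the tail of an unterminated quote.
        return tokens;
    }

    // Adds the pending token (if any) to the list and resets the buffer.
    private static void Flush(List<string> tokens, StringBuilder current)
    {
        if (current.Length > 0)
        {
            tokens.Add(current.ToString());
            current.Clear();
        }
    }
}
```

With this, the quoted sentence from the example post comes back as one token, and the remaining tokens can go through the same per-token checks as before.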

I found that the approach in the answer above (tokenizing the string by whitespace and checking each token for a leading #) is only accurate if there is no punctuation or other stray character right up against the hashtag. For example, "I like #programming" is tokenized correctly, but "I like #programming, right?" produces an incorrectly identified hashtag: "#programming," (with the trailing comma).

There are a couple of ways of dealing with this problem. I suggest an iterative approach that looks at each character in turn. It may be slower, but it is more accurate.

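using System.Collections.Generic;
using System.Text;

string raw = "hello this is #Totally #Awesome, right? #yeah!";
List<string> hashtags = new List<string>();
StringBuilder sb = new StringBuilder();
bool track = false;  // true while we are inside a hashtag

foreach (char c in raw.ToLowerInvariant())
{
    if (track && char.IsLetterOrDigit(c))
    {
        sb.Append(c);  // still inside the hashtag
        continue;
    }

    if (track && sb.Length > 0)
    {
        hashtags.Add(sb.ToString());  // any other character ends the hashtag
    }

    sb.Clear();
    track = (c == '#');  // a '#' starts a new (so far empty) hashtag
}

if (track && sb.Length > 0)
{
    hashtags.Add(sb.ToString());  // Make sure to grab the last one!
}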

This strips the hash symbol (which is good, so you don't end up with ####### or something) and lowercases each tag, so you should get:

totally, awesome, yeah
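For comparison, a regular expression is probably one of the other ways of dealing with the punctuation problem. This is only a sketch, not part of the loop above: HashtagRegex and Extract are made-up names, and the pattern #(\w+) is an assumption about what counts as a hashtag. Note that \w also matches underscores, which char.IsLetterOrDigit does not.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HashtagRegex
{
    // '#' followed by one or more word characters (letters, digits, underscore).
    // The capture group leaves out the '#' itself.
    private static readonly Regex Pattern = new Regex(@"#(\w+)");

    public static List<string> Extract(string raw)
    {
        var hashtags = new List<string>();
        foreach (Match m in Pattern.Matches(raw))
        {
            // Lowercase to match the behaviour of the character loop.
            hashtags.Add(m.Groups[1].Value.ToLowerInvariant());
        }
        return hashtags;
    }
}
```

On the same input string this should also give totally, awesome, yeah, and a run like ####### gives nothing because \w+ needs at least one word character.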


 