
NLTK FreqDist counting two words as one

I am having some trouble with NLTK's FreqDist. Let me give you some context first:

  • I have built a web crawler that crawls webpages of companies selling wearable products (smartwatches etc.).
  • I am then doing some linguistic analysis, and for that analysis I am also using some NLTK functions - in this case FreqDist.
  • nltk.FreqDist works fine in general - it does the job and does it well; I don't get any errors etc.

My only problem is that the word "heart rate" comes up often and because I am generating a list of the most frequently used words, I get heart and rate separately to the tune of a few hundred occurrences each.

Now of course rate and heart can both occur without being part of "heart rate", but how do I count the occurrences of "heart rate" instead of just the two words separately, and in an accurate way? I don't want to subtract one count from the other in my current Counters, or anything like that.

Thank you in advance!

One way to accomplish this is by pre-processing your text before you pass it to FreqDist. This could be done before or after you call word_tokenize (assuming that's the only other step in your pipeline; otherwise it depends on what the other steps are doing).

You also have to decide if you want to distinguish between occurrences of "heart rate" and "heartrate", or treat them both as the same "word". If you want to distinguish them (and again, if it won't mess up later steps), you could call it something like heart_rate. This keeps it as one "word", but distinct from "heartrate".

I'll use this as an example sentence:

original = "A heart rate monitor measures your heartrate."

To do this before tokenization, you could do a simple replace:

def preprocess(text):
    return text.replace("heart rate", "heart_rate")

txt = preprocess(original)
tokens = nltk.word_tokenize(txt)
nltk.FreqDist(tokens).tabulate()

This results in:

monitor       your          .   measures  heartrate heart_rate          A
      1          1          1          1          1          1          1

If you wanted to treat them the same, you'd just change it to text.replace("heart rate", "heartrate"). This would result in:

heartrate   monitor      your         .  measures         A
        2         1         1         1         1         1

If you want to process after tokenization, it is a little more complicated since you now have a list of tokens to loop through. Here's an example:

def process_tokens(tokens):
    deleted = 0
    for i in range(len(tokens)):
        i = i - deleted  # shift the index to account for tokens removed so far
        # the bounds check guards against an IndexError when "heart" is the last token
        if tokens[i] == "heart" and i + 1 < len(tokens):
            if tokens[i + 1] == "rate":
                tokens[i] = "heart_rate"
                del tokens[i + 1]
                deleted += 1  # keep track so the shifted index stays in range
    return tokens

When this finds a "heart" token, it checks if the next one is "rate", and if so merges the two. Again, you can change it from heart_rate to heartrate if you wish. This function would be used like:

tokens = nltk.word_tokenize(original)
nltk.FreqDist(process_tokens(tokens)).tabulate()

Giving the same results as the first.
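As a side note, if you only want to know how often "heart rate" occurs as a pair, without modifying the token list at all, you can count bigrams alongside the unigram counts. This is a minimal sketch; the token list here is hypothetical, standing in for whatever word_tokenize returns in your pipeline:

```python
import nltk

# Hypothetical token list, standing in for nltk.word_tokenize(original)
tokens = ["A", "heart", "rate", "monitor", "measures", "your", "heartrate", "."]

# Count every adjacent pair of tokens; the ("heart", "rate") entry tells you
# how often the two words occur together, leaving the unigram counts untouched.
bigram_freq = nltk.FreqDist(nltk.bigrams(tokens))
print(bigram_freq[("heart", "rate")])  # 1
```

This won't merge the tokens for you, but it gives an accurate pair count that you can report next to the single-word frequencies.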

This is a well-known problem in NLP, and it is usually discussed under the heading of tokenization. I can think of two possible solutions:

  • try different NLTK tokenizers (e.g. the Twitter tokenizer), which may be able to cover all of your cases
  • run Named Entity Recognition (NER) on your sentences. This lets you recognise entities present in the text, and could work here because a NER model may recognise "heart rate" as a single entity, and therefore as a single token.
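In the same spirit as trying other tokenizers, NLTK also ships an MWETokenizer that merges a given list of multi-word expressions in an already-tokenized text. A minimal sketch (using a plain whitespace split here, rather than word_tokenize, purely to keep the example self-contained):

```python
from nltk.tokenize import MWETokenizer

# Register ("heart", "rate") as a multi-word expression to be merged
tokenizer = MWETokenizer([("heart", "rate")], separator="_")

# MWETokenizer operates on a list of tokens, not a raw string
tokens = tokenizer.tokenize("A heart rate monitor measures your heartrate .".split())
print(tokens)  # ['A', 'heart_rate', 'monitor', 'measures', 'your', 'heartrate', '.']
```

This achieves the same merging as the hand-written loop in the other answer, and scales more easily if you later need to merge additional phrases.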
