Treat two consecutive words as one in word frequency

Question

I have this sentence:

Sentence
    Who the president of Kuala Lumpur is?

I am trying to extract all the words (tokenization):

import nltk
import pandas as pd

low_case = df['Sentence'].str.lower().str.cat(sep=' ')
words = nltk.tokenize.word_tokenize(low_case)
word_dist = nltk.FreqDist(words)

example = pd.DataFrame(word_dist.most_common(1000),
                       columns=['Word', 'Freq'])

However, I would like to extract Kuala Lumpur as a bi-gram, so I am considering a filter that says: "if two consecutive words start with capital letters, extract them as a single word". So if I have this sentence:

    Who the president of Kuala Lumpur is?

I would have (using the code above):

Word            Freq
who               1
the               1
president         1
of                1
Kuala             1
Lumpur            1
is                1

but I would like to have this:

Word            Freq
who               1
the               1
president         1
of                1
Kuala Lumpur      1
is                1

I think that to find two consecutive capitalized words I should apply the following pattern:

pattern = r"[A-Z]{2}-\d{3}-[A-Z]{2}"

or also:

re.findall(r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)', ' '.join(df.Sentence.tolist()))

But I do not know how to include this information in my code above.

Answer 1

You could do some pre-processing and separate the bi-grams from the rest of the sentence using re match objects. For example:

import re

# a run of two or more consecutive capitalized words
pattern = r'([A-Z][\w-]*(?:\s+[A-Z][\w-]*)+)'

# initialize sentence text
sentence_without_bigrams = 'Who the president of Kuala Lumpur or Other Place is?'
bigrams = []

# loop until there are no remaining bi-grams
while True:
    # find the next bi-gram
    match = re.search(pattern, sentence_without_bigrams)
    if match is None:
        break
    # add bi-gram to list of bi-grams
    bigrams.append(match.group(0))
    # remove the bi-gram and the space before it (if any) from the sentence;
    # max() guards against a negative slice index when the match starts at 0
    sentence_without_bigrams = (sentence_without_bigrams[:max(match.start() - 1, 0)]
                                + sentence_without_bigrams[match.end():])

print(bigrams)
# ['Kuala Lumpur', 'Other Place']

print(sentence_without_bigrams)
# Who the president of or is?
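To connect this with the frequency table from the question, one possible sketch is to tokenize so that runs of capitalized words stay together and then count the tokens. Note that this swaps the loop for a single regex alternation, that the helper name `tokenize_keep_capitalized_runs` is made up for illustration, and that `collections.Counter` stands in for `nltk.FreqDist` (itself a `Counter` subclass) so the example runs without NLTK data. The text must not be lowercased before tokenizing, otherwise the capital letters the pattern relies on are gone:

```python
import re
from collections import Counter

# Try a run of two or more capitalized words first, otherwise a single word.
TOKEN_RE = re.compile(r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+|\w[\w-]*")

def tokenize_keep_capitalized_runs(text):
    """Hypothetical helper: keep capitalized runs as one token, lowercase the rest."""
    tokens = []
    for tok in TOKEN_RE.findall(text):
        parts = tok.split()
        if len(parts) > 1:
            # multi-word run: keep the original casing, normalize whitespace
            tokens.append(" ".join(parts))
        else:
            tokens.append(tok.lower())
    return tokens

sentence = "Who the president of Kuala Lumpur is?"
tokens = tokenize_keep_capitalized_runs(sentence)
word_dist = Counter(tokens)
print(word_dist.most_common(1000))
```

With a DataFrame, the same helper could be applied to each value of `df['Sentence']` separately, which also avoids merging capitalized words across sentence boundaries.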

However, this solution falls short of your ultimate goal, since a sentence like 'Hello, Mr President Obama' would not be captured correctly (as noted here).
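One way to see the limitation (this is a reading of the pattern's behaviour, not a reproduction of the linked note): the pattern matches any run of capitalized words, so a title followed by a name comes back as a single three-word unit rather than a bi-gram.

```python
import re

# Same pattern as above: a run of two or more capitalized words.
pattern = r"[A-Z][\w-]*(?:\s+[A-Z][\w-]*)+"

matches = re.findall(pattern, "Hello, Mr President Obama")
print(matches)
# ['Mr President Obama']
```

'Hello' is left out only because the comma separates it from 'Mr'; without the comma it would be merged into the same match.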

Answer 1
0 2020-10-29 01:56:59
