
Trouble extracting compound nouns including hyphens in NLP

Background and goal

I want to extract nouns and compound nouns (including hyphens) from each sentence, as shown below. If a compound contains a hyphen, I need to extract it with the hyphen.

{The T-shirt is old.: ['T-shirt'], 
I bought the computer and the new web-cam.: ['computer', 'web-cam'], 
I bought the computer and the new web camera.: ['computer', 'web camera']}

Problem

The current output is below. The first word of each compound noun carries the "compound" label, but I cannot yet extract what I expect.

T T PROPN NNP compound X True False
shirt shirt NOUN NN nsubj xxxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
cam cam NOUN NN conj xxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
camera camera NOUN NN conj xxxx True False

{The T-shirt is old.: ['T -', 'T', 'T -', 'shirt'], 
I bought the computer and the new web-cam.: ['web -', 'computer', 'web -', 'web', 'web -', 'cam'], 
I bought the computer and the new web camera.: ['web camera', 'computer', 'web camera', 'web', 'web camera', 'camera']}

Current code

I am using the NLP library spaCy to distinguish nouns and compound nouns. I would appreciate suggestions on how to fix the current code.

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ["The T-shirt is old.", "I bought the computer and the new web-cam.", "I bought the computer and the new web camera."]

nouns = []*len(texts)
dic = {k: v for k, v in zip(texts, nouns)}

for i in range(len(texts)):
    text = nlp(texts[i])
    words = []
    for word in text:
        if word.pos_ == 'NOUN' or word.pos_ == 'PROPN':
            print(word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
                word.shape_, word.is_alpha, word.is_stop)

            #compound words
            for j in range(len(text)):
                    token = text[j]
                    if token.dep_ == 'compound':
                        if j < len(text)-1:
                            nexttoken = text[j+1]
                            words.append(str(token.text + ' ' + nexttoken.text))


            else:
                words.append(word.text)
    dic[text] = words       
print(dic)

Environment

Python 3.7.4

spaCy 2.3.2

Please try:

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ("The T-shirt is old",
          "I bought the computer and the new web-cam",
          "I bought the computer and the new web camera",
         )
docs = nlp.pipe(texts)  

compounds = []
for doc in docs:
    compounds.append({doc.text:[doc[tok.i:tok.head.i+1] for tok in doc if tok.dep_=="compound"]})
print(compounds)
[{'The T-shirt is old': [T-shirt]}, 
{'I bought the computer and the new web-cam': [web-cam]}, 
{'I bought the computer and the new web camera': [web camera]}]

computer is missing from this list, but I don't think it qualifies as a compound.
