
Trouble extracting compound nouns including hyphens in NLP

Background and goal

I want to extract nouns and compound nouns (including hyphens) from each sentence, as shown below. If a compound contains a hyphen, I need to extract it with the hyphen.

{The T-shirt is old.: ['T-shirt'], 
I bought the computer and the new web-cam.: ['computer', 'web-cam'], 
I bought the computer and the new web camera.: ['computer', 'web camera']}

Problem

The current output is below. The first word of each compound noun carries the "compound" label, but I cannot yet extract what I expect.

T T PROPN NNP compound X True False
shirt shirt NOUN NN nsubj xxxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
cam cam NOUN NN conj xxx True False
computer computer NOUN NN dobj xxxx True False
web web NOUN NN compound xxx True False
camera camera NOUN NN conj xxxx True False

{The T-shirt is old.: ['T -', 'T', 'T -', 'shirt'], 
I bought the computer and the new web-cam.: ['web -', 'computer', 'web -', 'web', 'web -', 'cam'], 
I bought the computer and the new web camera.: ['web camera', 'computer', 'web camera', 'web', 'web camera', 'camera']}

Current code

I am using the NLP library spaCy to distinguish nouns and compound nouns. I would appreciate suggestions on how to fix the current code.

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ["The T-shirt is old.", "I bought the computer and the new web-cam.", "I bought the computer and the new web camera."]

nouns = []*len(texts)
dic = {k: v for k, v in zip(texts, nouns)}

for i in range(len(texts)):
    text = nlp(texts[i])
    words = []
    for word in text:
        if word.pos_ == 'NOUN' or word.pos_ == 'PROPN':
            print(word.text, word.lemma_, word.pos_, word.tag_, word.dep_,
                word.shape_, word.is_alpha, word.is_stop)

            #compound words
            for j in range(len(text)):
                    token = text[j]
                    if token.dep_ == 'compound':
                        if j < len(text)-1:
                            nexttoken = text[j+1]
                            words.append(str(token.text + ' ' + nexttoken.text))


            else:
                words.append(word.text)
    dic[text] = words       
print(dic)

Environment

Python 3.7.4

spaCy 2.3.2

Please try:

import spacy
nlp = spacy.load("en_core_web_sm")

texts =  ("The T-shirt is old",
          "I bought the computer and the new web-cam",
          "I bought the computer and the new web camera",
         )
docs = nlp.pipe(texts)  

compounds = []
for doc in docs:
    compounds.append({doc.text:[doc[tok.i:tok.head.i+1] for tok in doc if tok.dep_=="compound"]})
print(compounds)
[{'The T-shirt is old': [T-shirt]}, 
{'I bought the computer and the new web-cam': [web-cam]}, 
{'I bought the computer and the new web camera': [web camera]}]

computer is missing from this list, but I don't think it qualifies as a compound.
