
How to extract all possible noun phrases from text

I want to automatically extract some desirable concepts (noun phrases) from text. My plan is to extract all noun phrases and then label them with two classes (i.e., desirable phrases and non-desirable phrases). After that, I will train a classifier to classify them. What I am trying now is to extract all possible phrases as the training set first. For example, one sentence is Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described. I want to get all phrases like shoulder, richer mix, shoulder of richer mix, junctions, junctions of columns and beams, columns and beams, columns, beams, or whatever else is possible. The desirable phrases are shoulder, junctions, and junctions of columns and beams. But I don't care about correctness at this step; I just want to get the training set first. Are there any available tools for such a task?

I tried Rake from rake_nltk, but the results failed to include my desirable phrases (i.e., it did not extract all possible phrases):

from rake_nltk import Rake
data = 'Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.'
r = Rake()
r.extract_keywords_from_text(data)
phrase = r.get_ranked_phrases()
print(phrase)

Result: ['richer mix', 'shoulder', 'required', 'junctions', 'items', 'described', 'columns', 'beams'] (Missed junctions of columns and beams here)
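If I understand correctly, RAKE builds candidate phrases by splitting the text at stopwords and punctuation, so the of inside junctions of columns and beams breaks the phrase apart. A rough sketch of a workaround, assuming rake_nltk's Rake accepts a custom stopwords argument and that NLTK's stopword corpus is downloaded, is to keep of and and out of the stopword list:

from rake_nltk import Rake
from nltk.corpus import stopwords

# Assumption: drop 'of' and 'and' from the stopword list so RAKE does not split on them
custom_stopwords = set(stopwords.words('english')) - {'of', 'and'}
r = Rake(stopwords=custom_stopwords)
r.extract_keywords_from_text(data)
print(r.get_ranked_phrases())

Even so, RAKE only returns maximal runs between the remaining stopwords, not every nested sub-phrase.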

I also tried phrasemachine, but its results also missed some of the desirable phrases:

import spacy
import phrasemachine

nlp = spacy.load('en_core_web_sm')
doc = nlp(data)
tokens = [token.text for token in doc]
pos = [token.pos_ for token in doc]
out = phrasemachine.get_phrases(tokens=tokens, postags=pos, output="token_spans")
print(out['token_spans'])
while len(out['token_spans']):
    start, end = out['token_spans'].pop()
    print(tokens[start:end])

Result:

[(2, 6), (4, 6), (14, 17)]
['junctions', 'of', 'columns']
['richer', 'mix']
['shoulder', 'of', 'richer', 'mix'] 

(Missed many noun phrases here)

You may wish to make use of the noun_chunks attribute:

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')

phrases = set() 
for nc in doc.noun_chunks:
    phrases.add(nc.text)
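    # also add the span from the chunk root's left edge to its right edge,
    # which pulls in attached prepositional phrases and conjunctions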
    phrases.add(doc[nc.root.left_edge.i:nc.root.right_edge.i+1].text)
print(phrases)
{'junctions of columns and beams', 'junctions', 'the items', 'a shoulder', 'columns', 'richer mix', 'beams', 'columns and beams', 'a shoulder of richer mix', 'these junctions'}
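If you also want candidates without their determiners (shoulder rather than a shoulder), or phrases headed by every noun in the sentence, you can additionally collect the subtree span of each noun token. A minimal sketch along the same lines (the exact candidate set will depend on the parse):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp('Where a shoulder of richer mix is required at these junctions, or at junctions of columns and beams, the items are so described.')

phrases = set()
for token in doc:
    if token.pos_ in ('NOUN', 'PROPN'):
        # the bare noun without its determiner
        phrases.add(token.text)
        # the full span of the noun's syntactic subtree
        phrases.add(doc[token.left_edge.i:token.right_edge.i + 1].text)
print(phrases)

Whichever extractor you use, you can feed the resulting candidate set to your classifier and let it decide which phrases are desirable.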
