How to extract key phrases from a given text with OpenNLP?

I'm using Apache OpenNLP and I'd like to extract the key phrases of a given text. I'm already gathering entities, but I would like key phrases as well.

The problem I have is that I can't use TF-IDF: I don't have models for that, and I only have a single text (not multiple documents).

Here is some code (prototyped, not so clean):

    public List<KeywordsModel> extractKeywords(String text, NLPProvider pipeline) {

        SentenceDetectorME sentenceDetector = new SentenceDetectorME(pipeline.getSentencedetecto("en"));
        TokenizerME tokenizer = new TokenizerME(pipeline.getTokenizer("en"));
        POSTaggerME posTagger = new POSTaggerME(pipeline.getPosmodel("en"));
        ChunkerME chunker = new ChunkerME(pipeline.getChunker("en"));

        ArrayList<String> stopwords = pipeline.getStopwords("en");

        Span[] sentSpans = sentenceDetector.sentPosDetect(text);
        Map<String, Float> results = new LinkedHashMap<>();
        SortedMap<String, Float> sortedData = new TreeMap<>(new MapSort.FloatValueComparer(results));

        // Earlier sentences are more prominent: the first sentence scores 1.0,
        // decreasing linearly to 1/n for the last.
        float sentenceCounter = sentSpans.length;
        int sentences = sentSpans.length;
        for (Span sentSpan : sentSpans) {
            float prominenceVal = sentenceCounter / sentences;
            sentenceCounter--;
            String sentence = sentSpan.getCoveredText(text).toString();
            int start = sentSpan.getStart();
            Span[] tokSpans = tokenizer.tokenizePos(sentence);
            String[] tokens = new String[tokSpans.length];
            for (int i = 0; i < tokens.length; i++) {
                tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
            }
            String[] tags = posTagger.tag(tokens);
            Span[] chunks = chunker.chunkAsSpans(tokens, tags);
            for (Span chunk : chunks) {
                if (!"NP".equals(chunk.getType())) {
                    continue;
                }
                // Map the chunk's token offsets back to character offsets in the full text.
                int npstart = start + tokSpans[chunk.getStart()].getStart();
                int npend = start + tokSpans[chunk.getEnd() - 1].getEnd();
                String potentialKey = text.substring(npstart, npend);
                if (results.containsKey(potentialKey)) {
                    continue;
                }
                // Discard phrases of three or more words, and phrases containing a stopword.
                boolean discard = false;
                String[] pKeys = potentialKey.split("\\s+");
                if (pKeys.length < 3) {
                    outer:
                    for (String pKey : pKeys) {
                        for (String stopword : stopwords) {
                            // equalsIgnoreCase instead of matches(): the stopwords
                            // are plain words, not regular expressions.
                            if (pKey.equalsIgnoreCase(stopword)) {
                                discard = true;
                                break outer;
                            }
                        }
                    }
                } else {
                    discard = true;
                }
                if (!discard) {
                    // Score = log-dampened frequency plus the prominence of the
                    // sentence where the phrase first occurs.
                    int count = StringUtils.countMatches(text, potentialKey);
                    results.put(potentialKey, (float) (Math.log(count) / 100) + prominenceVal / 5);
                }
            }
        }
        sortedData.putAll(results);
        System.out.println(sortedData);

        // Was "return null" in the prototype; build the result list from the sorted map.
        List<KeywordsModel> keywords = new ArrayList<>();
        for (String label : sortedData.keySet()) {
            KeywordsModel keyword = new KeywordsModel();
            keyword.setLabel(label);
            keywords.add(keyword);
        }
        return keywords;
    }
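The MapSort.FloatValueComparer helper isn't shown above. A minimal sketch of what it might look like, assuming it orders the map entries by score in descending order (note the tie-breaker: a TreeMap built on a pure value comparator treats two phrases with equal scores as the same key and silently drops one):

    import java.util.Comparator;
    import java.util.Map;

    // Hypothetical sketch of MapSort.FloatValueComparer; not the original implementation.
    public class MapSort {

        public static class FloatValueComparer implements Comparator<String> {

            private final Map<String, Float> base;

            public FloatValueComparer(Map<String, Float> base) {
                this.base = base;
            }

            @Override
            public int compare(String a, String b) {
                int byScore = Float.compare(base.get(b), base.get(a)); // highest score first
                // Break ties on the key itself; returning 0 would collapse
                // distinct phrases with equal scores into one TreeMap entry.
                return byScore != 0 ? byScore : a.compareTo(b);
            }
        }
    }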

What it basically does is return the noun phrases, sorted by a score that combines prominence (how early in the text the phrase first appears) with frequency. For example, a phrase that occurs three times and first appears in the opening sentence scores log(3)/100 + 1.0/5 ≈ 0.21.

But honestly, this doesn't work so well.

I also tried it with the Lucene analyzer, but the results were not good either.

So, how can I achieve what I want to do? I already know of KEA/Maui-indexer etc. (but I'm afraid I can't use them because of the GPL :( )


Also interesting: which other algorithms can I use instead of TF-IDF?

Example:

This text: http://techcrunch.com/2015/09/04/etsys-pulling-the-plug-on-grand-st-at-the-end-of-this-month/

Good output in my opinion: Etsy, Grand St., solar chargers, maker marketplace, tech hardware

Finally, I found something:

https://github.com/srijiths/jtopia

It uses the POS taggers from OpenNLP/Stanford NLP and is licensed under the Apache License 2.0 (ASL2). I haven't measured precision and recall yet, but in my opinion it delivers great results.

Here is my code:

    Configuration.setTaggerType("openNLP");
    Configuration.setSingleStrength(6);
    Configuration.setNoLimitStrength(5);
    // if the tagger type is "openNLP", give the path to the OpenNLP POS tagger model
    //Configuration.setModelFileLocation("model/openNLP/en-pos-maxent.bin");
    // if the tagger type is "default", give the default POS lexicon file
    //Configuration.setModelFileLocation("model/default/english-lexicon.txt");
    // if the tagger type is "stanford", give the Stanford tagger model path.
    // Not needed here, since the POSModel comes from the pipeline (see below).
    Configuration.setModelFileLocation("Dont need that here");
    Configuration.setPipeline(pipeline);

    TermsExtractor termExtractor = new TermsExtractor();
    TermDocument topiaDoc = termExtractor.extractTerms(text);
    //logger.info("Extracted terms : " + topiaDoc.getExtractedTerms());

    Map<String, ArrayList<Integer>> finalFilteredTerms = topiaDoc.getFinalFilteredTerms();
    List<KeywordsModel> keywords = new ArrayList<>();
    for (Map.Entry<String, ArrayList<Integer>> e : finalFilteredTerms.entrySet()) {
        KeywordsModel keyword = new KeywordsModel();
        keyword.setLabel(e.getKey());
        keywords.add(keyword);
    }

I modified the Configuration class a bit so that the POSModel is loaded from the pipeline instance.
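A minimal sketch of what that modification might look like (this is my guess at the change, not jtopia's actual code: the pipeline field, setPipeline and getPosModelFromPipeline are assumed additions to the stock Configuration):

    import opennlp.tools.postag.POSModel;

    // Hypothetical sketch of the modified jtopia Configuration: a static hook so
    // the OpenNLP tagger can reuse an already-loaded POSModel instead of reading
    // it from modelFileLocation. Names and structure are assumptions.
    public class Configuration {

        // ... existing jtopia settings (taggerType, modelFileLocation, ...) ...

        private static NLPProvider pipeline;

        public static void setPipeline(NLPProvider p) {
            pipeline = p;
        }

        // Called from the tagger initialization instead of loading the model file.
        public static POSModel getPosModelFromPipeline(String lang) {
            return pipeline.getPosmodel(lang);
        }
    }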
