
How to select the minimum set of sentences from a corpus whose words cover the maximum number of sentences in the original corpus?

Hi everyone,

I have an "optimization" problem and I don't really know where to start. Here is a description of it:

I have a corpus with a large number of text sentences. I need to find the smallest set of sentences to record (as audio files) while maximizing the number of sentences in the original corpus that can be formed from the recorded sentences, or more exactly, from the recorded words.

A very short example of what I need to do:

Corpus:

  • black dog
  • grey cat
  • big dog
  • grey mouse
  • big mouse

Example of a minimum set of sentences that covers the maximum of the original corpus:

  • black dog
  • big mouse
  • grey cat

From the 3 sentences above (and their words) we can form the remaining sentences in the corpus. Of course, I'm looking for a computationally efficient method, because my corpus contains thousands of sentences. Do you know of any method that is suitable for this problem?
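To make the goal concrete, here is a minimal sketch of the coverage check on the toy corpus. It assumes that a sentence counts as "formable" when every one of its words appears in at least one recorded sentence, and that words can be obtained by lowercasing and splitting on whitespace. The names `words_of` and `formable` are hypothetical, chosen only for illustration.

```python
def words_of(sentence):
    # Assumption: lowercase + whitespace split is enough to get the words.
    return set(sentence.lower().split())


def formable(corpus, recorded):
    """Return the corpus sentences whose words all occur in the recorded ones."""
    recorded_words = set()
    for sentence in recorded:
        recorded_words |= words_of(sentence)
    # A sentence is formable if its word set is a subset of the recorded words.
    return [s for s in corpus if words_of(s) <= recorded_words]


corpus = ["black dog", "grey cat", "big dog", "grey mouse", "big mouse"]
recorded = ["black dog", "big mouse", "grey cat"]

covered = formable(corpus, recorded)
print(len(covered), "of", len(corpus), "sentences can be formed")
```

With the three recorded sentences from the example, all five corpus sentences pass the check, which is the situation described above.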

Thanks for your answers!

Morphid

If your corpus is as simple as the one you show, and you don't actually need to generate sentences, you can just compute the unigrams (the individual words and their counts). If it is more complex, run a form of topic modeling. Topic modeling will return groups of words that occur together across the corpus. It expects the corpus to be a set of documents; in your case each 'document' could be a sentence. A good topic modeling algorithm is "Latent Dirichlet Allocation" (LDA).
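As a rough illustration of the unigram route, the sketch below counts the words with `collections.Counter` and then picks sentences with a greedy set-cover heuristic. The greedy step is an addition for illustration and is not something this answer prescribes; it assumes whitespace tokenization, and the names `tokenize`, `unigram_counts` and `greedy_select` are hypothetical. Set cover is NP-hard in general, so the greedy result is not guaranteed to be the true minimum.

```python
from collections import Counter


def tokenize(sentence):
    # Assumption: lowercase + whitespace split is an adequate tokenizer.
    return sentence.lower().split()


def unigram_counts(corpus):
    """Count how often each word occurs across all sentences."""
    return Counter(word for sentence in corpus for word in tokenize(sentence))


def greedy_select(corpus):
    """Greedily pick sentences until every word in the corpus is covered."""
    uncovered = set(unigram_counts(corpus))
    selected = []
    while uncovered:
        # Pick the sentence that adds the most still-uncovered words;
        # ties go to the sentence that appears first in the corpus.
        best = max(corpus, key=lambda s: len(uncovered & set(tokenize(s))))
        gained = uncovered & set(tokenize(best))
        if not gained:
            break  # defensive: nothing left that any sentence can cover
        selected.append(best)
        uncovered -= gained
    return selected


corpus = ["black dog", "grey cat", "big dog", "grey mouse", "big mouse"]
counts = unigram_counts(corpus)
selected = greedy_select(corpus)
print(counts.most_common())
print(selected)
```

Covering every word guarantees that every sentence can be formed, which is stronger than what the question strictly asks for. If recording time is limited, the loop could be stopped early once enough of the vocabulary is covered. Each pass rescans the whole corpus, which should be acceptable for a few thousand sentences.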

For a technical paper on LDA, see "Latent Dirichlet Allocation".

For an article with sample Python code using the gensim library, see "Experiments on the English Wikipedia".

The following article and sample code by Jordan Barber, "Latent Dirichlet Allocation (LDA) with Python", uses NLTK to create a corpus and gensim for LDA. This code is more adaptable to other applications than the Wikipedia code.
