
Word list generation (sorting, optimization)

A little background:

I'm constructing lists of words for a psychology experiment. We're trying to create a chain of words such that words adjacent in the list are related, but all other words in the list are not related. So for example:

SCHOOL, CAFETERIA, PIZZA, CRUST, EARTH, OCEAN, WHALE, ...

So here we see the first word is related to the second, and the second is related to the third, but the third isn't related to the first. (And the first isn't related to the fourth, fifth, sixth, ... either)

What I have so far...

I have a list of 1600 words such that each number from 0 to 1599 corresponds to a word. I also have a very large matrix (1600 x 1600) that tells me, on a scale of 0 to 1, how related each word is to every other word. (These values come from a latent semantic analysis; see http://lsa.colorado.edu/ )

I can make the lists, but it's not very efficient at all, and my adjacent words aren't super strongly related to each other.

Here's my basic algorithm:

  • Set thresholds for minimum value for how related the adjacent words must be and for how unrelated the non-adjacent words must be.
  • Create a list of the indices 0 to 1599. Shuffle that list. The first item of the list will be our first word.
  • Loop through our words, checking one by one whether each word meets our thresholds: it must be related strongly enough to the last word added, unrelated to every other word already in the list, and not already in the list itself. If it meets the criteria, add it to the list. If we loop through all words with no success, dump the list and start all over.
  • Continue this until the list has as many words as I want (ideally, 16).

Does anyone have a better approach? The problem with my approach is that I'll sometimes settle for an okay match that meets my criteria when a better match is potentially still out there. Also, it would be nice if I didn't have to dump the whole list but could backtrack a few steps to where the list potentially went wrong.

Loop through our words... If it meets the criteria, add it to the list.

This seems to be the problem. You are stopping at the first match, not the best match. Using your 1600 x 1600 matrix of relatedness values, you can simply take the index of the maximum relatedness value among the remaining words (those that still pass your unrelatedness check), then go to the word list and add the corresponding word to the chain.
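
A minimal sketch of the "best match, not first match" idea, assuming the relatedness matrix is a symmetric NumPy array. The names `rel`, `chain`, `min_adjacent` and `max_other` are hypothetical and stand in for the matrix, the word indices chosen so far, and the two thresholds from the question.

```python
import numpy as np

def extend_with_best_match(rel, chain, min_adjacent, max_other):
    """Return the index of the best next word, or None if no word qualifies."""
    last = chain[-1]
    scores = rel[last].astype(float)      # astype returns a copy, so rel is untouched
    scores[chain] = -np.inf               # words already in the chain are not candidates
    if len(chain) > 1:
        earlier = chain[:-1]
        # Rule out any word that is too related to an earlier (non-adjacent) word.
        too_related = (rel[earlier] > max_other).any(axis=0)
        scores[too_related] = -np.inf
    best = int(np.argmax(scores))         # strongest remaining match to the last word
    if scores[best] < min_adjacent:
        return None                       # even the best candidate is too weak
    return best
```

Calling this repeatedly and appending the result grows the chain greedily. It can still hit a dead end (a `None` result), so it does not remove the need to restart or backtrack.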

Try NLTK, the Natural Language Toolkit, from www.nltk.org. It may already have something similar to what you are looking for.

This might be a good candidate for a genetic algorithm. You can create a large number of completely random possibilities, score each one with an objective function, and then iterate the population by crossing over mates based on fitness (possibly throwing some mutations in as well).

If done properly, this should give you a large-ish population of good solutions. If the population is large enough, the fitness function is defined well enough, and mutation is sufficient to get you out of any valleys you might otherwise get stuck in, you might even converge overwhelmingly on the optimal answer.
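
A rough sketch of what such a genetic algorithm could look like, not a tuned implementation. Everything here is an assumption for illustration: the fitness function (weakest adjacent link minus strongest non-adjacent link), the one-point crossover with duplicate repair, the mutation rate, and all names (`rel`, `fitness`, `crossover`, `mutate`, `evolve`). It expects a chain length of at least 2 and a population of at least 4.

```python
import random
import numpy as np

def fitness(rel, chain):
    """Higher is better: reward related neighbours, penalise related non-neighbours."""
    idx = np.asarray(chain)
    adjacent = rel[idx[:-1], idx[1:]]
    sub = rel[np.ix_(idx, idx)]
    # Pairs at least two positions apart in the chain are the non-adjacent pairs.
    non_adjacent = sub[np.triu_indices(len(idx), k=2)]
    penalty = non_adjacent.max() if non_adjacent.size else 0.0
    return float(adjacent.min() - penalty)

def crossover(a, b, rng):
    """One-point crossover: head from a, tail from b, skipping words already used."""
    cut = rng.randrange(1, len(a))
    head = a[:cut]
    tail = [w for w in b if w not in head][: len(a) - cut]
    return head + tail

def mutate(chain, n_words, rng, rate=0.2):
    """With probability `rate`, replace one position with a random unused word."""
    chain = list(chain)
    if rng.random() < rate:
        unused = [w for w in range(n_words) if w not in chain]
        if unused:
            chain[rng.randrange(len(chain))] = rng.choice(unused)
    return chain

def evolve(rel, length, pop_size=200, generations=100, seed=0):
    """Evolve random chains and return the fittest one found."""
    rng = random.Random(seed)
    n = rel.shape[0]
    pop = [rng.sample(range(n), length) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: fitness(rel, c), reverse=True)
        parents = pop[: pop_size // 2]            # the fitter half survives unchanged
        children = [
            mutate(crossover(*rng.sample(parents, 2), rng), n, rng)
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=lambda c: fitness(rel, c))
```

Because the fitter half is carried over unchanged, the best score in the population never gets worse from one generation to the next. Whether it reaches the optimum depends on the data and the parameters.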

The simplest solution seems to be probabilistic. You're not looking for the absolute best lists, just a set of "good enough" lists.

1 - Pick a random starting word, add it to your list.

2 - Find the set of all words highly related to the last word in your list (pick a sensible relatedness value based on your data). Pick one word randomly from that set and make sure it doesn't relate too closely to any other word in the list. Loop this until you find one that works (then append it to your list and repeat step 2 until you reach the desired list size) or you exhaust all related words (discard your list and go back to 1).

3 - Go back to 1 until you've constructed enough lists.
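
The three steps above could be sketched roughly as follows. The function names, the `max_attempts` safety cap, and the threshold parameters `min_adjacent` and `max_other` are assumptions for illustration; `rel` is assumed to be the symmetric relatedness matrix as a NumPy array.

```python
import random
import numpy as np

def random_chain(rel, length, min_adjacent, max_other, rng):
    """Try once to build a chain; return it, or None if it hits a dead end."""
    chain = [rng.randrange(rel.shape[0])]                 # step 1: random starting word
    while len(chain) < length:
        last, earlier = chain[-1], chain[:-1]
        # Step 2: all unused words that are highly related to the last word.
        related = np.flatnonzero(rel[last] >= min_adjacent)
        candidates = [int(w) for w in related if w not in chain]
        rng.shuffle(candidates)                           # try them in random order
        for w in candidates:
            if all(rel[w, e] <= max_other for e in earlier):
                chain.append(w)
                break
        else:
            return None                                   # related words exhausted
    return chain

def make_chains(rel, n_lists, length, min_adjacent, max_other,
                seed=0, max_attempts=10_000):
    """Step 3: keep restarting until enough chains exist (or attempts run out)."""
    rng = random.Random(seed)
    found = []
    for _ in range(max_attempts):
        chain = random_chain(rel, length, min_adjacent, max_other, rng)
        if chain is not None:
            found.append(chain)
            if len(found) == n_lists:
                break
    return found
```

Note that this sketch does not stop the same chain from being generated twice, and different chains may share words; filter the results if the experiment needs disjoint lists.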

Preprocess into a different data structure: a dictionary of lists keyed by word.

Each dictionary entry gets a list of the other words, sorted from closest match to furthest match (using your proximity matrix).

Pick a random word as the 1st. The 2nd word is the first word in the 1st word's list (its closest match).

The 3rd word is picked from at or near the end of the 1st word's list, i.e. unrelated. The 4th word is from the start of the 3rd word's list. Repeat.

Having reread the requirements: as you pick up each word (a close match from the left end of a list, a non-match from the right end), you need to revisit the lists of the words picked so far and make sure that the candidate's position is far enough to the right (i.e. a low match) in each of them. If not, advance 1 to the right (next closest) or 1 to the left (next furthest).
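
One possible reading of this preprocessing idea, sketched under assumptions: the lists are built by sorting each row of the relatedness matrix in descending order, and the "far enough to the right" check is done by looking the value up in the matrix rather than by list position. The names `build_neighbour_lists`, `next_word`, `min_adjacent` and `max_other` are hypothetical.

```python
import numpy as np

def build_neighbour_lists(rel):
    """Map each word index to all other word indices, closest match first."""
    order = np.argsort(-rel, axis=1, kind="stable")
    return {w: [int(v) for v in order[w] if v != w] for w in range(rel.shape[0])}

def next_word(rel, neighbours, chain, min_adjacent, max_other):
    """Walk the last word's list from its closest match; return the first fit or None."""
    last, earlier = chain[-1], chain[:-1]
    for w in neighbours[last]:
        if rel[last, w] < min_adjacent:
            break                         # list is sorted, so the rest are weaker still
        if w in chain:
            continue
        # The candidate must be a low match for every earlier (non-adjacent) word.
        if all(rel[w, e] <= max_other for e in earlier):
            return w
    return None
```

The sort is done once up front, so each step only scans the front of one list instead of the whole vocabulary.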
