I'm constructing lists of words for a psychology experiment. We're trying to create a chain of words such that words adjacent in the list are related, but all other words in the list are not related. So for example:
SCHOOL, CAFETERIA, PIZZA, CRUST, EARTH, OCEAN, WHALE, ...
So here we see the first word is related to the second, and the second is related to the third, but the third isn't related to the first. (And the first isn't related to the fourth, fifth, sixth, ... either)
I have a list of 1600 words, where each number from 0 to 1599 corresponds to a word. I also have a very large matrix (1600 x 1600) that tells me, on a scale of 0 to 1, how related each word is to every other word. (These values come from a latent semantic analysis; http://lsa.colorado.edu/ )
I can make the lists, but the process is not very efficient, and my adjacent words aren't very strongly related to each other.
Does anyone have a better approach? The problem with my approach is that I'll sometimes settle for an okay match that meets my criteria when a better match is potentially still out there. Also, it would be nice if I didn't have to dump the whole list but could backtrack a few steps to where the list potentially went wrong.
Loop through our words... If it meets the criteria, add it to the list.
This seems to be the point of issue. You are stopping at the first match, not the best match. Using your 1600 x 1600 matrix of relatedness values, you can simply get the index of the maximum relatedness value among the remaining words, then go to the word list and append the corresponding word.
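A minimal sketch of that greedy best-match idea, assuming the matrix is a NumPy array `relatedness` and `words` is the index-to-word list. The function name `build_chain` and the `unrelated_max` threshold are illustrative assumptions, not part of the original post:

```python
import numpy as np

def build_chain(words, relatedness, length, unrelated_max=0.2, rng=None):
    """Greedily build a chain: each new word is the *best* remaining match
    to the last word, subject to being weakly related to all earlier words."""
    rng = rng or np.random.default_rng()
    n = len(words)
    chain = [int(rng.integers(n))]          # random starting word index
    while len(chain) < length:
        last = chain[-1]
        # Try candidates in order of relatedness to the last word, best first.
        for cand in np.argsort(relatedness[last])[::-1]:
            if cand in chain:
                continue
            # Every non-adjacent word already chosen must be weakly related.
            if all(relatedness[cand][prev] <= unrelated_max
                   for prev in chain[:-1]):
                chain.append(int(cand))
                break
        else:
            raise RuntimeError("no candidate satisfies the constraints")
    return [words[i] for i in chain]
```

Note that this still commits to each word permanently; if it hits a dead end it fails rather than backtracking, which is the weakness the question mentions.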
Try NLTK, the Natural Language Toolkit, from www.nltk.org . NLTK may already have something similar to what you are looking for.
This might be a good candidate for a genetic algorithm. You can create a large number of completely random possibilities, score each one with an objective function, and then iterate the population by crossing over mates based on fitness (possibly throwing some mutations in as well).
If done properly, this should give you a large-ish population of good solutions. If the population is large enough, the fitness function defined well enough and mutation is sufficient to get you out of any valleys you might otherwise get stuck in, you might even converge overwhelmingly on the optimal answer.
The simplest solution seems to be probabilistic. You're not looking for the absolute best lists, just a set of "good enough" lists.
1 - Pick a random starting word, add it to your list.
2 - Find the set of all words highly related to the last word in your list (pick a sensible relatedness threshold based on your data). Pick one word randomly from that set and make sure it doesn't relate too closely to any other word already in the list. Loop until you find one that works (then append it to your list and repeat step 2 until you reach the desired list size) or until you exhaust all related words (discard your list and go back to 1).
3 - go back to 1 until you've constructed enough lists.
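The three steps above can be sketched like this; the threshold values `related_min` and `unrelated_max` and the restart cap are assumptions you would tune to your data:

```python
import random

def make_list(words, relatedness, size, related_min=0.6, unrelated_max=0.2,
              max_restarts=1000):
    """Build one list by the randomized procedure above; restart on dead ends."""
    n = len(words)
    for _ in range(max_restarts):
        chain = [random.randrange(n)]                          # step 1
        while len(chain) < size:
            last = chain[-1]
            # Step 2: all sufficiently related, not-yet-used candidates.
            candidates = [w for w in range(n)
                          if w not in chain
                          and relatedness[last][w] >= related_min]
            random.shuffle(candidates)
            for cand in candidates:
                # Reject candidates too related to any non-adjacent word.
                if all(relatedness[cand][prev] <= unrelated_max
                       for prev in chain[:-1]):
                    chain.append(cand)
                    break
            else:
                break                                          # dead end: restart
        if len(chain) == size:
            return [words[i] for i in chain]
    raise RuntimeError("no valid list found; try loosening the thresholds")
```

Step 3 is then just calling `make_list` repeatedly until you have enough lists.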
Preprocess into a different data structure: a dictionary of lists, keyed by word.
Each word's entry is a list of the other words, sorted from most to least related (using your proximity matrix).
Pick a random word for the 1st. The 2nd word is the first entry in the 1st word's list (the closest match).
The 3rd word is picked from at/near the end of the 1st word's list, i.e. unrelated to it. The 4th word comes from the start of the 3rd word's list. Repeat.
Rereading the requirements: as you pick each word (a close match from the front of a list, a non-match from the back), you need to revisit the lists of the words picked so far and make sure the candidate sits far enough toward the unrelated end of each (i.e. is a low match). If it doesn't, advance one entry right (next closest) or one entry left (next furthest) and check again.
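A small sketch of that preprocessing step, assuming the proximity matrix is a NumPy array `relatedness`; the function name `preprocess` is just for illustration:

```python
import numpy as np

def preprocess(relatedness):
    """For each word index, build a list of the other indices sorted most-
    to least-related: the front is the closest match, the back is unrelated."""
    neighbors = {}
    for i in range(len(relatedness)):
        order = np.argsort(relatedness[i])[::-1]       # most related first
        neighbors[i] = [int(j) for j in order if j != i]
    return neighbors
```

With this in hand, the 2nd word is `neighbors[first][0]`, and the 3rd is drawn from near the end of `neighbors[first]`.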