
find the longest word made of other words

I am working on a problem, which is to write a program to find the longest word made of other words in a list of words.

EXAMPLE
Input: test, tester, testertest, testing, testingtester
Output: testingtester

I searched and found the solution below. My question is about step 2: why should we break each word in all possible ways? Why not use each word directly as a whole? If anyone could give some insight, that would be great.

The solution below does the following:

  1. Sort the array by size, putting the longest word at the front
  2. For each word, split it in all possible ways. That is, for “test”, split it into {“t”, “est”}, {“te”, “st”} and {“tes”, “t”}.
  3. Then, for each pairing, check if the first half and the second both exist elsewhere in the array.
  4. “Short circuit” by returning the first string we find that fits condition #3.
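
For concreteness, a minimal Python sketch of those four steps might look like the code below (the function and variable names are illustrative, not from the solution I found):

def longest_compound_word_two_parts(words):
    word_set = set(words)
    # Step 1: consider the longest words first
    for w in sorted(words, key=len, reverse=True):
        # Step 2: split the word at every internal position
        for i in range(1, len(w)):
            left, right = w[:i], w[i:]
            # Step 3: check whether both halves exist in the list
            if left in word_set and right in word_set:
                # Step 4: short-circuit on the first hit, which is the longest
                return w
    return None

print(longest_compound_word_two_parts(
    ["test", "tester", "testertest", "testing", "testingtester"]))
# prints: testingtester

Note that, like the solution it sketches, this only detects words made of exactly two other words.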

Answering your question indirectly, I believe the following is an efficient way to solve this problem using tries.

Build a trie from all of the words in your list.

Sort the words so that the longest word comes first.

Now, for each word W, start at the top of the trie and begin following the word down the tree one letter at a time using letters from the word you are testing.

Each time a word ends, recursively re-enter the trie from the top making a note that you have "branched". If you run out of letters at the end of the word and have branched, you've found a compound word and, because the words were sorted, this is the longest compound word.

If the letters stop matching at any point, or you run out and are not at the end of the word, just backtrack to wherever it was that you branched and keep plugging along.

I'm afraid I don't know Java that well, so I'm unable to provide you with sample code in that language. I have, however, written out a solution in Python (using a trie implementation from this answer). Hopefully it is clear to you:

#!/usr/bin/env python3

#End of word symbol
_end = '_end_'

#Make a trie out of nested HashMap, UnorderedMap, dict structures
def MakeTrie(words):
  root = dict()
  for word in words:
    current_dict = root
    for letter in word:
      current_dict = current_dict.setdefault(letter, {})
    current_dict[_end] = _end
  return root

def LongestCompoundWord(original_trie, trie, word, level=0):
  #Walk `word` down `trie` one letter at a time; `level` counts how many
  #complete words we have already consumed (i.e. how often we have
  #"branched" back to the top of the trie).
  first_letter = word[0]
  if first_letter not in trie:
    return False
  #Last letter: the word is a compound only if we branched at least once
  if len(word)==1 and _end in trie[first_letter]:
    return level>0
  #A dictionary word ends here: try re-entering the trie from the top on the rest
  if _end in trie[first_letter] and LongestCompoundWord(original_trie, original_trie, word[1:], level+1):
    return True
  #Otherwise keep walking down the current branch
  return LongestCompoundWord(original_trie, trie[first_letter], word[1:], level)

#Words that were in your question
words = ['test','testing','tester','teste', 'testingtester', 'testingtestm', 'testtest','testingtest']

trie = MakeTrie(words)

#Sort words in order of decreasing length
words = sorted(words, key=lambda x: len(x), reverse=True)

for word in words:
  if LongestCompoundWord(trie,trie,word):
    print("Longest compound word was '{0:}'".format(word))
    break

With the above in mind, the answer to your original question becomes clearer: we do not know ahead of time which combination of prefix words will take us successfully through the tree. Therefore, we need to be prepared to check all possible combinations of prefix words.

Since the algorithm you found has no efficient way of knowing which prefixes of a word are themselves words, it splits the word at every possible point to ensure that all prefixes are generated.

I guess you are just confused about which words get split.

After sorting, you consider the words one after the other, by decreasing length. Let us call a "candidate" a word you are trying to decompose.

If the candidate is made of other words, it certainly starts with a word, so you will compare all prefixes of the candidate to all possible words.

During the comparison step, you compare a prefix of the candidate to whole words, not to split words.


By the way, the given solution will not work for words composed of three or more words. The fix is as follows:

  • try every prefix of the candidate and compare it to all words
  • in case of a match, repeat the search with the suffix.

Example:

testingtester gives the prefixes

t, te, tes, test, testi, testin, testing, testingt, testingte, testingtes and testingteste

Among these, test and testing are words. Then you need to try the corresponding suffixes ingtester and tester.

ingtester gives

i, in, ing, ingt, ingte, ingtes, ingtest and ingteste, none of which are words.

tester is a word and you are done.


IsComposite(InitialCandidate, Candidate):
    For all Prefixes of Candidate:
        if Prefix is in Words:
            Suffix = Candidate - Prefix
            if Suffix == "":
                if Candidate != InitialCandidate:
                    return True
            else if IsComposite(InitialCandidate, Suffix):
                return True
    return False

For all Candidate words by decreasing size:
    if IsComposite(Candidate, Candidate):
        print Candidate
        break
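
For anyone who prefers runnable code, here is a direct Python rendering of this pseudocode (the word set is the one from the question; the names mirror the pseudocode):

def IsComposite(initial_candidate, candidate, words):
    # Try every prefix of the candidate, shortest first
    for i in range(1, len(candidate) + 1):
        prefix, suffix = candidate[:i], candidate[i:]
        if prefix in words:
            if suffix == "":
                # A complete match only counts if it is not simply the
                # original candidate matching itself in one piece
                if candidate != initial_candidate:
                    return True
            elif IsComposite(initial_candidate, suffix, words):
                return True
    return False

words = {"test", "tester", "testertest", "testing", "testingtester"}
for candidate in sorted(words, key=len, reverse=True):
    if IsComposite(candidate, candidate, words):
        print(candidate)   # prints testingtester
        break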

Richard's answer will work well in many cases, but it can take exponential time: this will happen if there are many segments of the string W, each of which can be decomposed in multiple different ways. For example, suppose W is abcabcabcd, and the other words are ab, c, a and bc. Then the first 3 letters of W can be decomposed either as ab|c or as a|bc ... and so can the next 3 letters, and the next 3, for 2^3 = 8 possible decompositions of the first 9 letters overall:

a|bc|a|bc|a|bc
a|bc|a|bc|ab|c
a|bc|ab|c|a|bc
a|bc|ab|c|ab|c
ab|c|a|bc|a|bc
ab|c|a|bc|ab|c
ab|c|ab|c|a|bc
ab|c|ab|c|ab|c

All of these partial decompositions necessarily fail in the end, since there is no word in the input that contains W's final letter d -- but his algorithm will explore them all before discovering this. In general, a word consisting of n copies of abc followed by a single d will take O(n*2^n) time.
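
If you want to reproduce this blow-up, a worst-case input of the shape described above can be generated with a small helper (the helper name is mine):

def worst_case_input(n):
    # n copies of "abc" followed by a "d" that no word can ever cover
    w = "abc" * n + "d"
    other_words = ["ab", "c", "a", "bc"]
    return w, other_words

w, other_words = worst_case_input(3)   # w == "abcabcabcd", as in the example above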

We can improve this to O(n^2) worst-case time (at the cost of O(n) space) by recording extra information about the decomposability of suffixes of W as we go along -- that is, suffixes of W that we have already discovered we can or cannot match to word sequences. This type of algorithm is called dynamic programming.

The condition we need for some word W to be decomposable is exactly that W begins with some word X from the set of other words, and the suffix of W beginning at position |X|+1 is decomposable. (I'm using 1-based indices here, and I'll denote a substring of a string S beginning at position i and ending at position j by S[i..j].)

Whenever we discover that the suffix of the current word W beginning at some position i is or is not decomposable, we can record this fact and make use of it later to save time. For example, after testing the first 4 decompositions in the 8 listed earlier, we know that the suffix of W beginning at position 4 (i.e., abcabcd) is not decomposable. Then when we try the 5th decomposition, i.e., the first one starting with ab, we first ask the question: is the rest of W, i.e. the suffix of W beginning at position 3, decomposable? We don't know yet, so we try adding c to get ab|c, and then we ask: is the rest of W, i.e. the suffix of W beginning at position 4, decomposable? And we find that it has already been found not to be -- so we can immediately conclude that no decomposition of W beginning with ab|c is possible either, instead of having to grind through all 4 possibilities.

Assuming for the moment that the current word W is fixed, what we want to build is a function f(i) that determines whether the suffix of W beginning at position i is decomposable. Pseudo-code for this could look like:

- Build a trie the same way as Richard's solution does.
- Initialise the array KnownDecomposable[] to |W| DUNNO values.

f(i):
    - If i == |W|+1 then return 1.  (The empty suffix means we're finished.)
    - If KnownDecomposable[i] is TRUE or FALSE, then immediately return it.
    - MAIN BODY BEGINS HERE
    - Walk through Richard's trie from the root, following characters in the
      suffix W[i..|W|].  Whenever we find a trie node at some depth j that
      marks the end of a word in the set:
        - Call f(i+j) to determine whether the rest of W can be decomposed.
        - If it can (i.e. if f(i+j) == 1):
            - Set KnownDecomposable[i] = TRUE.
            - Return TRUE.
    - If we make it to this point, then we have considered all other
      words that form a prefix of W[i..|W|], and found that none of
      them yield a suffix that can be decomposed.
    - Set KnownDecomposable[i] = FALSE.
    - Return FALSE.

Calling f(1) then tells us whether W is decomposable.

By the time a call to f(i) returns, KnownDecomposable[i] has been set to a non-DUNNO value (TRUE or FALSE). The main body of the function is only run if KnownDecomposable[i] is DUNNO. Together these facts imply that the main body of the function will only run as many times as there are distinct values i that the function can be called with. There are at most |W|+1 such values, which is O(n), and outside of recursive calls, a call to f(i) takes at most O(n) time to walk through Richard's trie, so overall the time complexity is bounded by O(n^2).
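
Here is one way the pseudocode could be turned into runnable Python. It reuses MakeTrie and _end from the trie answer above, switches to 0-based indices, and (like the earlier solutions) rejects the degenerate case of W matching itself in a single piece; treat it as a sketch rather than a reference implementation:

def IsDecomposable(W, trie):
    # KnownDecomposable[i] caches whether the suffix W[i:] can be written
    # as a sequence of words from the trie (None means DUNNO)
    n = len(W)
    KnownDecomposable = [None] * (n + 1)
    KnownDecomposable[n] = True      # the empty suffix means we are finished

    def f(i):
        if KnownDecomposable[i] is not None:
            return KnownDecomposable[i]
        node = trie
        j = i
        result = False
        while j < n and W[j] in node:
            node = node[W[j]]
            j += 1
            # A dictionary word ends at this depth; skip the degenerate
            # match of W against itself in one piece at the top level
            if _end in node and not (i == 0 and j == n):
                if f(j):
                    result = True
                    break
        KnownDecomposable[i] = result
        return result

    return f(0)

words = ['test', 'testing', 'tester', 'testingtester']
trie = MakeTrie(words)
for word in sorted(words, key=len, reverse=True):
    if IsDecomposable(word, trie):
        print(word)   # prints testingtester
        break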

I would probably use recursion here. Start with the longest word and find words it starts with. For any such word remove it from the original word and continue with the remaining part in the same manner.

Pseudo code:

function iscomposed(originalword, wordpart)
  for word in allwords
    if word <> originalword
      if wordpart = word
        return yes
      elseif wordpart starts with word
        if iscomposed(originalword, wordpart - word)
          return yes
        endif
      endif
    endif
  next
  return no
end

main
  sort allwords by length descending
  for word in allwords
    if iscomposed(word, word) return word
  next
end

Example:

words:
abcdef
abcde
abc
cde
ab

Passes:

1. abcdef starts with abcde. rest = f. 2. no word that f starts with is found.
1. abcdef starts with abc. rest = def. 2. no word that def starts with is found.
1. abcdef starts with ab. rest = cdef. 2. cdef starts with cde. rest = f. 3. no word that f starts with is found.
1. abcde starts with abc. rest = cde. 2. cde itself is found. abcde is a composed word.
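
The same pseudocode translates almost line by line into Python (using the word list from the example above):

def iscomposed(originalword, wordpart, allwords):
    for word in allwords:
        if word != originalword:
            if wordpart == word:
                return True
            elif wordpart.startswith(word):
                # Strip the matched prefix and keep decomposing the rest
                if iscomposed(originalword, wordpart[len(word):], allwords):
                    return True
    return False

allwords = ["abcdef", "abcde", "abc", "cde", "ab"]
for word in sorted(allwords, key=len, reverse=True):
    if iscomposed(word, word, allwords):
        print(word)   # prints abcde
        break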

To find the longest word using recursion:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class FindLongestWord {

    public static void main(String[] args) {
        List<String> input = new ArrayList<>(
                Arrays.asList("cat", "banana", "rat", "dog", "nana", "walk", "walker", "dogcatwalker"));

        // Work on a copy sorted by decreasing length so the longest candidate is tried first
        List<String> sortedList = input.stream().sorted(Comparator.comparing(String::length).reversed())
                .collect(Collectors.toList());

        boolean isWordFound = false;
        for (String word : sortedList) {
            // Remove the candidate itself so it cannot count as one of its own parts
            input.remove(word);
            if (findPrefix(input, word)) {
                System.out.println("Longest word is : " + word);
                isWordFound = true;
                break;
            }
        }
        if (!isWordFound)
            System.out.println("Longest word not found");
    }

    // True if `word` can be fully consumed by repeatedly stripping off a word
    // from `input` that it starts with.
    public static boolean findPrefix(List<String> input, String word) {
        if (word.isEmpty())
            return true;
        for (int i = 0; i < input.size(); i++) {
            if (word.startsWith(input.get(i))) {
                // Strip only the matched prefix (replace() would remove every
                // occurrence of the substring, not just the leading one)
                if (findPrefix(input, word.substring(input.get(i).length())))
                    return true;
            }
        }
        return false;
    }
}
