
Logic behind program faulty, program doesn't produce correct output

This is Python code to find the type-token ratio (all definitions are given in the docstring below). I cannot get the correct value. I suspect that my logic is faulty and I am not capable of debugging it myself. I would appreciate any help.

def type_token_ratio(text):
    """ 
    (list of str) -> float

    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.

    Return the Type Token Ratio (TTR) for this text. TTR is the number of
    different words divided by the total number of words.

    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n',
    ...     'James Gosling\n']
    >>> type_token_ratio(text)
    0.8888888888888888
    """

    x = 0
    while x < len(text):
        text[x] = text[x].replace('\n', '')
        x = x + 1
    index = 0
    counter = 0
    number_of_words = 0

    words = ' '.join(text)
    words = clean_up(words)
    words = words.replace(',', '')
    lst_of_words = words.split()

    for word1 in lst_of_words:
        while index < len(lst_of_words):
            if word1 == lst_of_words[index]:
                counter = counter + 1
            index = index + 1
    return ((len(lst_of_words) - counter)/len(lst_of_words)) 

There is a far simpler way of doing this, using the collections module:

import collections 

def type_token_ratio(text):
    """ (list of str) -> float

    Precondition: text is non-empty. Each str in text ends with \n and
    text contains at least one word.

    Return the Type Token Ratio (TTR) for this text. TTR is the number of
    different words divided by the total number of words.

    >>> text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n',
    ...     'James Gosling\n']
    >>> type_token_ratio(text)
    0.8888888888888888
    """
    words = " ".join(text).split()   # gives a list of all the words
    counts = collections.Counter(words)
    total = sum(counts.values())     # total number of words
    unique = len(counts)             # number of different words
    return float(unique) / total
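For the docstring example, counts simply maps each token to how many times it appears ('James' twice, everything else once; note that 'Peter,' and 'Paul,' keep their commas here, which happens not to matter for this input). A quick check of the numbers used above (the sample string is just the docstring example joined by hand):

sample = "James Fennimore Cooper Peter, Paul, and Mary James Gosling"
counts = collections.Counter(sample.split())
print(counts['James'], len(counts), sum(counts.values()))  # 2 8 9 -> 8/9 = 0.8888...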

Or, as @Yoel pointed out, there is an even simpler way:

def type_token_ratio(text):
    words = " ".join(text).split()  # gives a list of all the words
    return len(set(words)) / float(len(words))
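Either version can be sanity-checked against the docstring example (a minimal check, assuming nothing beyond the code above):

text = ['James Fennimore Cooper\n', 'Peter, Paul, and Mary\n', 'James Gosling\n']
print(type_token_ratio(text))  # 0.8888888888888888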

Here is what you might have wanted to write (replacing your code starting from the for loop):

init_index = 1
for word1 in lst_of_words:
    index = init_index
    while index < len(lst_of_words):
        if word1 == lst_of_words[index]:
            counter = counter + 1
            break
        index = index + 1
    init_index = init_index + 1
    print(word1)
print(counter)
r = float(len(lst_of_words) - counter) / len(lst_of_words)
print('%.2f' % r)
return r

=> index = init_index is in fact the index of the word following word1, so the search always restarts at the next word.

=> break: so the same occurrence is not counted more than once; at most one occurrence is counted per for iteration.

For each word, you are checking whether a duplicate of it exists in the remainder of the list (everything up to this word has already been handled by previous iterations).

Care must be taken not to count the same occurrence several times, which is why there is a break. If there are multiple occurrences of the same word, the next occurrence will be found in a later iteration.

This is not bulletproof; it is based on your code.
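For illustration only, here is the same duplicate-counting idea written as a standalone sketch with enumerate and a slice (not your original code, just a way to see why counter ends up at 1 for the example list):

lst_of_words = ['James', 'Fennimore', 'Cooper', 'Peter', 'Paul',
                'and', 'Mary', 'James', 'Gosling']

counter = 0
for i, word1 in enumerate(lst_of_words):
    # look only at the words after word1, counting at most one duplicate per word
    if word1 in lst_of_words[i + 1:]:
        counter += 1

print(counter)                                            # 1 ('James' reappears later)
print((len(lst_of_words) - counter) / len(lst_of_words))  # 0.8888888888888888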
