
Print a list of unique words from a text file after removing punctuation, and find the longest word

The goal is to a) print a list of unique words from a text file and b) find the longest word.

  • I cannot use imports in this challenge.

The file handling and main functionality do what I want, but the list needs to be cleaned: as you can see from the output, words are getting joined with punctuation, so maxLength is obviously incorrect.

with open("doc.txt") as reader, open("unique.txt", "w") as writer:

    unwanted = "[],."
    unique = set(reader.read().split())
    unique = list(unique) 
    unique.sort(key=len)
    regex = [elem.strip(unwanted).split() for elem in unique]
    writer.write(str(regex))
    reader.close()

    maxLength = len(max(regex,key=len ))
    print(maxLength)
    res = [word for word in regex if len(word) == maxLength]
    print(res)
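For reference (my own illustration, not part of the original question): `str.strip(chars)` only removes characters from the *ends* of a string, never from the inside, which is why tokens like `[7][8][9]` stay glued together and inflate the length count:

```python
unwanted = "[],."

# strip() trims matching characters from both ends only...
print("ago,".strip(unwanted))       # -> ago

# ...but stops at the first non-matching character, so interior
# punctuation survives and the token length is wrong.
print("[7][8][9]".strip(unwanted))  # -> 7][8][9
```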



===========

Sample:

pioneered the integrated placement year concept over 50 years ago [7][8][9] with more than 70 per cent of students taking a placement year, the highest percentage in the UK.[10]

Here's a solution that uses str.translate() to throw away all the bad characters (plus newlines) before we ever do the split() . (Normally we'd use a regex with re.sub() , but imports aren't allowed.) This makes the cleaning a one-liner, which is really neat:

bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# We can directly read and clean the entire input, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
#    cleaned_input = reader.read().translate(bad_transtable)

# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))   

with open("unique.txt", "w") as writer:
    for word in unique_words:
        writer.write(f'{word}\n')

max_length = len(unique_words[0])
print([word for word in unique_words if len(word) == max_length])
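Run against the sample line from the question, the approach above picks out the two ten-letter words (a self-contained sketch; the sample text is inlined here rather than read from doc.txt):

```python
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# the sample line from the question, inlined instead of read from doc.txt
sample = ("pioneered the integrated placement year concept over 50 years "
          "ago [7][8][9] with more than 70 per cent of students taking a "
          "placement year, the highest percentage in the UK.[10]")

cleaned = sample.translate(bad_transtable)
unique_words = sorted(set(cleaned.split()), key=lambda w: -len(w))

max_length = len(unique_words[0])
print(sorted(w for w in unique_words if len(w) == max_length))
# -> ['integrated', 'percentage']
```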

Notes:

  • since the input is already fully cleaned before we split it, there's no need to append to a list or insert into a set as we go and then make a second cleaning pass later. We can create unique_words directly, using set() to keep only the unique words, and sorted(..., key=lambda w: -len(w)) to order them by decreasing length. We only need to sort once, and there's no iterative append to lists.
  • hence we're guaranteed that max_length == len(unique_words[0])
  • this approach is also more performant than nested loops (for line in <lines>: for word in line.split(): ...) with an iterative append() to a word list
  • no need for explicit reader/writer open()/close() calls; that's what the with statement does for you. (It's also more elegant for handling I/O when exceptions happen.)
  • you could also merge the printing of the max_length words into the writer loop, but it's cleaner code to keep them separate.
  • note we use the f-string f'{word}\n' to add the newline back when we write() each output line
  • in Python we use lower_case_with_underscores for variable names, hence max_length rather than maxLength; see PEP 8
  • in fact, here we don't strictly need a with statement for the reader, if all we're going to do is slurp its entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines at a time.)
  • str.maketrans() is a builtin, but if your teacher objects to the str. reference, you can also call it on a bound string, e.g. ' '.maketrans(...)
  • str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters rather than Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and memory-hungry; regex on Unicode is easier since you can match entire character classes.
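To illustrate the last two bullets — calling maketrans on a bound string, and the dict form it also accepts (a small sketch of my own, not from the original answer):

```python
bad = "[],.\n"

# maketrans is available on any str instance, so ' '.maketrans(...)
# works identically to str.maketrans(...).
table = ' '.maketrans(bad, ' ' * len(bad))

# It also accepts a single dict argument mapping characters (or code
# points) to replacement strings, or to None to delete them outright.
delete_table = str.maketrans({'[': None, ']': None, ',': None, '.': None})

print("ago [7][8][9] with".translate(table))         # punctuation -> spaces
print("ago [7][8][9] with".translate(delete_table))  # punctuation deleted
```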

Alternative solution if you don't yet know str.translate()

bad = "[],.\n"

dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', you have to manually
# str.replace() each bad char one by one (or use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')

And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this: it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, got an IOError, or unique_words wasn't a list:

open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])

Here is a solution. The trick is to use the Python str method .isalpha() to filter out non-alphabetic characters.

with open("unique.txt", "w") as writer:
    with open("doc.txt") as reader:
        cleaned_words = []
        for line in reader:
            for word in line.split():
                cleaned_word = ''.join([c for c in word if c.isalpha()])
                if cleaned_word:
                    cleaned_words.append(cleaned_word)

        # print unique words
        unique_words = set(cleaned_words)
        print(unique_words)

        # write words to file? depends what you need here
        for word in unique_words:
            writer.write(str(word))
            writer.write('\n')

        # print length of longest
        print(len(sorted(unique_words, key=len, reverse=True)[0]))

Here is another solution that doesn't rely on any special functions.

bad = '`~@#$%^&*()-_=+[]{}\\|;\':".>?<,/?'

# read the whole file as one string
a = open('doc.txt').read()

clean = ''
for i in a:
    if i not in bad:
        clean += i
    else:
        clean += ' '

# split() with no argument splits on any whitespace (including newlines)
# and drops the empty strings for us
cleans = clean.split()

clean_uniq = list(set(cleans))

clean_uniq.sort(key=len)

print(clean_uniq)
print(len(clean_uniq[-1]))
