Goal is to a) print a list of unique words from a text file and also b) find the longest word.
File handling and main functionality are what I want, however the list needs to be cleaned. As you can see from the output, words are getting joined with punctuation and therefore maxLength
is obviously incorrect.
with open("doc.txt") as reader, open("unique.txt", "w") as writer:
unwanted = "[],."
unique = set(reader.read().split())
unique = list(unique)
unique.sort(key=len)
regex = [elem.strip(unwanted).split() for elem in unique]
writer.write(str(regex))
reader.close()
maxLength = len(max(regex,key=len ))
print(maxLength)
res = [word for word in regex if len(word) == maxLength]
print(res)
===========
Sample:
pioneered the integrated placement year concept over 50 years ago [7][8][9] with more than 70 per cent of students taking a placement year, the highest percentage in the UK.[10]
Here's a solution that uses str.translate()
to throw away all bad characters (+ newline) before we ever do the split()
. (Normally we'd use a regex with re.sub()
, but you're not allowed.) This makes the cleaning a one-liner, which is really neat:
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))
# We can directly read and clean the entire output, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
# cleaned_input = reader.read().translate(bad_transtable)
# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))
with open("unique.txt", "w") as writer:
for word in unique_words:
writer.write(f'{word}\n')
max_length = len(unique_words[0])
print ([word for word in unique_words if len(word) == max_length])
Notes:
unique_words
directly! (using set()
to keep only uniques). And while we're at it, we might as well use sorted(..., key=lambda w: -len(w))
to sort it in decreasing length. Only need to call sort()
once. And no iterative-append to lists.max_length = len(unique_words[0])
for line in <lines>: for word in line.split(): ...iterative append() to wordlist
writer/reader
. open()/.close()
, that's what the with
statement does for you. (It's also more elegant for handling IO when exceptions happen.) f'{word}\n'
to add the newline back when we write()
an output linemax_length
not maxLength
. See PEP8open('doc.txt').read()
. (That's not scaleable for huge files, you'd have to read in chunks or n lines). str.maketrans()
is a builtin, but if your teacher objects to the module reference, you can also call it on a bound string eg ' '.maketrans()
str.maketrans()
is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode , but building and using huge translation dicts is annoying and uses memory, regex on Unicode is easier, you can define entire character classes.str.translate()
dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
cleaned_input = cleaned_input.replace(bad_char, ' ')
And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list-comprehension. Don't do this, it would be terrible for debugging, eg if you couldn't open/write/overwrite the output file, or got IOError, or unique_words wasn't a list, etc:
open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
Here is a solution. The trick is to use the python str method .isalpha()
to filter non-alphanumerics.
with open("unique.txt", "w") as writer:
with open("doc.txt") as reader:
cleaned_words = []
for line in reader.readlines():
for word in line.split():
cleaned_word = ''.join([c for c in word if c.isalpha()])
if len(cleaned_word):
cleaned_words.append(cleaned_word)
# print unique words
unique_words = set(cleaned_words)
print(unique_words)
# write words to file? depends what you need here
for word in unique_words:
writer.write(str(word))
writer.write('\n')
# print length of longest
print(len(sorted(unique_words, key=len, reverse=True)[0]))
Here is another solution without any function.
bad = '`~@#$%^&*()-_=+[]{}\|;\':\".>?<,/?'
clean = ' '
for i in a:
if i not in bad:
clean += i
else:
clean += ' '
cleans = [i for i in clean.split(' ') if len(i)]
clean_uniq = list(set(cleans))
clean_uniq.sort(key=len)
print(clean_uniq)
print(len(clean_uniq[-1]))
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.