I am new to Python and am trying to find the longest word in alice_in_wonderland.txt. I think I have a good system set up (see below), but my output is a "word" made of several words joined by dashes. Is there some way to remove the dashes when reading the file? For the text file visit here
sample from text file:
That's very important,' the King said, turning to the jury. They were just beginning to write this down on their slates, when the White Rabbit interrupted: UNimportant, your Majesty means, of course,' he said in a very respectful tone, but frowning and making faces at him as he spoke. " UNimportant, of course, I meant,' the King hastily said, and went on to himself in an undertone, important--unimportant-- unimportant--important--' as if he were trying which word sounded best."
code:
#String input
with open("alice_in_wonderland.txt", "r") as myfile:
    string=myfile.read().replace('\n','')
#initialize list
my_list = []
#Split words into list
for word in string.split(' '):
    my_list.append(word)
#initialize list
uniqueWords = []
#Fill in new list with unique words to shorten final printout
for i in my_list:
    if not i in uniqueWords:
        uniqueWords.append(i)
#Length of longest word
count = 0
#Longest word place holder
longest = []
for word in uniqueWords:
    if len(word)>count:
        longest = word
        count = len(longest)
print longest
>>> import nltk # pip install nltk
>>> nltk.download('gutenberg')
>>> words = nltk.corpus.gutenberg.words('carroll-alice.txt')
>>> max(words, key=len) # find the longest word
'disappointment'
Here's one way using re and mmap:
import re
import mmap
with open('your alice in wonderland file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = re.finditer('\w+', mf)
    print max((word.group() for word in words), key=len)
# disappointment
This is far more memory-efficient than reading the whole file into a string, because mmap lets the operating system page the file in as it is scanned.
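The snippet above is Python 2. A rough Python 3 equivalent is sketched below; it is not the original answer's code. It assumes the file is non-empty (an empty file cannot be memory-mapped on most platforms) and mostly ASCII, since a bytes pattern makes \w match only ASCII letters, digits and underscore. The function name longest_word_mmap is made up for illustration.

```python
import mmap
import re


def longest_word_mmap(path):
    """Return the longest \\w+ token in the file at `path` (hypothetical helper)."""
    # Open in binary mode: in Python 3 a memory map exposes bytes, not str.
    with open(path, "rb") as fin:
        with mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ) as mf:
            # The pattern must be a bytes pattern to search a bytes-like object.
            tokens = (match.group() for match in re.finditer(rb"\w+", mf))
            longest = max(tokens, key=len, default=b"")
    # Decode at the end; errors="replace" guards against stray non-ASCII bytes.
    return longest.decode("ascii", errors="replace")
```

Because the dashes are not matched by \w, a run such as important--unimportant is split into separate words without any extra replace call.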
Use str.replace to replace the dashes with spaces (or whatever you want). To do this, simply add another call to replace after the first call on line 3:
string=myfile.read().replace('\n','').replace('-', ' ')
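For completeness, here is a minimal Python 3 sketch of the same idea as a small function. Two things are my own assumptions rather than part of the answer: newlines are replaced with a space instead of an empty string (replacing them with '' glues the last word of one line to the first word of the next), and punctuation is dropped by keeping only runs of ASCII letters. The name longest_word is made up for illustration.

```python
import re


def longest_word(text):
    """Return the longest run of ASCII letters in `text` (hypothetical helper)."""
    # Treat line breaks and dashes as separators, so that
    # "important--unimportant" becomes two words instead of one.
    cleaned = text.replace('\n', ' ').replace('-', ' ')
    # Keep only alphabetic runs; quotes, commas and colons are discarded.
    words = re.findall(r"[A-Za-z]+", cleaned)
    return max(words, key=len) if words else ''


sample = ("important--unimportant-- unimportant--important--' as if he were\n"
          "trying which word sounded best.")
print(longest_word(sample))
# unimportant
```

To run it on the whole book, read the file with open(...).read() and pass the result to longest_word.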