Alright, all you genius programmers and developers... I could really use some help on this one, please.
I'm currently taking the 'Python for Everybody Specialization', which is offered through Coursera ( https://www.coursera.org/specializations/python ), and I'm stuck on an assignment.
I cannot figure out how to create a list that contains only the first instance of each word found in a string:
Example string:
my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
Desired list:
words_list = ['How', 'much', 'wood', 'would',
'a', 'woodchuck', 'chuck', 'if']
Thank you all for your time, consideration, and contributions!
You can keep a list of the words that have already been seen, strip the non-alphabetic characters from each word, and append only the words not seen before:
my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
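new_l = []    # words already seen
final_l = []  # first occurrences, in order of appearance
for word in my_string.split():
    # Keep only the alphabetic characters, e.g. 'chuck,' -> 'chuck'
    word = ''.join(i for i in word if i.isalpha())
    if word not in new_l:
        final_l.append(word)
        new_l.append(word)
print(final_l)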
Output:
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
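A possible variation on the loop above, as a minimal sketch: it tracks seen words in a set, so the membership test does not have to scan a list on every iteration. The function name `first_occurrences` is only an illustrative choice, and the check for an empty `word` is an added assumption to skip tokens that consist solely of punctuation.

```python
def first_occurrences(text):
    """Return each word of text once, in order of first appearance."""
    seen = set()   # fast membership test
    result = []    # preserves the order of first appearance
    for raw in text.split():
        # Keep only the alphabetic characters, e.g. 'wood?' -> 'wood'
        word = ''.join(ch for ch in raw if ch.isalpha())
        # Skip empty strings (tokens made only of punctuation) and repeats
        if word and word not in seen:
            seen.add(word)
            result.append(word)
    return result


my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
print(first_occurrences(my_string))
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
```

For a sentence this short the difference is negligible; it only starts to matter on long texts.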
This can be accomplished in two steps: first remove the punctuation, then add the words to a set, which removes duplicates. Note that a set does not preserve the order in which the words first appeared.
Python 3:
from string import punctuation # This is a string of all ascii punctuation characters
trans = str.maketrans('', '', punctuation)
text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(trans)
words = set(text.split())
Python 2:
from string import punctuation # This is a string of all ascii punctuation characters
text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(None, punctuation)
words = set(text.split())
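If the order of first appearance matters, as in the desired list from the question, one option (a sketch, assuming Python 3.7 or later, where dictionaries are guaranteed to keep insertion order) is to deduplicate with `dict.fromkeys` instead of `set`. The variable names `cleaned` and `unique_in_order` are just illustrative.

```python
from string import punctuation  # string of all ASCII punctuation characters

text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'

# Step 1: remove punctuation, as in the Python 3 snippet above
cleaned = text.translate(str.maketrans('', '', punctuation))

# Step 2: dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while keeping the first occurrence of each word
unique_in_order = list(dict.fromkeys(cleaned.split()))

print(unique_in_order)
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
```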
You can use the re module and cast the result to a set in order to remove duplicates:
>>> import re
>>> my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
>>> words_list = re.findall(r'\w+', my_string) # Find all words in your string (without punctuation)
>>> words_list_unique = sorted(set(words_list), key=words_list.index) # Cast your result to a set to remove duplicates, then sort by first position to get a list in the original order.
>>> print(words_list_unique)
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
Explanation:
\w matches a single word character (a letter, digit, or underscore), and \w+ matches a run of them, i.e. a word. re.findall(r'\w+', my_string) therefore finds all the words in my_string. A set is a collection of unique elements, so casting the result list from re.findall() to a set removes the duplicates. sorted() then turns the set back into a list, and key=words_list.index keeps the words in their original order, because sets are unordered collections.
Since all instances of a word are identical, I'm going to take the question to mean that you want a unique list of words that appear in the string. Probably the easiest way to do this is:
import re
non_unique_words = re.findall(r'\w+', my_string)
unique_words = list(set(non_unique_words))
re.findall returns every word in the string, and converting to a set and back to a list makes the results unique, although the original order of the words is not kept.
Try this:
my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
def replace(word, block):
    # Remove every character listed in block from word
    for i in block:
        word = word.replace(i, '')
    return word
my_string = replace(my_string, ',?')
result = list(set(my_string.split()))
If you need to preserve the order the words appear in:
import string
from collections import OrderedDict
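def unique_words(text):
    # Map every punctuation character to None so translate() deletes it
    without_punctuation = text.translate({ord(c): None for c in string.punctuation})
    # OrderedDict keys are unique and remember insertion order
    words_dict = OrderedDict((k, None) for k in without_punctuation.split())
    return list(words_dict.keys())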
unique_words("How much wood would a woodchuck chuck, if a woodchuck would chuck wood?")
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
I use OrderedDict because there does not appear to be an ordered set in the Python standard library.
Edit:
To make the word list case-insensitive, one could make the dictionary keys lowercase: (k.lower(), None) for k in ...
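A minimal sketch of the case-insensitive idea, with one variation: the lowercased word is used only as the lookup key, while the value keeps the spelling of the first occurrence. The function name `unique_words_casefold` is hypothetical, and the code assumes Python 3.7 or later, where a plain dict keeps insertion order.

```python
import string


def unique_words_casefold(text):
    """Unique words in order of first appearance, ignoring case."""
    # Delete all ASCII punctuation characters
    cleaned = text.translate(str.maketrans('', '', string.punctuation))
    first_seen = {}  # lowercase word -> spelling of its first occurrence
    for word in cleaned.split():
        # setdefault stores the word only if its lowercase form is new
        first_seen.setdefault(word.lower(), word)
    return list(first_seen.values())


print(unique_words_casefold("Wood wood WOOD would Would"))
# ['Wood', 'would']
```

Returning `list(first_seen)` instead would give the lowercased words, which matches the lowercase-keys suggestion more literally.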
It should be sufficient to find all of the words, and then filter out the duplicates.
import re
words = re.findall('[a-zA-Z]+', my_string)
# Keep a word only if it does not appear earlier in the list
words_list = [w for idx, w in enumerate(words) if w not in words[:idx]]