
How to create a list that contains only the first instance of each word found in a string (excluding punctuation, newlines, etc.)

Alright all you genius programmers and developers you... I could really use some help on this one, please.

I'm currently taking the 'Python for Everybody Specialization', offered through Coursera (https://www.coursera.org/specializations/python), and I'm stuck on an assignment.

I cannot figure out how to create a list that contains only the first instance of each word found in a string:

Example string:

my_string = ("How much wood would a woodchuck chuck,\n"
             "if a woodchuck would chuck wood?")

Desired list:

words_list = ['How', 'much', 'wood', 'would',
              'a', 'woodchuck', 'chuck', 'if']

Thank you all for your time, consideration, and contributions!

You can keep a list of the words that have already been seen and strip the non-alphabetic characters from each word:

my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"

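final_l = []

for word in my_string.split():
    # Keep only the alphabetic characters, e.g. 'chuck,' -> 'chuck'
    word = ''.join(i for i in word if i.isalpha())
    # Skip tokens that were pure punctuation and words already recorded
    if word and word not in final_l:
        final_l.append(word)

print(final_l)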

Output:

['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
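
Checking membership in a list scans it from the start, so the loop above gets slower as the number of distinct words grows. A minimal sketch of the same idea that tracks seen words in a set while a list keeps the first-appearance order (the function name first_occurrences is made up for illustration):

```python
def first_occurrences(text):
    """Return each word of text once, in order of first appearance."""
    seen = set()      # fast membership test
    ordered = []      # keeps first-appearance order
    for token in text.split():
        # Drop punctuation attached to the token, e.g. 'wood?' -> 'wood'
        word = ''.join(ch for ch in token if ch.isalpha())
        if word and word not in seen:
            seen.add(word)
            ordered.append(word)
    return ordered


my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
print(first_occurrences(my_string))
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
```

For a sentence this short the difference is negligible; the set only pays off on longer texts.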

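This can be accomplished in two steps: first remove the punctuation, then add the words to a set, which removes duplicates. Note that a set is unordered, so the words will not necessarily come back in the order in which they first appear.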

Python 3:

from string import punctuation #  This is a string of all ascii punctuation characters

trans = str.maketrans('', '', punctuation)

text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(trans)

words = set(text.split())

Python 2:

from string import punctuation #  This is a string of all ascii punctuation characters

text = 'How much wood would a woodchuck chuck, if a woodchuck would chuck wood?'.translate(None, punctuation)

words = set(text.split())

You can use the re module and convert the result to a set in order to remove duplicates:

>>> import re

>>> my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"
>>> words_list = re.findall(r'\w+', my_string)  # Find all runs of word characters (punctuation is skipped)
>>> words_list_unique = sorted(set(words_list), key=words_list.index)  # De-duplicate with a set, then sort by first occurrence to restore the original order

>>> print(words_list_unique)
['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']

Explanation:

  • \w matches a single word character (a letter, digit, or underscore); \w+ matches one or more of them in a row, i.e. a word.
  • So re.findall(r'\w+', my_string) finds all the words in my_string.
  • A set is a collection of unique elements, so converting the list returned by re.findall() to a set removes the duplicates.
  • sorted() then turns the set back into a list.
  • EDIT - Because sets are unordered collections, pass key=words_list.index to sorted() if you want to preserve the order in which the words first appear.
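
To see what \w+ does and does not treat as a word, here is a small sketch on a made-up sample string (not taken from the question). Digits and underscores count as word characters, while an apostrophe splits a word in two; the second pattern is just one possible alternative, not part of the answer above:

```python
import re

sample = "can't stop_now 42 times!"   # hypothetical input for illustration
tokens = re.findall(r'\w+', sample)
print(tokens)
# ['can', 't', 'stop_now', '42', 'times']

# Restricting the pattern to letters and apostrophes is one possible alternative
letters_only = re.findall(r"[A-Za-z']+", sample)
print(letters_only)
# ["can't", 'stop', 'now', 'times']
```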

Since all instances of a word are identical, I'm going to take the question to mean that you want a unique list of words that appear in the string. Probably the easiest way to do this is:

import re
non_unique_words = re.findall(r'\w+', my_string)
unique_words = list(set(non_unique_words))

re.findall returns every word in the string, and converting to a set and back to a list makes the results unique, although the original word order is not preserved.

Try this:

my_string = "How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"

def replace(word, block):
    # Remove every character listed in block from word
    for i in block:
        word = word.replace(i, '')
    return word

my_string = replace(my_string, ',?')
# A set removes duplicates but does not keep the original word order
result = list(set(my_string.split()))

If you need to preserve the order the words appear in:

import string
from collections import OrderedDict

def unique_words(text):
    without_punctuation = text.translate({ord(c): None for c in string.punctuation})
    words_dict = OrderedDict((k, None) for k in without_punctuation.split())
    return list(words_dict.keys())

unique_words("How much wood would a woodchuck chuck, if a woodchuck would chuck wood?")
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']

I use OrderedDict because there does not appear to be an ordered set in the Python standard library.
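
On Python 3.7 and later, the built-in dict also preserves insertion order, so a plain dict can stand in for OrderedDict here. A minimal sketch under that assumption (the name unique_words_dict is hypothetical):

```python
import string


def unique_words_dict(text):
    """Unique words in first-appearance order; relies on dict ordering (Python 3.7+)."""
    # Build a table that deletes every ASCII punctuation character
    table = str.maketrans('', '', string.punctuation)
    words = text.translate(table).split()
    # dict.fromkeys keeps the first occurrence of each key, in insertion order
    return list(dict.fromkeys(words))


print(unique_words_dict("How much wood would a woodchuck chuck, if a woodchuck would chuck wood?"))
# ['How', 'much', 'wood', 'would', 'a', 'woodchuck', 'chuck', 'if']
```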

Edit:

To make the word list case-insensitive, one could make the dictionary keys lowercase: (k.lower(), None) for k in ...
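
Lowercasing the keys also lowercases the returned words ('How' comes back as 'how'). If the first spelling should be kept, one option is to compare on a lowercased key but store the original word. A sketch of that variation, assuming Python 3.7+ dict ordering (unique_words_ci is a made-up name, and the mixed-case input is invented for the demonstration):

```python
import string


def unique_words_ci(text):
    """Case-insensitive de-duplication that keeps each word's first spelling."""
    table = str.maketrans('', '', string.punctuation)
    first_spelling = {}   # lowercased key -> word as first seen
    for word in text.translate(table).split():
        # setdefault only stores the word if the key is not present yet
        first_spelling.setdefault(word.lower(), word)
    return list(first_spelling.values())


print(unique_words_ci("How much wood would a Woodchuck chuck, if a woodchuck WOULD chuck wood?"))
# ['How', 'much', 'wood', 'would', 'a', 'Woodchuck', 'chuck', 'if']
```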

It should be sufficient to find all of the words, and then filter out the duplicates.

import re

words = re.findall(r'[a-zA-Z]+', my_string)
# Keep a word only if it does not occur earlier in the list
words_list = [w for idx, w in enumerate(words) if w not in words[:idx]]


 