
How to extract words from a list of lists and filter words by length?

Basically I want to do two things using Python: 1) make the result a list of words, not a list of lists, and 2) filter out words that are only 1 character long.


I have to extract words from a list of dictionaries, make the words lowercase, and filter them so that only words longer than 1 character are part of the resulting list. I have to use map() and a list comprehension, but I don't really know how to do that either. I am also required to use re.split() to split the words up and get rid of unwanted punctuation.

So far, I've been able to extract the relevant parts of the list of dictionaries, splitting the words up and making all the words lowercase. But what I'm getting is a list of lists whose elements are words.

I want the result to be just a list of words that are 2 characters or longer.

import re

def extract_tweets(some_list):
    tweetlist = []
    for each_tweet in some_list:
        text = each_tweet['text']
        lowercase = text.lower()
        tweetlist.append(lowercase)
    tweetwords = []
    for words in tweetlist:
        word = re.split(r'\W+', words)
        tweetwords.append(word)
    return tweetwords

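A simple list comprehension will help you. Since tweetwords is a list of lists, loop over both levels:

tweetwords = [word for words in tweetwords for word in words if len(word) > 1]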

For your function extract_tweets to work, it needs a list of dictionaries as its argument. So some_list looks something like this:

some_list = [
    {
        'text': "Hello world!"
    },
    {
        'text': "The sun is shinning, the sky is blue."
    },
]

The first loop extracts the lowercased texts, so the list is better named texts or text_list (instead of tweetlist). It gives you:

['hello world!', 'the sun is shinning, the sky is blue.']

To extract the words of a text, it is better to use findall instead of split: with split, you get empty strings whenever the text starts or ends with a non-word character, as in my examples.

To find all the words of a text, you can use:

words = re.findall(r'\w+', text)
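
As a quick illustration of the difference (a minimal sketch; the sample string and variable names are made up for the demonstration):

```python
import re

text = "hello world!"  # ends with a non-word character

# re.split cuts at every run of non-word characters, so the trailing "!"
# leaves an empty string at the end of the result.
split_words = re.split(r'\W+', text)
print(split_words)   # ['hello', 'world', '']

# re.findall returns only the matches themselves, so no empty strings appear.
found_words = re.findall(r'\w+', text)
print(found_words)   # ['hello', 'world']
```

If you must keep re.split(), the length filter (len(w) > 1) would also drop those empty strings, so both routes can end up with the same words.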

Note: the \w+ regex also matches digits and underscores. To avoid that, use the negated class [^\W\d_]+.
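
To see what the two patterns match, here is a small sketch on a made-up sample string (the string and variable names are assumptions for illustration only):

```python
import re

text = "room_42 has 3 big windows"

# \w+ treats letters, digits and underscores alike as word characters.
with_w = re.findall(r'\w+', text)
print(with_w)        # ['room_42', 'has', '3', 'big', 'windows']

# [^\W\d_]+ means "word characters that are neither digits nor underscores",
# which leaves letters only.
letters_only = re.findall(r'[^\W\d_]+', text)
print(letters_only)  # ['room', 'has', 'big', 'windows']
```

Note that a token such as room_42 is cut down to room, which may or may not be what you want for tweets containing user names or hashtags.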

The result of findall is a list of words. To keep only the words longer than 1 character, you can use filter with a function or a list comprehension with a condition:

words = list(filter(lambda w: len(w) > 1, words))
# or:
words = [w for w in words if len(w) > 1]

Here is the refactored code:

import re
import pprint


def extract_tweets(some_list):
    texts = []
    for each_tweet in some_list:
        text = each_tweet['text']
        lowercase = text.lower()
        texts.append(lowercase)
    tweet_words = []
    for text in texts:
        words = re.findall(r'[^\W\d_]+', text)
        words = [w for w in words if len(w) > 1]
        tweet_words.append(words)
    return tweet_words

With the following example…

some_list = [
    {
        'text': "Hello world!"
    },
    {
        'text': "The sun is shinning, the sky is blue."
    },
    {
        'text': "1, 2, 3, four"
    },
    {
        'text': "not a word"
    },
]

pprint.pprint(extract_tweets(some_list))

… you get:

[['hello', 'world'],
 ['the', 'sun', 'is', 'shinning', 'the', 'sky', 'is', 'blue'],
 ['four'],
 ['not', 'word']]

With extend instead of append, you get a flat list:

['hello',
 'world',
 'the',
 'sun',
 'is',
 'shinning',
 'the',
 'sky',
 'is',
 'blue',
 'four',
 'not',
 'word']
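
Since the question asks for map() and a list comprehension, the same flat result could also be sketched as below. The function name extract_words and the sample data are hypothetical, and the sketch uses re.findall as suggested above rather than the re.split() mentioned in the question:

```python
import re


def extract_words(tweets):
    """Return a flat list of lowercase words of two or more letters."""
    # map() lowercases the 'text' field of each dictionary (lazily).
    texts = map(lambda tweet: tweet['text'].lower(), tweets)
    # The nested comprehension flattens the per-tweet word lists and
    # keeps only the words longer than one character.
    return [
        word
        for text in texts
        for word in re.findall(r'[^\W\d_]+', text)
        if len(word) > 1
    ]


sample = [
    {'text': "Hello world!"},
    {'text': "1, 2, 3, four"},
    {'text': "not a word"},
]
flat_words = extract_words(sample)
print(flat_words)  # ['hello', 'world', 'four', 'not', 'word']
```

The two for clauses read left to right like nested loops: the outer one walks the texts, the inner one walks the words found in each text.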

