
Flatten list of lists with a twist

I have the following data structure:

 a= [
       [u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this', 
        u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https', 
        u':', u'//t.co/5k8PUInmqK'],
       [u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband', 
        u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#',
        u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC', 
         u'#', u'NY', u'#',
        u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']
     ]

As I see it, this is a list of lists of strings, except that it is enclosed in a pair of [ ] rather than ( ). The outer pair of [ ] is generated by:

a = [nltk.tokenize.word_tokenize(tweetL) for tweetL in tweetList]

Ultimately, I need to flatten this structure into a list of strings and run some regex and counting operations on the words, but the outer pair of [ ] is preventing this.

I tried to use:

list.extend()

and

ll = len(a)
for n in xrange(ll):
    print 'list - ', a[n], 'number = ', n

but still get the same result:

list - [ number =  1
list - u number =  2
list - ' number =  3
list - h number =  4
list - a number =  5
list - p number =  6
list - p number =  7

As you can see, the code treats every character of the string as a list element rather than treating each whole string as an element.

What can be done efficiently?

I also tried this:

flat_list = [i for sublist in a for i in sublist] 
for i in flat_list:
    print 'element - ', i

result (partial):

element -  h
element -  a
element -  p
element -  p
element -  y
element -   
element -  t

I am not sure I fully understand your question, so let me know if I am way off. Based on the input you provided, you have a list of lists. If that is the structure you always have, you can take out what you need with

a = a[0]

That gives you a single list (the first inner list).

Then you can iterate over it:

for i in a:
    print(i)

However, if that is just a sample, and you in fact have something like this:

[[],[],[],[]]

And you want to completely flatten that to a single list, then the comprehension you want to use is this:

flat_list = [i for sublist in a for i in sublist] 

For an input such as [[1], [2], [3], [4]], this gives you a single list: [1, 2, 3, 4]

Then you iterate over it:

for i in flat_list:
    print(i)

Alternatively, if you want to print the index as well, you can do this:

for i, v in enumerate(flat_list):
    print("{}: {}".format(i, v))

A final comment about your use of extend.

As the help for the extend method states:

extend(...)
    L.extend(iterable) -- extend list by appending elements from the iterable

So its usage "extends" the list, as in this example:

a = [1, 2, 3]
b = [4, 5, 6]
a.extend(b)
# a will now be [1, 2, 3, 4, 5, 6]
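
To connect extend back to the flattening problem, here is a minimal sketch that builds the flat list by extending an initially empty list with each inner list. The names nested and flat are illustrative, and the data is a shortened stand-in for the tweets in the question.

```python
# Shortened stand-in for the tokenized tweets (illustrative data).
nested = [[u'happy', u'thursday'], [u'RT', u'@', u'MayorKev']]

flat = []
for sublist in nested:
    # extend appends each element of sublist, so no nesting is introduced
    flat.extend(sublist)

print(flat)
```

Using append instead of extend here would add each inner list as a single element and leave the structure nested.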

Running your input:

a = [[u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this', u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https', u':', u'//t.co/5k8PUInmqK'], [u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband', u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#', u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC', u'#', u'NY', u'#', u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']]

through my code yields this output:

0: happy
1: thursday
2: from
3: my
4: big
5: sweater
6: and
7: this
8: ART
9: @
10: East
11: Village
12: ,
13: Manhattan
14: https
15: :
16: //t.co/5k8PUInmqK
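
If a loop still prints one character at a time, as in the output shown in the question, one possible explanation is that the variable holds the text representation of the list (a string) rather than the list itself. This is an assumption, since the post does not show how a reached that loop. The sketch below shows the difference and uses ast.literal_eval from the standard library to turn such a string back into a list; the names as_text, first_chars and parsed are illustrative.

```python
import ast

# Assumption: the data arrived as a string that merely looks like a list.
as_text = "[[u'happy', u'thursday'], [u'RT', u'@']]"

# Iterating over a string yields characters, not words.
first_chars = [ch for ch in as_text][:4]

# literal_eval parses Python literals into real objects without running code.
parsed = ast.literal_eval(as_text)
flat = [token for sublist in parsed for token in sublist]

print(first_chars)
print(flat)
```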
a= [[u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this', u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https', u':', u'//t.co/5k8PUInmqK'], [u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband', u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#', u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC', u'#', u'NY', u'#', u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']]

from itertools import chain

flat_a = list(chain.from_iterable(a))
print(flat_a)

['happy', 'thursday', 'from', 'my', 'big', 'sweater', 'and', 'this', 'ART', '@', 'East', 'Village', ',', 'Manhattan', 'https', ':', '//t.co/5k8PUInmqK', 'RT', '@', 'MayorKev', ':', 'IM', 'SO', 'HYPEE', '@', 'calloutband', '@', 'FreakLikeBex', '#', 'Callout', '#', 'TheBitterEnd', '#', 'Manhattan', '#', 'Music', '#', 'LiveMusic', '#', 'NYC', '#', 'NY', '#', 'Jersey', '#', 'NJ', 'http', ':', '//t.co/0…']
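
Since the question mentions regex and counting as the next step, here is a minimal sketch of how a flattened list could feed into that. The filter pattern (keep tokens made only of ASCII letters) and the names word_pattern, words and counts are assumptions chosen for illustration, not part of the original answer, and the data is a shortened stand-in.

```python
import re
from collections import Counter
from itertools import chain

# Shortened stand-in for the tokenized tweets (illustrative data).
a = [[u'happy', u'@', u'Manhattan'], [u'RT', u'#', u'Manhattan', u'#', u'NYC']]

flat_a = list(chain.from_iterable(a))

# Assumed filter: keep tokens consisting only of ASCII letters.
word_pattern = re.compile(r'^[A-Za-z]+$')
words = [token for token in flat_a if word_pattern.match(token)]

# Counter maps each remaining token to its number of occurrences.
counts = Counter(words)
print(counts.most_common(1))
```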
a= [[u'happy', u'thursday', u'from', u'my', u'big', u'sweater', u'and', u'this', u'ART', u'@', u'East', u'Village', u',', u'Manhattan', u'https', u':', u'//t.co/5k8PUInmqK'], [u'RT', u'@', u'MayorKev', u':', u'IM', u'SO', u'HYPEE', u'@', u'calloutband', u'@', u'FreakLikeBex', u'#', u'Callout', u'#', u'TheBitterEnd', u'#', u'Manhattan', u'#', u'Music', u'#', u'LiveMusic', u'#', u'NYC', u'#', u'NY', u'#', u'Jersey', u'#', u'NJ', u'http', u':', u'//t.co/0\u2026']]
for L in a:
    for e in L:
        print "element "+e


element happy
element thursday
element from
element my
element big
element sweater
element and
element this
element ART
element @
element East

A nested list comprehension should solve your first problem.

a = [token for tweetL in tweetList for token in nltk.tokenize.word_tokenize(tweetL)]

This construct lets you collect elements produced by nested for loops. The for clauses are written in the same order as the equivalent nested loops: the outermost loop comes first, then the next one in, and so on, with the innermost loop last.

It might help to understand that this is equivalent to:

a = []
for tweetL in tweetList:
    for token in nltk.tokenize.word_tokenize(tweetL):
        a.append(token)
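
The equivalence can be checked with a small self-contained sketch. To keep it runnable without nltk, str.split stands in for nltk.tokenize.word_tokenize; that substitution, the sample tweets, and the names flat_comprehension and flat_loop are assumptions made for illustration.

```python
# Stand-in data; str.split replaces nltk.tokenize.word_tokenize here only
# so the example runs without nltk installed.
tweetList = [u'happy thursday', u'IM SO HYPEE']

# Nested comprehension: outer loop first, inner loop last.
flat_comprehension = [token for tweetL in tweetList for token in tweetL.split()]

# The same logic written as explicit nested loops.
flat_loop = []
for tweetL in tweetList:
    for token in tweetL.split():
        flat_loop.append(token)

print(flat_comprehension == flat_loop)
print(flat_comprehension)
```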

In Python 2, you can encode unicode strings as UTF-8. This converts them from the unicode type to the str type, which should solve the UnicodeEncodeError.

Example:

u'\u2713'.encode('utf-8')

For more information on Python 2 Unicode, you can read here: https://docs.python.org/2/howto/unicode.html
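
As a rough illustration of the conversion described above, the sketch below encodes the check mark character and decodes it again. The names check_mark, encoded and decoded are illustrative. The result of encode is a str in Python 2 and a bytes object in Python 3; the byte values are the same in both.

```python
check_mark = u'\u2713'

# Encoding turns the text into its UTF-8 byte sequence.
encoded = check_mark.encode('utf-8')

# Decoding with the same codec restores the original text.
decoded = encoded.decode('utf-8')

print(repr(encoded))
print(decoded == check_mark)
```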


 