
Storing the results of a for loop in a list in Python

This is a Python program written in a PySpark IPython notebook. I am trying to count how many times each word in the list 'names' occurs in each RDD (each RDD can be thought of as a file) using a for loop. I want to store a word's per-file counts in a list named after that word.

For example, suppose the count of the word harry is 1214 in the 1st RDD, 1506 in the 2nd RDD, and so on. I want to create a list harryList = [1214, 1506, 1825, 2933, 3748, 2617, 2887]

The list of names is dynamic.

names = ['harry', 'hermione','ron','hagrid']
rdds = [hp1RDD,hp2RDD,hp3RDD,hp4RDD,hp5RDD,hp6RDD,hp7RDD]

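for n in names:
    # Collect the count of word n in each RDD
    a = []

    for x in rdds:
        # Split each line into words, keep only exact matches of n, count them
        a.append(x.flatMap(lambda line: line.split(" ")).filter(lambda word: word==n).count())

    print(a)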

With the code above I can print the contents of the list, but I cannot save it in the way shown above.

If you don't mind words like hagrid's being counted separately from hagrid, collections.Counter will help. Instead of creating one variable per name, the example below stores each list in a dict keyed by the name:

from collections import Counter

hp1RDD = "harry potter has a girlfriend who's name is hermione granger and a friend called ron. harry has an uncle who's name is hagrid. hagrid is a big guy"
hp2RDD = "harry potter is the best movie I've ever saw. hermione is very beautfiful"

names = ['harry', 'hermione','ron','hagrid']
rdds = [hp1RDD, hp2RDD]
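# Note: rdds holds plain strings here, standing in for the real RDDs
results = dict()

for name in names:
    tmp_list = list()

    for rdd in rdds:
        # Count every whitespace-separated token in this text
        count = Counter(rdd.split())
        # A Counter returns 0 for words that never appear
        tmp_list.append(count[name])
    # Store the per-text counts under the name itself
    results[name] = tmp_list

print(results)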
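
The loop above rebuilds the Counter once per name for every text. As a minimal sketch, assuming the texts are plain strings as in this answer (not real Spark RDDs) and using made-up sample sentences, each text can be counted once and the names looked up afterwards. The names texts and counters are hypothetical and only for illustration:

```python
from collections import Counter

# Hypothetical stand-in strings; these are not real Spark RDDs.
texts = [
    "harry potter has a friend called ron and harry knows hagrid",
    "hermione and harry visit hagrid",
]
names = ['harry', 'hermione', 'ron', 'hagrid']

# Count every whitespace-separated word once per text.
counters = [Counter(text.split()) for text in texts]

# Map each name to its list of per-text counts.
# A Counter returns 0 for words that never appear.
results = {name: [c[name] for c in counters] for name in names}

print(results['harry'])  # prints [2, 1]
```

Looking up results['harry'] then plays the role of the harryList variable from the question, and it keeps working when the list of names changes.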

You could also make the count case-insensitive by lowercasing each word with lower():

count = Counter([x.lower() for x in rdd.split()])
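
Splitting on whitespace still leaves punctuation attached, so tokens such as "ron." or "hagrid's" do not match the bare names. If that matters, one possible sketch (an assumption on my part, not part of the approach above) is to extract runs of letters with a regular expression before counting. The helper count_names and the sample sentence are hypothetical:

```python
import re
from collections import Counter

def count_names(text, names):
    """Count each name in text, ignoring case and punctuation."""
    # Lowercase, then keep only runs of ASCII letters;
    # "Hagrid's" becomes "hagrid" and "s", and "Ron." becomes "ron".
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {name: counts[name] for name in names}

# Made-up sample text for illustration only.
sample = "Harry met Ron. Hagrid's hut is where harry and ron meet Hagrid."
print(count_names(sample, ['harry', 'ron', 'hagrid']))
```

Note that this pattern only handles unaccented ASCII letters, and it counts the possessive hagrid's as an occurrence of hagrid, which may or may not be what you want.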

