
MapReduce-style Python code fails with the error 'string index out of range'

My data looks like:

 1 1.45
 1 1.153
 2 2.179
 2 2.206
 2 2.59
 2 2.111
 3 3.201
 3 3.175
 4 4.228
 4 4.161
 4 4.213

The output I want is:

 1  2  (1 occurs 2 times)
 2  4
 3  2
 4  3

For this I run the following code:

SubPatent2count = {}
for line in data.split('\n'):
    for num in line.split('\t'):
        Mapper_data = ["%s\t%d" % (num[0], 1) ]
        for line in Mapper_data:
            Sub_Patent,count = line.strip().split('\t',1)
            try:
                count = int(count)
            except ValueError:
                continue

            try:
                SubPatent2count[Sub_Patent] = SubPatent2count[Sub_Patent]+count
            except:
                SubPatent2count[Sub_Patent] = count
for Sub_Patent in SubPatent2count.keys():
    print ('%s\t%s'% ( Sub_Patent,  SubPatent2count[Sub_Patent] ))

At the end I get this error:

     3    for num in line.split('\t'):
     4         #print(num[0])
----> 5         Mapper_data = ["%s\t%d" % (num[0], 1) ]
     6         #print(Mapper_data)
     7         for line in Mapper_data:

IndexError: string index out of range

If you have any idea how I can deal with this error, please help. Thank you!

num is probably an empty string, which is why num[0] raises an index out of range error. Another possibility is that the numbers in each line are in fact separated by spaces, not by tabs.
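
To see where the empty string can come from, here is a minimal sketch. The sample string and the helper name first_chars are made up for illustration, and it assumes the data ends with a newline, which the question does not confirm:

```python
# Hypothetical sample: two space-separated rows plus a trailing newline.
data = "1 1.45\n1 1.153\n"

# Indexing an empty string is what raises the error from the question.
try:
    ''[0]
except IndexError as exc:
    print(exc)

def first_chars(text, sep):
    """Return the first character of every field, or None for empty fields."""
    result = []
    for line in text.split('\n'):
        for field in line.split(sep):
            # Guard against empty fields instead of indexing them blindly.
            result.append(field[0] if field else None)
    return result

# The trailing newline yields a final empty line, hence an empty field.
print(first_chars(data, '\t'))
```

The None at the end of the printed list marks the empty field that would have crashed num[0]. Note also that splitting on '\t' leaves each space-separated row as a single field.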

Anyway, your code seems a little strange. For example, you encode the data as a string in a one-element list ( Mapper_data ) and then decode it again to process it. That is really not necessary and you should avoid it.

Try this code:

from collections import Counter

# split() with no argument tolerates leading spaces; blank lines are skipped
decoded_data = [int(l.split()[0]) for l in data.split('\n') if l.strip()]
SubPatent2count = Counter(decoded_data)

for k in SubPatent2count:
    print(k, SubPatent2count[k])

Just suggesting another approach: have you tried a list comprehension plus groupby from itertools?

from itertools import groupby

print([(key, len(list(group))) for key, group in groupby([x.split(' ')[0] for x in data.split('\n')])])
# [x.split(' ')[0] for x in data.split('\n')] builds the list of leading numbers,
# and groupby groups consecutive equal values so that they can be counted

Or if you want that exact output:

from itertools import groupby

mylist = [(key, len(list(group))) for key, group in groupby([x.split(' ')[0] for x in data.split('\n')])]


for key, repetition in mylist:
    print(key, repetition)
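
One caveat: groupby only merges adjacent equal keys, so it gives the right counts here only because the data is already ordered by its first column. If that is not guaranteed, sorting the keys first is a simple fix. A minimal sketch, where the function name count_keys and the sample string are made up for illustration:

```python
from itertools import groupby

def count_keys(text):
    """Count how often each leading field occurs, regardless of row order."""
    # split() with no argument ignores extra spaces; blank lines are skipped.
    keys = [line.split()[0] for line in text.split('\n') if line.strip()]
    # Sorting puts equal keys next to each other, which groupby requires.
    return [(key, len(list(group))) for key, group in groupby(sorted(keys))]

# Hypothetical unsorted sample in the same format as the question's data.
sample = "2 2.179\n1 1.45\n2 2.206\n1 1.153\n"
for key, repetition in count_keys(sample):
    print(key, repetition)
```

The keys are strings, so they sort lexicographically ('10' comes before '2'). Convert them with int() first if numeric order matters.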

Thank you everybody, your suggestions really helped me. I changed my code as follows:

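SubPatent2count = {}
for line in data.split('\n'):
    # Mapper step: emit "<key>\o1" for the first field of each line
    # (raw strings keep the literal backslash-o separator without a warning)
    Mapper_data = [r"%s\o%d" % (line.split(' ')[0], 1)]
    # Reducer step: decode each record and add up the counts per key
    for record in Mapper_data:
        Sub_Patent, count = record.strip().split(r'\o', 1)
        try:
            count = int(count)
        except ValueError:
            continue

        try:
            SubPatent2count[Sub_Patent] = SubPatent2count[Sub_Patent] + count
        except KeyError:
            # first occurrence of this key
            SubPatent2count[Sub_Patent] = count
for Sub_Patent in SubPatent2count.keys():
    print('%s\t%s' % (Sub_Patent, SubPatent2count[Sub_Patent]))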

And it gives the following result:

1  2  (1 occurs 2 times)
2  4
3  2
4  3
