
How to count items that share multiple fields with mrjob in Python?

I'm trying to write a map-reduce job in Python with mrjob. I have a file that contains product information, and I want to count the products that belong to the same category and have the same version, producing output of the form <category, {count, version}>.

My file information is as follows:

  product_name   rate   category   id  version
       a           "3.0"   cat1       1     1
       b           "2.0"   cat1       2     1
       c           "4.0"   cat1       3     4
       d           "1.0"   cat2       3     2
       .             .      .         .     .
       .             .      .         .     .
       .             .      .         .     .

For example, cat1 has two products with version 1, so the expected output is:

   <cat1, {2, 1} >

I wrote this code, but I don't know how to count them in the combiner function.

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRFrequencyCount(MRJob):

    def steps(self):
        return [
            MRStep(
                mapper=self.mapper_extract,
                combiner=self.combine_counts,
            )
        ]

    def mapper_extract(self, _, line):
        (product_name, rate, category, id, version) = line.split('*')
        yield category, (1, version)

    def combine_counts(self, category, countAndVersion):
        yield category, sum(countAndVersion)

if __name__ == '__main__':
    MRFrequencyCount.run()

The issue is the key you are emitting. Since you are essentially grouping by category and version, you should emit both together as a composite key from the mapper, so that the combiner and reducer receive counts grouped by that pair. The reducer can then unpack the composite key and emit the data in the desired format.

from mrjob.job import MRJob
from mrjob.step import MRStep

class MRFrequencyCount(MRJob):

    def steps(self):
        return [
            MRStep(
                mapper=self.mapper_extract,
                combiner=self.combine_counts,
                reducer=self.reduce_counts
            )
        ]

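    def mapper_extract(self, _, line):
        # Each line has five fields separated by '*'.
        (product_name, rate, category, id, version) = line.split('*')
        # Emit a composite key so that grouping happens on category AND version.
        yield (category, version), 1

    def combine_counts(self, cat_version, counts):
        # Pre-aggregate partial counts for each (category, version) pair.
        yield cat_version, sum(counts)

    def reduce_counts(self, cat_version, counts):
        # Unpack the composite key and emit <category, (count, version)>.
        category, version = cat_version
        final = sum(counts)
        yield category, (final, version)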

if __name__ == '__main__':
    MRFrequencyCount.run()

Sample input (fields separated by *):

a*3.0*cat1*1*1
b*2.0*cat1*2*1
c*4.0*cat1*3*4
d*1.0*cat2*3*2

Output:

"cat2"  [1, "2"]
"cat1"  [1, "4"]
"cat1"  [2, "1"]
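
To sanity-check the composite-key idea without running mrjob at all, here is a minimal pure-Python sketch of the same grouping. It is not part of the job above: the function name count_by_category_and_version and the variable names are hypothetical, and it assumes the same '*'-separated, five-field lines as the sample input.

```python
from collections import Counter


def count_by_category_and_version(lines, sep="*"):
    """Count products per (category, version) pair (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        product_name, rate, category, product_id, version = line.split(sep)
        # Same composite key that the mapper emits: (category, version).
        counts[(category, version)] += 1
    # Reshape to the reducer's output form: category -> (count, version).
    return sorted(
        (category, (n, version)) for (category, version), n in counts.items()
    )


sample = [
    "a*3.0*cat1*1*1",
    "b*2.0*cat1*2*1",
    "c*4.0*cat1*3*4",
    "d*1.0*cat2*3*2",
]

results = count_by_category_and_version(sample)
for category, (n, version) in results:
    print(category, n, version)
```

The result contains the same three (category, count, version) combinations as the mrjob output, only sorted. Version stays a string here because it is never converted after splitting the line, which matches the quoted "1", "2" and "4" in the job's output.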
