Calling mapPartitions on a class method in Python

I'm fairly new to Python and to Spark but let me see if I can explain what I am trying to do.

I have a bunch of different types of pages that I want to process. I created a base class for all the common attributes of those pages, and each page-specific class inherits from that base class. The idea is that the Spark runner can do the exact same thing for all pages by changing just the page type when it is called.

Runner

def CreatePage(pageType):
   if pageType == "Foo":
      return PageFoo(pageType)
   elif pageType == "Bar":
      return PageBar(pageType)
   raise ValueError("Unknown page type: %s" % pageType)

def Main(pageType):
   page = CreatePage(pageType)
   pageList_rdd = sc.parallelize(page.GetPageList())
   # mapPartitions hands the function an iterator over the whole
   # partition, so CreatePage runs once per partition.
   return pageList_rdd.mapPartitions(lambda pageNumbers: CreatePage(pageType).ProcessPage(pageNumbers))

print(Main("Foo"))

PageBaseClass.py

class PageBase(object):
   def __init__(self, pageType):
      self.pageType = pageType
      self.dbConnection = None

   def GetDBConnection(self):
      if self.dbConnection is None:
         # Set up a db connection so we can share it amongst all nodes.
         self.dbConnection = DataUtils.MySQL.GetDBConnection()
      return self.dbConnection

   def ProcessPage(self, pageNumbers):
      raise NotImplementedError()

PageFoo.py

class PageFoo(PageBase):
   def __init__(self, pageType):
      super(PageFoo, self).__init__(pageType)
      self.dbConnection = self.GetDBConnection()

   def ProcessPage(self, pageNumbers):
      cursor = self.dbConnection.cursor()
      cursor.execute("SELECT SOMETHING")
      # other processing; must return an iterable for mapPartitions

There is a lot of other page-specific functionality that I am omitting for brevity, but the idea is that I'd like to keep all the logic of how to process a page in the page class, and be able to share resources like a db connection and an S3 bucket.

I know that the way I have it right now, it creates a new Page object for each partition of the RDD. Is there a way to do this so that only one object is created? Is there a better pattern for this? Thanks!

A few suggestions:

  • Don't create connections directly. Use a connection pool instead (since each executor runs in a separate process, a pool size of one is just fine), and make sure that connections are automatically closed on timeout.
  • Use the Borg pattern to store the pool, and adjust your code to retrieve connections through it (see the sketch after this list).
  • You won't be able to share connections between nodes, or even within a single node (see the comment about separate processes). The best guarantee you can get is a single connection per partition (or a number of partitions, with interpreter reuse).
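
A minimal sketch of what that could look like, assuming the DataUtils.MySQL.GetDBConnection() factory from the question (the ConnectionPool name and its get_connection method are illustrative, not an existing API):

import DataUtils  # the question's own helper module

class ConnectionPool(object):
   # Borg pattern: every instance shares the same __dict__, so all
   # Page objects created inside one executor process end up reusing
   # the same state: here, a "pool" of exactly one connection.
   _shared_state = {}

   def __init__(self):
      self.__dict__ = self._shared_state

   def get_connection(self):
      # Each executor is a separate process, so nothing else
      # competes for this single connection.
      if getattr(self, "_connection", None) is None:
         self._connection = DataUtils.MySQL.GetDBConnection()
      return self._connection

PageBase.GetDBConnection then reduces to:

   def GetDBConnection(self):
      return ConnectionPool().get_connection()

Because the shared state lives in the executor's Python interpreter, every partition that interpreter processes reuses the same connection, which is the per-partition guarantee from the last bullet. The sketch omits the close-on-timeout part; a real pooling library (SQLAlchemy's engine pool, for example) handles that for you.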
