
Pyspark Luigi multiple workers issue

I want to load multiple files into Spark DataFrames in parallel using a Luigi workflow and store them in a dictionary. Once all the files are loaded, I want to access these DataFrames from the dictionary in the main task and do further processing. This works when I run Luigi with one worker; when I run Luigi with more than one worker, the dictionary is empty in the main task's run method.

Any suggestions would be helpful.

    import luigi
    from luigi import LocalTarget
    
    from pyspark import SQLContext
    
    from src.etl.SparkAbstract import SparkAbstract
    from src.util.getSpark import get_spark_session
    from src.util import getSpark, read_json
    import configparser as cp
    import datetime
    from src.input.InputCSVFileComponent import InputCSVFile
    import os
    from src.etl.Component import ComponentInfo
    
    class fileloadTask(luigi.Task):
    
        compinfo = luigi.Parameter()
    
        def output(self):
            return luigi.LocalTarget("src/workflow_output/"+str(datetime.date.today().isoformat() )+"-"+ str(self.compinfo.id)+".csv")
    
        def run(self):
    
            a = InputCSVFile(self.compinfo)  # loads the file into a Spark DataFrame and stores it in the shared dictionary (SparkAbstract.mapDf)
            a.execute()
            with self.output().open('w') as f:
                f.write("done")
    
    class EnqueueTask(luigi.WrapperTask):
        compinfo = read_json.read_json_config('path to json file')
    
        def requires(self):
            folders = [
                comp.id for comp in list(self.compinfo) if comp.component_type == 'INPUTFILE'
            ]
            print(folders)
            newcominfo = []
            for index, objid in enumerate(folders):
                newcominfo.append(self.compinfo[index])
    
            for i in newcominfo:
                print(f" in compingo..{i.id}")
    
            callmethod = [fileloadTask(compinfo) for compinfo in newcominfo]
            print(callmethod)
    
            return callmethod
    
    class MainTask(luigi.WrapperTask):
    
        def requires(self):
            return EnqueueTask()
    
        def output(self):
            return luigi.LocalTarget("src/workflow_output/"+str(datetime.date.today().isoformat() )+"-"+ "maintask"+".csv")
    
        def run(self):
            print(f"printing mapdf..{SparkAbstract.mapDf}")
            res = not SparkAbstract.mapDf
            print("Is dictionary empty ? : " + str(res)) ####-------------> this is empty when workers > 1 ################
            for key, value in SparkAbstract.mapDf.items():
                print("prinitng from dict")
                print(key, value.show(10))
    
            with self.output().open('w') as f:
                f.write("done")
    
    """
    entry point for spark application
    """
    if __name__ == "__main__":
        luigi.build([MainTask()],workers=2,local_scheduler=True)

Each worker runs in its own process. That means workers can't share Python objects (in this case, the dictionary in which you put the results).
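You can reproduce the effect without Luigi or Spark. In this minimal sketch (the `shared` dict and `load` function are made up for illustration), a module-level dictionary filled in worker processes stays empty in the parent process, which is exactly what happens to SparkAbstract.mapDf with workers > 1:

    import multiprocessing as mp

    # Stand-in for SparkAbstract.mapDf: a module-level dict.
    shared = {}

    def load(key):
        # Runs in a separate worker process, so it fills that process's own copy.
        shared[key] = f"dataframe for {key}"

    if __name__ == "__main__":
        with mp.Pool(2) as pool:
            pool.map(load, ["a", "b"])
        print(shared)  # prints {} -- the parent never sees the workers' updates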

Generally speaking, Luigi is best at orchestrating tasks with side effects (like writing to files, etc.).
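Applied to your example, that means each fileloadTask should write the DataFrame it loads to disk (e.g. Parquet) as its output, and the downstream task should read those files back instead of looking at an in-memory dictionary. Here is a rough sketch under some assumptions: each component describes a CSV file reachable via something like compinfo.path (that attribute name is illustrative, not from your code), and it reuses your get_spark_session / read_json_config helpers.

    import datetime
    import luigi
    from src.util import read_json
    from src.util.getSpark import get_spark_session

    class fileloadTask(luigi.Task):
        compinfo = luigi.Parameter()

        def output(self):
            # The output is the data itself, written somewhere every process can see.
            return luigi.LocalTarget(
                "src/workflow_output/" + datetime.date.today().isoformat()
                + "-" + str(self.compinfo.id) + ".parquet")

        def run(self):
            spark = get_spark_session()
            # Assumption: the component record exposes the CSV path; adapt to InputCSVFile.
            df = spark.read.csv(self.compinfo.path, header=True)
            df.write.mode("overwrite").parquet(self.output().path)

    class MainTask(luigi.Task):
        def requires(self):
            comps = read_json.read_json_config('path to json file')
            return [fileloadTask(c) for c in comps if c.component_type == 'INPUTFILE']

        def output(self):
            return luigi.LocalTarget(
                "src/workflow_output/" + datetime.date.today().isoformat() + "-maintask.csv")

        def run(self):
            spark = get_spark_session()
            # self.input() yields one target per upstream task, whatever the worker count.
            for target in self.input():
                spark.read.parquet(target.path).show(10)
            with self.output().open('w') as f:
                f.write("done")

Note that MainTask requires the load tasks directly here, so self.input() points straight at their Parquet outputs. With this layout you can run luigi.build([MainTask()], workers=2, local_scheduler=True) and every file is still visible to MainTask, because the hand-off happens through the filesystem rather than through process memory.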

If you are trying to parallelise tasks that load data in memory, I'd recommend using dask instead of Luigi.
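For completeness, a minimal dask.delayed sketch (using pandas instead of Spark, with made-up file paths) that loads several files in parallel inside a single process and keeps the resulting DataFrames in one dictionary:

    import dask
    import pandas as pd

    paths = {"sales": "data/sales.csv", "users": "data/users.csv"}  # hypothetical files

    # One lazy read per file; dask.compute runs them in parallel in this process.
    lazy = {name: dask.delayed(pd.read_csv)(path) for name, path in paths.items()}
    frames = dict(zip(lazy.keys(), dask.compute(*lazy.values())))

    for name, df in frames.items():
        print(name, df.shape)

Because everything runs in one process, the dictionary of DataFrames stays accessible afterwards, unlike the multi-worker Luigi setup.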
