
Spark Structured Streaming - Empty dictionary on new batch

In my constructor I initialize an empty dictionary, and then in a UDF I update it with the new data that arrives in each batch.

My problem is that in every new batch the dictionary is empty again.

How can I bypass this, so that new batches have access to all the values I have already added to my dictionary?

import CharacteristicVector
import update_charecteristic_vector

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

class SomeClass(object):

    def __init__(self):
        self.grid_list = {}

    def run_stream(self):   

        def update_grid_list(grid):
            if grid not in self.grid_list:
                self.grid_list[grid] = CharacteristicVector()
            self.grid_list[grid] = update_charecteristic_vector(self.grid_list[grid])
            return self.grid_list[grid].Density
        # ...

        udf_update_grid_list = udf(update_grid_list, StringType())
        grids_dataframe = hashed.select(
            hashed.grid.alias('grid'),
            udf_update_grid_list(hashed.grid).alias('Density')
        )

        query = grids_dataframe.writeStream.format("console").start()
        query.awaitTermination()

Unfortunately, this code cannot work for multiple reasons. Even with a single batch, or in a batch application, it would work only if there is a single active Python worker process. Also, it is not possible in general to have global, synchronized state with support for both reads and writes.
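To see why the dictionary comes back empty: PySpark serializes the closure (including `self` and its `grid_list`) and ships a copy to each worker for every task, so workers mutate their own copies and the driver-side dictionary never changes. Below is a minimal sketch of that mechanism using plain `pickle` as a stand-in for what Spark does under the hood (the `SomeClass` here is a simplified, hypothetical version of the asker's class, with an int counter instead of `CharacteristicVector`):

```python
import pickle


class SomeClass:
    def __init__(self):
        self.grid_list = {}

    def update_grid_list(self, grid):
        # Read-modify-write on instance state, like the UDF in the question.
        self.grid_list[grid] = self.grid_list.get(grid, 0) + 1
        return self.grid_list[grid]


driver_side = SomeClass()

# Spark ships the closure to each worker by serializing it;
# this round-trip is roughly what happens to `self` per task:
worker_copy = pickle.loads(pickle.dumps(driver_side))
worker_copy.update_grid_list("g1")

# The worker mutated ITS copy; the driver's dictionary is untouched.
print(driver_side.grid_list)   # {}
print(worker_copy.grid_list)   # {'g1': 1}
```

With multiple workers (or multiple batches, each scheduling new tasks), every copy starts from whatever was serialized on the driver, which is why the state appears to reset.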

You should be able to use stateful transformations, but for now these are supported only in Java / Scala, and the interface is still experimental / evolving.

Depending on your requirements, you can try to use an in-memory data grid, a key-value store, or a distributed cache.
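The shape of that approach: instead of mutating a per-process Python dictionary, the UDF does a read-modify-write against a store that lives outside the workers, so the state survives across batches and is visible to every process. Here is a minimal sketch where a hypothetical `InMemoryStateStore` stands in for the external system (with a real store such as Redis, `get`/`put` would be network calls):

```python
class InMemoryStateStore:
    """Stand-in for an external key-value store.

    NOTE: a class-level dict is still per-process; this only illustrates
    the access pattern. In production these methods would call out to a
    shared service, which is what actually makes the state global.
    """
    _data = {}

    @classmethod
    def get(cls, key, default=None):
        return cls._data.get(key, default)

    @classmethod
    def put(cls, key, value):
        cls._data[key] = value


def update_grid_list(grid):
    # Read-modify-write against the shared store instead of `self.grid_list`;
    # an int counter stands in for the CharacteristicVector update.
    density = InMemoryStateStore.get(grid, 0) + 1
    InMemoryStateStore.put(grid, density)
    return str(density)


print(update_grid_list("g1"))  # 1
print(update_grid_list("g1"))  # 2  -- the state survived the second call
```

One caveat with any external store: concurrent tasks updating the same key need atomic operations (e.g. an atomic increment or compare-and-set) rather than a plain get-then-put, or updates can be lost.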
