How do I use mapreduce to bulk update datastore entities that satisfy a query?

Question

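I want to use the mapreduce library to update all entities that satisfy a query. There are a couple of complications:

1. The query that finds the entities to update checks whether the value of a particular property, "property1", is contained in a long list of values (~10,000 entries) from a csv file.
2. For each entity that satisfies the query, another property, "property2", needs to be updated to equal the value in the second column of the same row of the csv file.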

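I know how to upload the csv file to the Blobstore and read each row using a Blobstore input reader. I am also aware of the Datastore input reader that fetches entities using a query.

My question is: how can I create a Mapper class that reads input data from the Blobstore, fetches the datastore entities, and updates them as efficiently as possible?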

Answer 1

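Given that the list of possible values for property1 is long, filtering with a query doesn't seem like a good option (you would need an IN filter, which actually runs one query per value).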

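An alternative using MR would be to load your CSV into memory as a Map (from property1 to property2) and then fire an MR job that iterates over all entities; if an entity's property1 is one of the keys of the Map, update it using the mapped value.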
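
A minimal sketch of the lookup logic in that approach, kept free of App Engine imports so it can be run anywhere. The names `load_mapping` and `map_entity` and the `Entity` stand-in class are hypothetical; in a real job the map function would receive datastore entities from the input reader and would yield a put operation rather than the entity itself.

```python
import csv
import io


class Entity(object):
    """Hypothetical stand-in for a datastore entity with two properties."""

    def __init__(self, property1, property2=None):
        self.property1 = property1
        self.property2 = property2


def load_mapping(csv_text):
    """Build a dict from column 1 (property1) to column 2 (new property2)."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2:  # skip blank or malformed lines
            mapping[row[0].strip()] = row[1].strip()
    return mapping


def map_entity(entity, mapping):
    """Yield the entity only when it matches the CSV and actually changes."""
    new_value = mapping.get(entity.property1)
    if new_value is not None and entity.property2 != new_value:
        entity.property2 = new_value
        yield entity


mapping = load_mapping("a,1\nb,2\n")
entities = [Entity("a"), Entity("b", "2"), Entity("z")]
# Only "a" is yielded: "b" already holds the CSV value and "z" is not in the CSV.
changed = [e for entity in entities for e in map_entity(entity, mapping)]
```

Skipping entities whose property2 already holds the CSV value avoids needless writes, which is usually where most of the cost of a job like this goes.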

As @Ryan B says, you don't need MR for this if you just want to take advantage of batch puts, since you can pass an Iterable to put using the DatastoreService.
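
If MR is dropped, the batching itself is simple. A sketch under assumptions: `batches` and `put_in_batches` are hypothetical helper names, `put_batch` stands for whatever batch write call is used (for example `ndb.put_multi` in Python), and the batch size of 500 is an assumed value that should be checked against the datastore limits.

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def put_in_batches(entities, put_batch, size=500):
    """Call put_batch once per chunk and return the number of entities written."""
    written = 0
    for chunk in batches(entities, size):
        put_batch(chunk)  # e.g. ndb.put_multi(chunk) on App Engine (assumption)
        written += len(chunk)
    return written


# Stand-in for the real write call: record each batch instead of writing it.
calls = []
total = put_in_batches(range(1200), calls.append, size=500)
```

Because `batches` consumes any iterable lazily, the entities can come straight from a query iterator without loading them all into memory first.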

Answer 2

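You can use a DatastoreInputReader and, in the map function, find out whether property1 is actually in the csv. Reading from the csv every time would be very slow; what you can do instead is use memcache to provide that information after it has been read just once from its own Datastore model. To populate that datastore model, I would recommend using property1 as the custom id of each row, so that querying it is straightforward. You would only update the Datastore for those values that actually change, and use the mutation pool to make it performant (op.db.Put()). I leave you pseudocode (sorry... I only have it in Python) showing how the different pieces would look. I further recommend reading this article on Mapreduce on Google App Engine: http://sookocheff.com/posts/2014-04-15-app-engine-mapreduce-api-part-1-the-basics/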

from google.appengine.api import memcache  # used by property1InCSV below
from google.appengine.ext import ndb
from mapreduce import operation as op
from mapreduce.lib import pipeline
from mapreduce import mapreduce_pipeline

class TouchPipeline(pipeline.Pipeline):
    """
    Pipeline that updates a field on entities meeting a certain condition
    """

    def run(self, *args, **kwargs):
        """ run """
        mapper_params = {
            # replace with your own kind (normally the model's dotted path)
            "entity_kind": "yourDatastoreKind",
        }
        yield mapreduce_pipeline.MapperPipeline(
            "Update entities that have certain condition",
            # replace with the dotted path to datastore_map in your module
            handler_spec="datastore_map",
            input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
            params=mapper_params,
            shards=64)


class csvrow(ndb.Model):
  # property1 is not stored as a field because its value is used as the key id
  substitutefield = ndb.StringProperty()

def create_csv_datastore():
  # Placeholder data: instead of running this as is, read the rows from the
  # csv in the Blobstore and store one csvrow per line.
  for i in range(10000):
    # The property1 value is the key id of the row; only the other column is
    # stored, i.e. the value that will be substituted on a match.
    # get_or_insert already writes the entity, so no extra put() is needed.
    csvrow.get_or_insert('property%s' % i, substitutefield='substitute%s' % i)


def queryfromcsv(property1):
  # Look the row up directly by key; fall back to property1 when nothing matches
  row = ndb.Key('csvrow', property1).get()
  if row:
    return row.substitutefield
  else:
    return property1

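def property1InCSV(property1):
  # Read-through cache: try memcache first, then fall back to the datastore
  data = memcache.get(property1)
  if data is not None:
    return data
  # queryfromcsv is a module-level function, so there is no self here
  data = queryfromcsv(property1)
  # cache the result for 60 seconds
  memcache.add(property1, data, 60)
  return data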

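def datastore_map(entity):
  # Called once for each entity supplied by the DatastoreInputReader
  newvalue = property1InCSV(entity.property1)
  # queryfromcsv returns property1 itself when the csv has no matching row
  if newvalue != entity.property1:
    # write the csv value to property2, as described in the question
    entity.property2 = newvalue
    # use the mutation pool so that writes are batched
    yield op.db.Put(entity)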


Answer 1
3 2015-01-19 17:11:46

Answer 2
2 2015-01-25 01:53:42
