
How to use mapreduce to bulk update datastore entities that satisfy a query?

I want to use the mapreduce library to update all entities that satisfy a query. 我想使用mapreduce库来更新满足查询的所有实体。 There are a couple of complications: 有几个并发症:

  1. The query that finds the entities to update checks whether the value of a particular property "property1" is contained in a long list of values (~10000 entries) from a csv file.
  2. For each entity satisfying the query, another property "property2" needs to be updated to equal the value in the second column of the same row of the csv file (see the example rows after this list).
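
For concreteness, a couple of hypothetical rows of that csv file would look like this, with property1 values in the first column and the replacement property2 values in the second (all values below are made-up placeholders):

old_value_a,new_value_a
old_value_b,new_value_b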

I know how to upload the csv file to Blobstore and read each row using a Blobstore input reader. I am also aware of the Datastore input reader that gets entities using a query.
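
For reference, a minimal sketch of reading that csv out of the Blobstore with the standard csv module; load_csv_mapping and the blob_key parameter are hypothetical names introduced here, not part of any existing code:

import csv

from google.appengine.ext import blobstore


def load_csv_mapping(blob_key):
  """Read the uploaded csv and return a dict mapping property1 -> property2."""
  mapping = {}
  # BlobReader exposes the blob as a file-like object, so csv.reader can consume it
  for row in csv.reader(blobstore.BlobReader(blob_key)):
    if len(row) >= 2:
      mapping[row[0]] = row[1]
  return mapping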

My question is how can I create a Mapper class that reads input data from the Blobstore, fetches the datastore entities and updates them as efficiently as possible?

Given that the list of possible values for property1 is long, using a query to filter doesn't seem like a good option (because you would need to use an IN filter, which actually runs one query per value).

An alternative using MR would be to load your CSV into memory as a map (from property1 to property2), and then fire an MR job that iterates all entities; if an entity's property1 is among the map's keys, modify it using the mapped value.
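
A rough sketch of that approach, assuming the ~10000-entry mapping fits comfortably in instance memory and reusing the hypothetical load_csv_mapping helper sketched above (CSV_BLOB_KEY is likewise a placeholder):

from mapreduce import operation as op

# loaded once per instance; ~10000 entries is small enough to keep in memory
# (CSV_BLOB_KEY is a placeholder for the uploaded file's blob key)
CSV_MAPPING = load_csv_mapping(CSV_BLOB_KEY)

def in_memory_map(entity):
  """Mapper run by a DatastoreInputReader job over all entities of the kind."""
  new_value = CSV_MAPPING.get(entity.property1)
  if new_value is not None and entity.property2 != new_value:
    entity.property2 = new_value
    # queue the write through the mutation pool instead of doing a direct put
    yield op.db.Put(entity)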

As @Ryan B says, you don't need to use MR for this if you just want to take advantage of batch puts: you can pass an Iterable of entities to a single put using the DatastoreService.
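
In Python, a minimal sketch of that non-MR route would batch the writes with ndb.put_multi; YourKind and the mapping argument are assumptions for illustration, not names from the original code:

from google.appengine.ext import ndb

def batch_update(mapping):
  """Query the kind directly and batch-put only the entities that change."""
  to_put = []
  for entity in YourKind.query():  # YourKind is a placeholder model class
    new_value = mapping.get(entity.property1)
    if new_value is not None and entity.property2 != new_value:
      entity.property2 = new_value
      to_put.append(entity)
    if len(to_put) >= 500:
      ndb.put_multi(to_put)  # one batched RPC instead of one put per entity
      to_put = []
  if to_put:
    ndb.put_multi(to_put)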

You can use a DatastoreInputReader and, in the map function, find out whether property1 is actually in the csv. Reading the csv each time would be very slow; what you can do instead is use memcache to provide that info after it has been read just once from its own Datastore model. To populate that datastore model, I would recommend using property1 as the custom id of each row; that way, querying it is straightforward. You would only update the Datastore for those values that actually change, and you would use the mutation pool to make it performant (op.db.Put()).

I leave you pseudo code (sorry... I only have it in python) of how the different pieces would look. I further recommend reading this article on MapReduce on Google App Engine: http://sookocheff.com/posts/2014-04-15-app-engine-mapreduce-api-part-1-the-basics/

from google.appengine.api import memcache
from google.appengine.ext import ndb
from mapreduce import operation as op
from mapreduce.lib import pipeline
from mapreduce import mapreduce_pipeline

class TouchPipeline(pipeline.Pipeline):
    """
    Pipeline to update the field of entities that have certain condition
    """

    def run(self, *args, **kwargs):
        """ run """
        mapper_params = {
            "entity_kind": "yourDatastoreKind",
        }
        yield mapreduce_pipeline.MapperPipeline(
            "Update entities that have certain condition",
            # handler_spec must be the fully-qualified dotted path to the
            # datastore_map function, e.g. "main.datastore_map" if it lives in main.py
            handler_spec="datastore_map",
            input_reader_spec="mapreduce.input_readers.DatastoreInputReader",
            params=mapper_params,
            shards=64)


class csvrow(ndb.Model):
  # property1 is not stored because its value is used as the entity id
  substitutefield = ndb.StringProperty()

def create_csv_datastore():
  # instead of hard-coding 10000 rows like this, read the csv from the
  # blobstore, iterate its rows and insert/update the values accordingly
  for i in range(10000):
    # property1 is used as the custom id of the row; only the other column
    # (the eventual substitute value) needs to be stored
    csvrow.get_or_insert('property%s' % i, substitutefield='substitute%s' % i)


def queryfromcsv(property1):
  row = ndb.Key('csvrow', property1).get()
  if row:
    return row.substitutefield
  return property1

def property1InCSV(property1):
  data = memcache.get(property1)
  if data is not None:
    return data
  # cache miss: fall back to the csvrow entity and cache the answer briefly
  data = queryfromcsv(property1)
  memcache.add(property1, data, 60)
  return data

def datastore_map(entity):
  current_value = entity.property1
  newvalue = property1InCSV(current_value)
  if newvalue != current_value:
    # property1 was found in the csv, so store the substitute value in property2
    entity.property2 = newvalue
    # use the mutation pool
    yield op.db.Put(entity)
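
To run the whole thing, you would populate the csvrow entities once and then kick off the pipeline, for example from a request handler. This is a sketch of the usual pipeline-start pattern, not code from the original answer:

create_csv_datastore()  # one-off load of the csv rows into the Datastore

pipe = TouchPipeline()
pipe.start()  # starts the mapper job asynchronously
# the standard pipeline status page can then be found at:
status_url = '%s/status?root=%s' % (pipe.base_path, pipe.pipeline_id)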
