
How to get the total count of entities in a kind in Google Cloud Datastore

I have a kind with around 5 million entities in Google Cloud Datastore. I want to get this count programmatically using Java. I tried the following code, but it only works up to a certain threshold (800K). When I ran the query for 5 million records, it never returned a count, so my guess is that it went into an infinite loop. How can I get the entity count for data this large? I would prefer not to use the Google App Engine API, since it requires setting up an environment.

// Assumes com.google.cloud:google-cloud-datastore and Guava on the classpath
import com.google.cloud.datastore.Datastore;
import com.google.cloud.datastore.DatastoreOptions;
import com.google.cloud.datastore.Key;
import com.google.cloud.datastore.Query;
import com.google.common.collect.Iterators;

Datastore datastore = DatastoreOptions.getDefaultInstance().getService();

// Keys-only query: cheaper than fetching full entities, but this still
// streams every single key to the client before the size can be taken.
Query<Key> query = Query.newKeyQueryBuilder().setKind(kind).build();

int count = Iterators.size(datastore.run(query)); // count holds the entity count
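As a sketch of the general fix (independent of any one client library): count keys page by page with cursors instead of draining a single giant iterator. Here `fake_fetch` is a hypothetical in-memory stand-in for a keys-only Datastore page fetch that takes a start cursor and returns the next cursor plus a more-results flag:

```python
def count_in_pages(fetch_page, page_size=1000):
    """Sum page sizes until the backend reports no more results.

    `fetch_page(cursor, limit)` is assumed to return
    (keys, next_cursor, more_results); with a real client this would
    wrap a keys-only query started at the given cursor.
    """
    total, cursor = 0, None
    while True:
        keys, cursor, more = fetch_page(cursor, page_size)
        total += len(keys)
        if not more:
            return total

# Hypothetical in-memory "backend" standing in for keys-only query pages.
_KEYS = list(range(2500))

def fake_fetch(cursor, limit):
    start = cursor or 0
    page = _KEYS[start:start + limit]
    nxt = start + len(page)
    return page, nxt, nxt < len(_KEYS)
```

With the fake backend, `count_in_pages(fake_fetch)` walks three pages (1000, 1000, 500) and returns 2500. The page size bounds memory use per iteration, although a full count still touches every key.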

How accurate do you need the count to be? For a slightly out-of-date count, you can use a stats entity to fetch the number of entities in a kind.
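As a sketch: Datastore exposes per-kind statistics through the built-in `__Stat_Kind__` kind, whose rows carry `kind_name` and `count` properties. The helper below just picks the matching row out of whatever the stats query returns, treating the entities as plain dicts; the commented-out query line assumes the google-cloud-datastore Python client:

```python
def kind_count_from_stats(stat_entities, kind):
    """Return the stored (possibly stale) entity count for `kind`.

    `stat_entities` is expected to be the result of querying the
    built-in `__Stat_Kind__` kind, e.g.:
        client = datastore.Client()
        stat_entities = list(client.query(kind='__Stat_Kind__').fetch())
    Returns None if no stats row exists for the kind.
    """
    for entity in stat_entities:
        if entity.get('kind_name') == kind:
            return entity.get('count')
    return None
```

Note that the statistics are refreshed periodically by the service, so the number can lag the true count.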

If you can't use the stale counts from the stats entity, then you'll need to keep counter entities for the real-time counts that you need. You should consider using a sharded counter.
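The idea behind a sharded counter is to spread writes across N counter rows (each increment picks one shard at random, so concurrent writers rarely contend on the same row) and to sum all shards at read time. A minimal, storage-agnostic sketch, with shards as plain dicts standing in for what would be individual Datastore entities updated inside transactions:

```python
import random

def increment(shards, n=1):
    """Bump one randomly chosen shard by n.

    With Datastore, each shard would be its own entity and this update
    would run in a transaction; randomness spreads write contention.
    """
    shard = random.choice(shards)
    shard['count'] = shard.get('count', 0) + n

def total(shards):
    """Read-time aggregate: the counter's value is the sum of all shards."""
    return sum(s.get('count', 0) for s in shards)
```

More shards means higher sustainable write throughput but a slightly more expensive read, since every shard must be fetched to compute the total.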

Check out Google Dataflow. A pipeline like the following should do it:

import json

import requests
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from google.cloud.proto.datastore.v1 import query_pb2

def send_count_to_call_back(callback_url):
    def f(record_count):
        requests.post(callback_url, data=json.dumps({
            'record_count': record_count,
        }))
    return f

def run_pipeline(project, callback_url):
    pipeline_options = PipelineOptions.from_dictionary({
        'project': project,
        'runner': 'DataflowRunner',
        'staging_location': 'gs://%s.appspot.com/dataflow-data/staging' % project,
        'temp_location': 'gs://%s.appspot.com/dataflow-data/temp' % project,
        # .... other options
    })

    query = query_pb2.Query()
    query.kind.add().name = 'YOUR_KIND_NAME_GOES_HERE'

    p = beam.Pipeline(options=pipeline_options)
    _ = (p
         | 'fetch all rows for query' >> ReadFromDatastore(project, query)
         | 'count rows' >> beam.combiners.Count.Globally()
         | 'send count to callback' >> beam.Map(send_count_to_call_back(callback_url))
        )
    p.run()  # without this the pipeline is built but never executed

I use Python, but they have a Java SDK too: https://beam.apache.org/documentation/programming-guide/

The only issue is that your process will have to trigger this pipeline, let it run on its own for a few minutes, and then have it hit a callback URL to let you know it's done.
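On the receiving end, the callback handler only has to pull `record_count` back out of the POST body that `send_count_to_call_back` sends. A framework-agnostic sketch (`parse_count_callback` is a hypothetical helper, not part of any library):

```python
import json

def parse_count_callback(body):
    """Extract the entity count from the pipeline's callback POST body.

    The body is the JSON string produced by send_count_to_call_back,
    e.g. '{"record_count": 5000000}'.
    """
    payload = json.loads(body)
    return int(payload['record_count'])
```

A real handler would wrap this in whatever web framework serves the callback URL and then store or act on the count.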
