
Delays when streaming data from the Google Search Console API to BigQuery

So I've been trying to stream data from the Google Search Console API to BigQuery in real time.

The data are retrieved from the GSC API and streamed to the BigQuery streaming buffer. However, I experience high latency before the streaming buffer is flushed (up to 2 hours or more), so the data stays in the streaming buffer but is not in the table. The data are also not visible in the preview, and the table size shows 0B with 0 rows (actually, after waiting for more than a day I still see 0B even though there are more than 0 rows).

Another issue is that, some time after the data is stored in the table (table size and number of rows are correct), it simply disappears from the table and appears back in the streaming buffer (I only saw this once). -> This was explained by the second bullet in shollyman's answer.

What I want is to have the data in the table in real time. According to the documentation this seems possible, but it doesn't work in my case (the 2-hour delay stated above).

Here's the code responsible for that part:

import uuid
from googleapiclient.discovery import build

# Build the BigQuery API client once, outside the loop
bigquery_service = build('bigquery', 'v2', cache_discovery=False)

for row in response['rows']:
    keys = ','.join(row['keys'])

    # Streaming insert: send one row at a time to BigQuery
    row_to_stream = {'keys': keys, 'f1': row['f1'], 'f2': row['f2'],
                     'ctr': row['ctr'], 'position': row['position']}
    insert_all_data = {
        "kind": "bigquery#tableDataInsertAllRequest",
        "skipInvalidRows": True,
        "ignoreUnknownValues": True,
        "rows": [{
            "insertId": str(uuid.uuid4()),
            "json": row_to_stream,
        }]
    }

    bigquery_service.tabledata().insertAll(
        projectId=projectid,
        datasetId=dataset_id,
        tableId=tableid,
        body=insert_all_data).execute(num_retries=5)

I've seen questions on here that seem very similar to mine, but I haven't really found an answer. I therefore have 2 questions.

1. What could cause this issue?

Also, I'm new to GCP and I've seen other options (at least they seemed like options to me) for real-time streaming of data to BigQuery (e.g., using Pub/Sub and a few projects around real-time Twitter data analysis).

2. How do you pick the best option for a particular task?

shollyman's answer:

  • By default, the BigQuery web UI doesn't automatically refresh the state of a table. There is a Refresh button when you click into the details of a table; it should show you the updated size information for both managed storage and the streaming buffer (displayed below the main table details). Rows in the buffer are available to queries, but the preview button may not show results until some data is extracted from the streaming buffer to managed storage.

  • I suspect the case where you observed data disappearing from managed storage and appearing back in the streaming buffer may have been a case where the table was deleted and recreated with the same name, or was truncated in some fashion and streaming restarted. Data doesn't transition from managed storage back to the buffer.

  • Deciding what technology to use for streaming depends on your needs. Pub/Sub is a great choice when you have multiple consumers of the information (multiple Pub/Sub subscribers consuming the same stream of messages independently), or you need to apply additional transformations of the data between the producer and consumer. To get the data from Pub/Sub to BigQuery, you'll still need a subscriber to write the messages into BigQuery, as the two have no direct integration (a minimal subscriber sketch follows this list).
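
For illustration, here is a minimal sketch of such a subscriber, assuming the google-cloud-pubsub and google-cloud-bigquery client libraries, one JSON-encoded row per Pub/Sub message, and hypothetical project, subscription and table names (none of these come from the original setup):

import json

from google.cloud import bigquery, pubsub_v1

# Hypothetical identifiers, for illustration only
PROJECT_ID = "my-project"
SUBSCRIPTION_ID = "gsc-rows-sub"
TABLE_ID = "my-project.my_dataset.gsc_data"

bq_client = bigquery.Client()
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def callback(message):
    # Each message is assumed to carry one JSON-encoded row
    row = json.loads(message.data.decode("utf-8"))
    errors = bq_client.insert_rows_json(TABLE_ID, [row])
    if not errors:
        message.ack()   # acknowledge only after the row reaches BigQuery
    else:
        message.nack()  # let Pub/Sub redeliver the message

# Start the streaming pull and block until it fails or is cancelled
streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()

Note that this row-level path still goes through the same streaming buffer as the direct insertAll calls above.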
