
BigQuery not accurately returning results

I'm using Google App Engine, and occasionally when I send a query to BigQuery via the JSON API I get incorrect results. The problem is usually confined to a single table within BigQuery (I create a new table for every batch job). When I run into this issue in production, I log the query I submitted and try running it via the BigQuery dashboard, where it runs longer than expected but returns the expected results.

There is nothing in the response indicating an issue. jobComplete comes back as True, but I see no rows, just the jobReference, schema, and totalRows = 0.

In such situations, is it appropriate to make a separate call to get the job results, even though I would expect the current call to return them?

Relevant Code:

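# Imports are assumed; the original snippet did not show them.
import httplib2
from apiclient.discovery import build
from google.appengine.api import memcache
from oauth2client.appengine import AppAssertionCredentials

# Runs inside a method of the client class; `query` holds the SQL string.
http = httplib2.Http(memcache)
self.credentials = AppAssertionCredentials(scope='https://www.googleapis.com/auth/bigquery')
self.http = self.credentials.authorize(http=http)
self.service = build('bigquery', 'v2', http=self.http)
jobs = self.service.jobs()
result = jobs.query(projectId=settings.GOOGLE_APIS_PROJECT_ID,
                    body={'query': query}).execute()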

Response:

{u'totalRows': u'0', u'kind': u'bigquery#queryResponse', u'jobComplete': True, u'jobReference': {u'projectId': u'<REMOVED>', u'jobId': u'<REMOVED>'}, u'schema': {u'fields': [<REMOVED>]}}

No matter how many times I re-run the query in production, the same results are returned. (Could this be due to the caching done via memcache, with an incorrect result being cached as the response?)

The issue was a mix of the following:

  1. The shared http object is NOT threadsafe (https://developers.google.com/api-client-library/python/guide/thread_safety). Although most examples of using BigQuery on GAE use a shared httplib2 object, this is incorrect usage. Only the credentials store is threadsafe and can be shared.
  2. There is a 10s timeout on queries made through the query() call on BigQuery.

I was making multiple calls to BigQuery in parallel using a shared http object and task queues, and the queries were taking over 10s to complete. This is why responses got mixed up between calls and the results were not as expected. E.g., I sometimes received the discovery response to my query request.
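One way to avoid sharing the http object is to give every thread its own client. The following is only a minimal sketch of that idea, not the code I ended up with: `get_thread_local_client` and `factory` are hypothetical names, and `factory` stands for whatever function authorizes a fresh httplib2.Http() and builds the BigQuery service from it.

```python
import threading

# Per-thread storage: each thread sees its own attributes on this object.
_local = threading.local()


def get_thread_local_client(factory):
    """Return the calling thread's own client, creating it on first use.

    `factory` is a hypothetical zero-argument callable that builds a new
    client (for example, one that authorizes a new httplib2.Http() and
    passes it to build()). It is called at most once per thread.
    """
    client = getattr(_local, "client", None)
    if client is None:
        client = factory()
        _local.client = client
    return client
```

Calling `get_thread_local_client(make_service)` from a request or task handler would then reuse the client within one thread but never hand the same http object to two threads, while the credentials themselves can still be shared.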

The Fix:

I rewrote my BigQuery client code so that it no longer shares the httplib2 object between calls, and decoupled my process so that it submits BigQuery jobs to run queries instead of using the query() call. There is a lot more overhead in managing the calls, checking statuses, and receiving results, but at least it works now and the responses make sense.
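The job-based flow looks roughly like the sketch below. It is an illustration under assumptions rather than my production code: `run_query_job`, `poll_seconds`, `max_polls`, and the injectable `sleep` are hypothetical, and `service` is assumed to be a BigQuery v2 client built per call or per thread. It inserts a query job, polls its status until the state is DONE, and only then fetches the results.

```python
import time


def run_query_job(service, project_id, query,
                  poll_seconds=1.0, max_polls=60, sleep=time.sleep):
    """Submit `query` as a job, wait for it to finish, and return its results."""
    jobs = service.jobs()

    # Submit the query as an asynchronous job instead of calling jobs.query().
    body = {'configuration': {'query': {'query': query}}}
    job = jobs.insert(projectId=project_id, body=body).execute()
    job_id = job['jobReference']['jobId']

    # Poll the job status until BigQuery reports it as DONE.
    for _ in range(max_polls):
        status = jobs.get(projectId=project_id, jobId=job_id).execute()['status']
        if status['state'] == 'DONE':
            if 'errorResult' in status:
                raise RuntimeError('BigQuery job failed: %r' % (status['errorResult'],))
            # The job finished successfully, so the results are ready to read.
            return jobs.getQueryResults(projectId=project_id, jobId=job_id).execute()
        sleep(poll_seconds)

    raise RuntimeError('BigQuery job %s did not finish in time' % job_id)
```

On App Engine the polling step would more likely be split into separate task queue tasks than run in one loop, and large result sets would also need paging via the pageToken returned by getQueryResults; both are left out here to keep the sketch short.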
