Google Cloud: BigQuery query consumes a large amount of memory

Question

TLDR; Querying 12.9 MB in BQ takes roughly 540 MB of memory in Python, and this grows linearly.

I am querying some BigQuery tables. Running the following query on https://bigquery.cloud.google.com/

SELECT * FROM dataset1.table1, dataset1.table2

gives this result:

Query complete (5.2s elapsed, 12.9 MB processed)

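That is roughly 150k rows. When I run the same query from Python, it uses up to 540 MB of memory. If I query 300k rows, RAM usage doubles. Running the same query several times does not change the memory usage, so my best guess is that it is using some buffer that never gets freed. I tested whether gc.collect() helps, but it did not. I also dumped the data to JSON, and that file is about 25 MB. So my question is: why is the memory usage so large, and is there any way to change it?

My code: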

from apiclient.discovery import build
from oauth2client.file import Storage
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.tools import run
import httplib2
import sys

projectId = '....'
bqCredentialsFile = 'bigquery_credentials.dat'
clientId = '....'  # production
secret = '.......apps.googleusercontent.com '  # production

storage = Storage(bqCredentialsFile)
credentials = storage.get()
if credentials is None or credentials.invalid:
    flow = OAuth2WebServerFlow(client_id=clientId, client_secret=secret, scope='https://www.googleapis.com/auth/bigquery')
    credentials = run(flow, storage)

http = httplib2.Http()
http = credentials.authorize(http)
svc = build('bigquery', 'v2', http=http)


def getQueryResults(jobId, pageToken):
    req = svc.jobs()
    return req.getQueryResults(projectId=projectId, jobId=jobId, pageToken=pageToken).execute()


def query(queryString, priority='BATCH'):
    req = svc.jobs()
    body = {'query': queryString, 'maxResults': 100000, 'configuration': {'priority': priority}}
    res = req.query(projectId=projectId, body=body).execute()
    if 'rows' in res:
        for row in res['rows']:
            yield row
        for _ in range(int(res['totalRows']) / 100000):
            pageToken = res['pageToken']
            res = getQueryResults(res['jobReference']['jobId'], pageToken=pageToken)
            for row in res['rows']:
                yield row


def querySome(tableKeys):
    queryString = '''SELECT * FROM {0} '''.format(','.join(tableKeys))
    if len(tableKeys) > 0:
        return query(queryString, priority='BATCH')


if __name__ == '__main__':
    import simplejson as json
    tableNames = ['dataset1.table1', 'dataset1.table2']
    output = list(querySome(tableNames))
    fl = open('output.json', 'w')
    fl.write(json.dumps(output))
    fl.close()
    print input('done')

Answer 1

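It looks to me like the issue is in the line output = list(querySome(tableNames)). I'm not a Python expert, but as far as I can tell, this converts your generator into a concrete list and requires all of the results to be held in memory at once. If you iterate over it row by row and write one row at a time, you may find that you get better memory usage behavior.

For example: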

output = querySome(tableNames)
fl = open('output.json', 'w')
for line in output:
  fl.write(json.dumps(line))
  fl.write('\n')
fl.close()
print input('done')
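
To see how much the list() call can matter, here is a minimal Python 3 sketch that compares peak allocations for the two approaches using the standard library's tracemalloc. Note that fake_rows is a hypothetical stand-in for querySome, and the row shape is only an assumption about what the API returns; the absolute numbers will differ from a real BigQuery run.

```python
import json
import os
import tracemalloc


def fake_rows(n):
    # Hypothetical stand-in for querySome(): yields n small row dicts.
    for i in range(n):
        yield {'f': [{'v': str(i)}, {'v': 'value-%d' % i}]}


def peak_bytes(func):
    # Run func() and return the peak traced allocation in bytes.
    tracemalloc.start()
    try:
        func()
        return tracemalloc.get_traced_memory()[1]
    finally:
        tracemalloc.stop()


def materialize(n=20000):
    # Holds every row in memory before serializing, like list(querySome(...)).
    output = list(fake_rows(n))
    with open(os.devnull, 'w') as fl:
        fl.write(json.dumps(output))


def stream(n=20000):
    # Holds only one row at a time while writing.
    with open(os.devnull, 'w') as fl:
        for row in fake_rows(n):
            fl.write(json.dumps(row))
            fl.write('\n')


peak_list = peak_bytes(materialize)
peak_stream = peak_bytes(stream)
print('list: %d bytes, stream: %d bytes' % (peak_list, peak_stream))
```

The streamed version writes one JSON object per line rather than a single JSON array, so whatever reads output.json later would need to parse it line by line.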

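Also, since BigQuery limits the size of responses, you may get back fewer than 100000 rows when you fetch query results. Instead of computing the number of pages up front, you should iterate until no pageToken is returned in the response.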
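
The paging loop could be sketched roughly like this. It is not tested against the real API: fetch_page is a hypothetical callable that takes a page token (None for the first page) and returns a dict shaped like a getQueryResults response, and the canned pages below only stand in for real responses.

```python
def iter_all_rows(fetch_page):
    """Yield rows from every page until a response carries no pageToken."""
    page_token = None
    while True:
        res = fetch_page(page_token)
        # A page may contain fewer rows than requested, or none at all.
        for row in res.get('rows', []):
            yield row
        page_token = res.get('pageToken')
        if not page_token:
            break


# Canned pages keyed by the token that requests them (illustrative data only).
pages = {
    None: {'rows': [{'f': [{'v': '1'}]}, {'f': [{'v': '2'}]}], 'pageToken': 'p2'},
    'p2': {'rows': [{'f': [{'v': '3'}]}], 'pageToken': 'p3'},
    'p3': {'rows': [{'f': [{'v': '4'}]}]},
}
rows = list(iter_all_rows(pages.get))
print(len(rows))
```

With the code from the question, fetch_page would presumably be something like lambda token: getQueryResults(jobId, token), with jobId taken from the first response's jobReference.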
