
Google Cloud: big queries cost big memory

TL;DR: querying 12.9 MB in BigQuery costs about 540 MB of RAM in Python, and this grows roughly linearly.

I'm querying some BigQuery tables. Running the following query at https://bigquery.cloud.google.com/

SELECT * FROM dataset1.table1, dataset1.table2

Results in:

Query complete (5.2s elapsed, 12.9 MB processed)

It's about 150k rows of data. When I run the same query in Python, it uses up to 540 MB of RAM, and querying 300k rows doubles that. Running the same query multiple times doesn't change the RAM usage, so my best guess is that some buffer is being kept alive and never freed. I've tested whether gc.collect() helps, but it didn't. I've also dumped the data to JSON, and that file is only about 25 MB. So my question is: why is the memory usage so large, and is there any way to reduce it?
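For reference, the peak memory of the process can be checked with the standard-library resource module (a minimal sketch, assuming Linux, where ru_maxrss is reported in kilobytes):

import resource

def peak_rss_mb():
    # Peak resident set size of this process; ru_maxrss is in kilobytes on Linux.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

print 'peak RSS: %.1f MB' % peak_rss_mb()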

My code:

from apiclient.discovery import build
from oauth2client.file import Storage
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.tools import run
import httplib2
import sys

projectId = '....'
bqCredentialsFile = 'bigquery_credentials.dat'
clientId = '....'  # production
secret = '.......apps.googleusercontent.com '  # production

storage = Storage(bqCredentialsFile)
credentials = storage.get()
if credentials is None or credentials.invalid:
    flow = OAuth2WebServerFlow(client_id=clientId, client_secret=secret, scope='https://www.googleapis.com/auth/bigquery')
    credentials = run(flow, storage)

http = httplib2.Http()
http = credentials.authorize(http)
svc = build('bigquery', 'v2', http=http)


def getQueryResults(jobId, pageToken):
    req = svc.jobs()
    return req.getQueryResults(projectId=projectId, jobId=jobId, pageToken=pageToken).execute()


def query(queryString, priority='BATCH'):
    req = svc.jobs()
    body = {'query': queryString, 'maxResults': 100000, 'configuration': {'priority': priority}}
    res = req.query(projectId=projectId, body=body).execute()
    if 'rows' in res:
        for row in res['rows']:
            yield row
        for _ in range(int(res['totalRows']) / 100000):
            pageToken = res['pageToken']
            res = getQueryResults(res['jobReference']['jobId'], pageToken=pageToken)
            for row in res['rows']:
                yield row


def querySome(tableKeys):
    queryString = '''SELECT * FROM {0} '''.format(','.join(tableKeys))
    if len(tableKeys) > 0:
        return query(queryString, priority='BATCH')


if __name__ == '__main__':
    import simplejson as json
    tableNames = ['dataset1.table1', 'dataset1.table2']
    output = list(querySome(tableNames)) 
    fl = open('output.json', 'w')
    fl.write(json.dumps(output))
    fl.close()
    print input('done')

It looks to me like the issue is the output = list(querySome(tableNames)) line. I'm not a Python expert, but from what I can tell, this converts your generator into a concrete list and requires the entire result set to be in memory at once. If you iterate row by row and write a single line at a time, you may find the memory usage behaves much better.

As in:

output = querySome(tableNames)
fl = open('output.json', 'w')
for line in output:
    fl.write(json.dumps(line))
    fl.write('\n')
fl.close()
print input('done')

Also, when you get query results you may get back fewer than 100000 rows, since BigQuery limits the size of its responses. Instead of assuming a fixed page size, you should keep fetching pages until no pageToken is returned in the response.
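For example, the paging loop in query() could be rewritten along these lines (a sketch reusing your getQueryResults helper; the field names are the ones already used in your code):

def query(queryString, priority='BATCH'):
    req = svc.jobs()
    body = {'query': queryString, 'maxResults': 100000, 'configuration': {'priority': priority}}
    res = req.query(projectId=projectId, body=body).execute()
    jobId = res['jobReference']['jobId']
    while True:
        # A page may contain fewer rows than maxResults, or no 'rows' key at all.
        for row in res.get('rows', []):
            yield row
        pageToken = res.get('pageToken')
        if not pageToken:
            break  # no more pages to fetch
        res = getQueryResults(jobId, pageToken=pageToken)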
