
google cloud, big queries cost big memory

TL;DR: querying 12.9 MB in BigQuery costs about 540 MB of RAM in Python, and this grows roughly linearly.

I'm querying some BigQuery tables. Running the following query at https://bigquery.cloud.google.com/

SELECT * FROM dataset1.table1, dataset1.table2

Results in:

Query complete (5.2s elapsed, 12.9 MB processed)

It's about 150k rows of data. When I run the same query in Python, it uses up to 540 MB of RAM. If I query 300k rows, the RAM usage doubles. Running the same query multiple times doesn't change the RAM usage, so my best guess is that it's using some buffer that never gets freed. I've tested whether gc.collect() helped, but it didn't. I've also dumped my data to JSON, and that file is about 25 MB. So my question is: why is the memory usage so large, and is there any way to change it?
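For reference, a minimal sketch of one way to measure these peak-memory figures from inside the process, using only the standard library (this assumes a Unix-like system; peakRssMb is just an illustrative helper name, and ru_maxrss is reported in kilobytes on Linux but in bytes on macOS):

import resource

def peakRssMb():
    # Peak resident set size of this process so far, in MB.
    # ru_maxrss is in kilobytes on Linux (bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0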

My code:

from apiclient.discovery import build
from oauth2client.file import Storage
from oauth2client.client import OAuth2WebServerFlow
from oauth2client.tools import run
import httplib2
import sys

projectId = '....'
bqCredentialsFile = 'bigquery_credentials.dat'
clientId = '....'  # production
secret = '.......apps.googleusercontent.com'  # production

storage = Storage(bqCredentialsFile)
credentials = storage.get()
if credentials is None or credentials.invalid:
    flow = OAuth2WebServerFlow(client_id=clientId, client_secret=secret, scope='https://www.googleapis.com/auth/bigquery')
    credentials = run(flow, storage)

http = httplib2.Http()
http = credentials.authorize(http)
svc = build('bigquery', 'v2', http=http)


def getQueryResults(jobId, pageToken):
    req = svc.jobs()
    return req.getQueryResults(projectId=projectId, jobId=jobId, pageToken=pageToken).execute()


def query(queryString, priority='BATCH'):
    req = svc.jobs()
    body = {'query': queryString, 'maxResults': 100000, 'configuration': {'priority': priority}}
    res = req.query(projectId=projectId, body=body).execute()
    if 'rows' in res:
        for row in res['rows']:
            yield row
        for _ in range(int(res['totalRows']) / 100000):
            pageToken = res['pageToken']
            res = getQueryResults(res['jobReference']['jobId'], pageToken=pageToken)
            for row in res['rows']:
                yield row


def querySome(tableKeys):
    queryString = '''SELECT * FROM {0} '''.format(','.join(tableKeys))
    if len(tableKeys) > 0:
        return query(queryString, priority='BATCH')


if __name__ == '__main__':
    import simplejson as json
    tableNames = ['dataset1.table1', 'dataset1.table2']
    output = list(querySome(tableNames)) 
    fl = open('output.json', 'w')
    fl.write(json.dumps(output))
    fl.close()
    print input('done')

It looks to me that the issue is in the output = list(querySome(tableNames)) line. I'm not a Python expert, but from what I can tell, this will convert your generator into a concrete list, and require the entire result set to be in memory. If you iterate line by line and write a single line at a time, you may find you have better memory usage behavior.

As in:

output = querySome(tableNames)
fl = open('output.json', 'w')
for line in output:
  fl.write(json.dumps(line))
  fl.write('\n')
fl.close()
print input('done')

Also, when you get query results, you may get back fewer than 100000 rows, since BigQuery limits the size of responses. Instead, you should iterate until no pageToken is returned in the response.
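A minimal sketch of that loop, reusing the svc and projectId objects from the question's code (queryAllRows is an illustrative helper; the jobId comes from the jobReference in the jobs().query() response):

def queryAllRows(jobId):
    # Fetch pages until the response no longer contains a pageToken.
    pageToken = None
    while True:
        res = svc.jobs().getQueryResults(
            projectId=projectId, jobId=jobId, pageToken=pageToken).execute()
        for row in res.get('rows', []):
            yield row
        pageToken = res.get('pageToken')
        if pageToken is None:
            break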
