
Why does 60GB memory disappear on a MySQL connector fetchall()?

MySQL 5.7.18
Python 2.7.5
Pandas 0.17.1
CentOS 7.3

A MySQL table:

CREATE TABLE test (
  id varchar(12)
) ENGINE=InnoDB;

The size is 10GB.

select round(((data_length) / 1024 / 1024 / 1024)) "GB"
from information_schema.tables 
where table_name = "test"

10GB

The box has 250GB memory:

$ free -hm
              total        used        free      shared  buff/cache   available
Mem:           251G         15G        214G        2.3G         21G        232G
Swap:          2.0G        1.2G        839M

Select the data:

import psutil
print '1 ' + str(psutil.phymem_usage())

import os
import sys
import time
import pyodbc 
import mysql.connector
import pandas as pd
from datetime import date
import gc
print '2 ' + str(psutil.phymem_usage())

db = mysql.connector.connect({snip})
c = db.cursor()
print '3 ' + str(psutil.phymem_usage())

c.execute("select id from test")
print '4 ' + str(psutil.phymem_usage())

e = c.fetchall()                  # pull every row into one Python list of tuples
print 'getsizeof: ' + str(sys.getsizeof(e))
print '5 ' + str(psutil.phymem_usage())

d = pd.DataFrame(e)               # build a DataFrame from the list of tuples
print d.info()
print '6 ' + str(psutil.phymem_usage())

c.close()
print '7 ' + str(psutil.phymem_usage())

db.close()
print '8 ' + str(psutil.phymem_usage())

del c, db, e
print '9 ' + str(psutil.phymem_usage())

gc.collect()
print '10 ' + str(psutil.phymem_usage())

time.sleep(60)
print '11 ' + str(psutil.phymem_usage())

The output:

1 svmem(total=270194331648L, available=249765777408L, percent=7.6, used=39435464704L, free=230758866944L, active=20528222208, inactive=13648789504, buffers=345387008L, cached=18661523456)
2 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
3 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
4 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
getsizeof: 1960771816
5 svmem(total=270194331648L, available=181568315392L, percent=32.8, used=107641655296L, free=162552676352L, active=88588271616, inactive=13656334336, buffers=345395200L, cached=18670243840)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 231246823 entries, 0 to 231246822
Data columns (total 1 columns):
0    object
dtypes: object(1)
memory usage: 3.4+ GB
None
6 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
7 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
8 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
9 svmem(total=270194331648L, available=183428308992L, percent=32.1, used=105781678080L, free=164412653568L, active=86735921152, inactive=13656334336, buffers=345395200L, cached=18670260224)
10 svmem(total=270194331648L, available=183428308992L, percent=32.1, used=105781678080L, free=164412653568L, active=86735921152, inactive=13656334336, buffers=345395200L, cached=18670260224)
11 svmem(total=270194331648L, available=183427203072L, percent=32.1, used=105782812672L, free=164411518976L, active=86736560128, inactive=13656330240, buffers=345395200L, cached=18670288896)

I even deleted the cursor, the connection and the result list and called garbage collection explicitly, but most of the memory still was not returned.

How could a 10GB table use up 60GB of my memory?

The short answer: the memory overhead of Python's data structures.

You have a table with ~231M rows in ~10GB, so a row averages roughly 45 bytes on disk even though each id itself is only about 4 bytes of actual data.

fetchall() translates every row into a Python tuple and returns them all in one list:

[('abcd',), ('1234',), ... ]

Your list has ~231M elements and sys.getsizeof reports about 1.96GB for it: roughly 8.48 bytes per element. That figure covers only the list object itself, i.e. one pointer per row; the tuples and strings it points to are separate objects.

$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys

A tuple:

>>> a = ('abcd',)
>>> sys.getsizeof(a)
64

A list of one tuple:

>>> al = [('abcd',)]
>>> sys.getsizeof(al)
80

A list of two tuples:

>>> al2 = [('abcd',), ('1234',)]
>>> sys.getsizeof(al2)
88

A list with 10 tuples:

>>> al10 = [ ('abcd',) for x in range(10)]
>>> sys.getsizeof(al10)
200

A list with 1M tuples:

>>> a_really_long = [('abcd',) for x in range(1000000)]
>>> sys.getsizeof(a_really_long)
8697472

Almost our number: about 8.7 bytes per element, and again that is only the list's pointer array, not the tuples it refers to.
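
So where does the rest of the memory go? Each row also carries its own tuple object and its own string object, and each of those has a fixed per-object overhead. A rough back-of-the-envelope accounting, using figures from a 64-bit CPython 2.7 with a UCS-4 unicode build (the exact byte counts vary slightly between builds, and whether you get str or unicode depends on the connector's charset handling):

import sys

print sys.getsizeof(('abcd',))   # 64  - the tuple wrapping each row
print sys.getsizeof('abcd')      # 41  - a 4-character byte string
print sys.getsizeof(u'abcd')     # 68  - a 4-character unicode string

# Rough per-row cost if the connector hands back unicode strings:
#   8 (list slot) + 64 (tuple) + 68 (string) = 140 bytes
# 140 bytes * ~231M rows is on the order of 30GB before allocator
# overhead, fragmentation and whatever intermediate buffers the
# connector allocates while assembling the result.

That is how a table holding a few bytes of data per row can swallow tens of gigabytes of RAM once every value becomes a boxed Python object. The DataFrame built in step 6 adds comparatively little on top, because its object column stores just another pointer to the same strings (which is why pandas reports only "3.4+ GB").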

Unfortunately there isn't much you can do here: mysql.connector chooses that data structure, and a dictionary cursor would use even more memory.

If you need to reduce memory usage, fetch the rows in batches with fetchmany() and a suitable size argument.
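
A minimal sketch of that approach; the batch size of 100000 and the process() helper are placeholders rather than part of your original script:

c = db.cursor()
c.execute("select id from test")

while True:
    batch = c.fetchmany(100000)   # at most 100000 rows per call
    if not batch:                 # an empty list means the result set is exhausted
        break
    process(batch)                # placeholder: do your per-chunk work here

With the default unbuffered cursor, rows should be pulled from the server as each batch is requested, so only one batch of tuples has to live in Python memory at a time.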

Edit: for MySQL, pd.read_sql expects an SQLAlchemy connectable rather than a raw DBAPI connection. Start by using create_engine from SQLAlchemy to connect to your database:

from sqlalchemy import create_engine
engine = create_engine('mysql://user:password@host/dbname')

then call .connect() on the resulting object:

connection = engine.connect()

Pass that connection to pd.read_sql:

df = pd.read_sql("select id from test", connection)

This should decrease your memory footprint.
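
If a single DataFrame of ~231M object rows is still too large, pd.read_sql also accepts a chunksize argument and then yields smaller DataFrames one at a time. A sketch with a placeholder connection URL and an arbitrary chunk size; note that SQLAlchemy's plain mysql:// dialect expects MySQLdb, while mysql+mysqlconnector:// uses the connector you already have installed:

from sqlalchemy import create_engine
import pandas as pd

# Placeholder URL - substitute your real user, password, host and database.
engine = create_engine('mysql+mysqlconnector://user:password@host/dbname')
connection = engine.connect()

# chunksize turns read_sql into an iterator of DataFrames of at most
# 100000 rows each, so only one chunk is resident at a time.
for chunk in pd.read_sql("select id from test", connection, chunksize=100000):
    print len(chunk)   # stand-in for whatever per-chunk work you need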

Would you mind posting the memory usage results after you try the above?
