MySQL 5.7.18
Python 2.7.5
Pandas 0.17.1
CentOS 7.3
A MySQL table:
CREATE TABLE test (
id varchar(12)
) ENGINE=InnoDB;
The size is 10GB.
select round(((data_length) / 1024 / 1024 / 1024)) "GB"
from information_schema.tables
where table_name = "test"
10GB
The box has 250GB memory:
$ free -hm
total used free shared buff/cache available
Mem: 251G 15G 214G 2.3G 21G 232G
Swap: 2.0G 1.2G 839M
Select the data:
import psutil
print '1 ' + str(psutil.phymem_usage())
import os
import sys
import time
import pyodbc
import mysql.connector
import pandas as pd
from datetime import date
import gc
print '2 ' + str(psutil.phymem_usage())
db = mysql.connector.connect({snip})
c = db.cursor()
print '3 ' + str(psutil.phymem_usage())
c.execute("select id from test")
print '4 ' + str(psutil.phymem_usage())
e = c.fetchall()
print 'getsizeof: ' + str(sys.getsizeof(e))
print '5 ' + str(psutil.phymem_usage())
d = pd.DataFrame(e)
print d.info()
print '6 ' + str(psutil.phymem_usage())
c.close()
print '7 ' + str(psutil.phymem_usage())
db.close()
print '8 ' + str(psutil.phymem_usage())
del c, db, e
print '9 ' + str(psutil.phymem_usage())
gc.collect()
print '10 ' + str(psutil.phymem_usage())
time.sleep(60)
print '11 ' + str(psutil.phymem_usage())
The output:
1 svmem(total=270194331648L, available=249765777408L, percent=7.6, used=39435464704L, free=230758866944L, active=20528222208, inactive=13648789504, buffers=345387008L, cached=18661523456)
2 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
3 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
4 svmem(total=270194331648L, available=249729019904L, percent=7.6, used=39472222208L, free=230722109440L, active=20563484672, inactive=13648793600, buffers=345387008L, cached=18661523456)
getsizeof: 1960771816
5 svmem(total=270194331648L, available=181568315392L, percent=32.8, used=107641655296L, free=162552676352L, active=88588271616, inactive=13656334336, buffers=345395200L, cached=18670243840)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 231246823 entries, 0 to 231246822
Data columns (total 1 columns):
0 object
dtypes: object(1)
memory usage: 3.4+ GB
None
6 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
7 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
8 svmem(total=270194331648L, available=181571620864L, percent=32.8, used=107638353920L, free=162555977728L, active=88587603968, inactive=13656334336, buffers=345395200L, cached=18670247936)
9 svmem(total=270194331648L, available=183428308992L, percent=32.1, used=105781678080L, free=164412653568L, active=86735921152, inactive=13656334336, buffers=345395200L, cached=18670260224)
10 svmem(total=270194331648L, available=183428308992L, percent=32.1, used=105781678080L, free=164412653568L, active=86735921152, inactive=13656334336, buffers=345395200L, cached=18670260224)
11 svmem(total=270194331648L, available=183427203072L, percent=32.1, used=105782812672L, free=164411518976L, active=86736560128, inactive=13656330240, buffers=345395200L, cached=18670288896)
I even closed and deleted the cursor, the connection, and the result list, and forced garbage collection.
How could a 10GB table use up 60GB of my memory?
The short answer: Python data structures carry significant per-object memory overhead.
You have a table with ~231M rows taking ~10GB on disk, so each row averages roughly 46 bytes there, even though the actual id payload is only a few characters.
fetchall
translates the result set into a list of tuples like this:
[('abcd',), ('1234',), ... ]
Your list has ~231M elements. sys.getsizeof reports ~1.96GB for the list object itself — about 8.48 bytes per element — but that counts only the pointer slots; each tuple, and the string inside it, is a separate object with its own overhead on top. That is where the ~60GB goes.
$ python
Python 2.7.12 (default, Nov 19 2016, 06:48:10)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
A tuple:
>>> a = ('abcd',)
>>> sys.getsizeof(a)
64
A list of one tuple:
>>> al = [('abcd',)]
>>> sys.getsizeof(al)
80
A list of two tuples:
>>> al2 = [('abcd',), ('1234',)]
>>> sys.getsizeof(al2)
88
A list with 10 tuples:
>>> al10 = [ ('abcd',) for x in range(10)]
>>> sys.getsizeof(al10)
200
A list with 1M tuples:
>>> a_really_long = [ ('abcd',) for x in range(1000000)]
>>> sys.getsizeof(a_really_long)
8697472
Almost our number: about 8.7 bytes per element for the list alone. (Note that in CPython the literal ('abcd',) is constant-folded, so the comprehension stores a million references to the same tuple; a real fetchall builds a distinct tuple and string per row, which costs far more.)
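To see how much the shallow getsizeof hides, you can sum the sizes of the container, the tuples, and the strings for rows that are all distinct objects, as they are after a real fetchall (the row count here is just for illustration):

```python
import sys

n = 100000
# str(i) forces a distinct tuple and string per row, like fetchall() produces
rows = [(str(i),) for i in range(n)]

shallow = sys.getsizeof(rows)  # the pointer array only
deep = shallow + sum(sys.getsizeof(t) + sys.getsizeof(t[0]) for t in rows)

print('shallow per row: %.1f bytes' % (shallow / float(n)))
print('deep per row: %.1f bytes' % (deep / float(n)))
```

The deep figure is an order of magnitude larger than the shallow one, and it still ignores allocator fragmentation, which pushes real process memory higher again.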
Unfortunately there isn't much you can do about the per-row cost: mysql.connector
chooses the data structure, and a dictionary cursor would use even more memory.
If you need to reduce peak memory usage, fetch the rows in batches with fetchmany and a suitable size argument.
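The fetchmany pattern looks like this — sketched with the stdlib sqlite3 module, since it exposes the same DB-API cursor interface as mysql.connector (the table, row count, and batch size are illustrative):

```python
import sqlite3

# Stand-in database; with mysql.connector the cursor calls are identical.
db = sqlite3.connect(":memory:")
db.execute("create table test (id varchar(12))")
db.executemany("insert into test values (?)", [(str(i),) for i in range(10000)])

c = db.cursor()
c.execute("select id from test")

total = 0
while True:
    batch = c.fetchmany(1000)   # only ~1000 rows are live at a time
    if not batch:
        break
    total += len(batch)         # process the batch here instead of keeping it

print(total)
c.close()
db.close()
```

Each batch becomes garbage as soon as the next one is fetched, so peak memory is bounded by the batch size rather than the table size.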
Edit: pd.read_sql
officially supports SQLAlchemy connectables (plus raw sqlite3 connections), so start by using create_engine
from SQLAlchemy to connect to your database:
from sqlalchemy import create_engine
engine = create_engine('mysql://database')
then call .connect()
on the resulting object:
connection = engine.connect()
Pass that connection to pd.read_sql:
df = pd.read_sql("select id from test", connection)
This should decrease your memory footprint.
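If even the resulting DataFrame is too large, pd.read_sql also accepts a chunksize argument, in which case it yields DataFrames piece by piece — shown here against a stdlib sqlite3 connection, which pandas supports directly (the table and chunk size are illustrative):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")
conn.execute("create table test (id varchar(12))")
conn.executemany("insert into test values (?)", [(str(i),) for i in range(5000)])

total = 0
# chunksize makes read_sql return an iterator of DataFrames
for chunk in pd.read_sql("select id from test", conn, chunksize=1000):
    total += len(chunk)  # process each chunk, then let it be freed

print(total)
conn.close()
```

This keeps only one chunk's worth of rows in memory at a time, similar in spirit to the fetchmany approach above.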
Would you mind posting the memory usage results after you try the above?