简体   繁体   中英

Does the number of columns affect the speed of sqlalchemy?

I created two tables with sqlalchemy (python 2.7), the database is mysql 5.5. The following is my code:

engine = create_engine('mysql://root:123@localhost/test')

metadata = MetaData()

conn = engin.connect()

# For table 1:

columns = []

for i in xrange(100):

    columns.append(Column('c%d' % i, TINYINT, nullable = False, server_default = '0'))
    columns.append(Column('d%d' % i, SmallInteger, nullable = False, server_default = '0'))

user = Table('user', metadata, *columns)
# So user has 100 tinyint columns and 100 smallint columns.

# For table 2:

user2 = Table('user2', metadata,

        Column('c0', BINARY(100), nullable = False, server_default='\0'*100),
        Column('d0', BINARY(200), nullable = False, server_default='\0'*200),
)

# user2 has two columns contains 100 bytes and 200 bytes respectively. 

I then inserted 4000 rows into each table. Since these two tables have same row length, I
expect the select speed will be almost the same. I ran the following test code:

s1 = select([user]).compile(engine)

s2 = select([user2]).compile(engine)

t1 = time()

result = conn.execute(s1).fetchall()

print 't1:', time() - t1 

t2 = time()

result = conn.execute(s2).fetchall()

print 't2', time() - t2 

The result is :

t1: 0.5120000

t2: 0.0149999

Does this means the number of columns in table will dramatically affect the performance of SQLAlchemy? Thank you in advance!

Does this means the number of columns in table will dramatically affect the performance of SQLAlchemy?

well thats a tough one, and it probably depends more on the underlying SQL engine, MySQL in this case, then actually sqlalchemy , which is nothing more than a way to interact with different db engines while using the same interface.

SQLAlchemy is the Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.

It provides a full suite of well known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.

I actually ran some tests ...

import timeit

setup = """
from sqlalchemy import create_engine, MetaData, select, Table, Column
from sqlalchemy.dialects.sqlite import BOOLEAN, SMALLINT, VARCHAR
engine = create_engine('sqlite://', echo = False)
metadata = MetaData()
conn = engine.connect()
columns = []

for i in xrange(100):
    columns.append(Column('c%d' % i, VARCHAR(1), nullable = False, server_default = '0'))
    columns.append(Column('d%d' % i, VARCHAR(2), nullable = False, server_default = '00'))  


user = Table('user', metadata, *columns)
user.create(engine)
conn.execute(user.insert(), [{}] * 4000)

user2 = Table('user2', metadata, Column('c0', VARCHAR(100), nullable = False, server_default = '0' * 100),  \
                                 Column('d0', VARCHAR(200), nullable = False, server_default = '0' * 200))
user2.create(engine)
conn.execute(user2.insert(), [{}] * 4000)
"""

many_columns = """
s1 = select([user]).compile(engine)
result = conn.execute(s1).fetchall()
"""

two_columns = """
s2 = select([user2]).compile(engine)
result = conn.execute(s2).fetchall()
"""

raw_many_columns = "res = conn.execute('SELECT * FROM user').fetchall()"
raw_two_columns = "res = conn.execute('SELECT * FROM user2').fetchall()"

timeit.Timer(two_columns, setup).timeit(number = 1)
timeit.Timer(raw_two_columns, setup).timeit(number = 1)
timeit.Timer(many_columns, setup).timeit(number = 1)
timeit.Timer(raw_many_columns, setup).timeit(number = 1)

>>> timeit.Timer(two_columns, setup).timeit(number = 1)
0.010751008987426758
>>> timeit.Timer(raw_two_columns, setup).timeit(number = 1)
0.0099620819091796875
>>> timeit.Timer(many_columns, setup).timeit(number = 1)
0.23563408851623535
>>> timeit.Timer(raw_many_columns, setup).timeit(number = 1)
0.21881699562072754

I did find this:
http://www.mysqlperformanceblog.com/2009/09/28/how-number-of-columns-affects-performance/

which was kind of interesting though he used max for testing ...

I really do love sqlalchemy, so I decided to compare it using pythons own sqlite3 module

import timeit
setup = """
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()

c.execute('CREATE TABLE user (%s)' %\
          ("".join(("c%i VARCHAR(1) DEFAULT '0' NOT NULL, d%i VARCHAR(2) DEFAULT '00' NOT NULL," % (index, index) for index in xrange(99))) +\
           "c99 VARCHAR(1) DEFAULT '0' NOT NULL, d99 VARCHAR(2) DEFAULT '0' NOT NULL"))

c.execute("CREATE TABLE user2 (c0 VARCHAR(100) DEFAULT '%s' NOT NULL, d0 VARCHAR(200) DEFAULT '%s' NOT NULL)" % ('0'* 100, '0'*200))

conn.commit()
c.executemany('INSERT INTO user VALUES (%s)' % ('?,' * 199 + '?'), [('0',) * 200] * 4000)
c.executemany('INSERT INTO user2 VALUES (?,?)', [('0'*100, '0'*200)] * 4000)
conn.commit()
"""

many_columns = """
r = c.execute('SELECT * FROM user')
all = r.fetchall()
"""

two_columns = """
r2 = c.execute('SELECT * FROM user2')
all = r2.fetchall()
"""

timeit.Timer(many_columns, setup).timeit(number = 1)
timeit.Timer(two_columns, setup).timeit(number = 1)

>>> timeit.Timer(many_columns, setup).timeit(number = 1)
0.21009302139282227
>>> timeit.Timer(two_columns, setup).timeit(number = 1)
0.0083379745483398438

and came up with the same result, so I really do think its a database implementation not a sqlalchemy issue.

import timeit

setup = """
from sqlalchemy import create_engine, MetaData, select, Table, Column
from sqlalchemy.dialects.sqlite import BOOLEAN, SMALLINT, VARCHAR
engine = create_engine('sqlite://', echo = False)
metadata = MetaData()
conn = engine.connect()
columns = []

for i in xrange(100):
    columns.append(Column('c%d' % i, VARCHAR(1), nullable = False, server_default = '0'))
    columns.append(Column('d%d' % i, VARCHAR(2), nullable = False, server_default = '00'))


user = Table('user', metadata, *columns)
user.create(engine)

user2 = Table('user2', metadata, Column('c0', VARCHAR(100), nullable = False, server_default = '0' * 100),  \
                                 Column('d0', VARCHAR(200), nullable = False, server_default = '0' * 200))
user2.create(engine)
"""

many_columns = """
conn.execute(user.insert(), [{}] * 4000)
"""

two_columns = """
conn.execute(user2.insert(), [{}] * 4000)
"""

>>> timeit.Timer(two_columns, setup).timeit(number = 1)
0.017949104309082031
>>> timeit.Timer(many_columns, setup).timeit(number = 1)
0.047809123992919922

testing with sqlite3 module.

import timeit
setup = """
import sqlite3
conn = sqlite3.connect(':memory:')
c = conn.cursor()

c.execute('CREATE TABLE user (%s)' %\
    ("".join(("c%i VARCHAR(1) DEFAULT '0' NOT NULL, d%i VARCHAR(2) DEFAULT '00' NOT NULL," % (index, index) for index in xrange(99))) +\
            "c99 VARCHAR(1) DEFAULT '0' NOT NULL, d99 VARCHAR(2) DEFAULT '0' NOT NULL"))

c.execute("CREATE TABLE user2 (c0 VARCHAR(100) DEFAULT '%s' NOT NULL, d0 VARCHAR(200) DEFAULT '%s' NOT NULL)" % ('0'* 100, '0'*200))
conn.commit()
"""

many_columns = """
c.executemany('INSERT INTO user VALUES (%s)' % ('?,' * 199 + '?'), [('0', '00') * 100] * 4000)
conn.commit()
"""

two_columns = """
c.executemany('INSERT INTO user2 VALUES (?,?)', [('0'*100, '0'*200)] * 4000)
conn.commit()
"""

timeit.Timer(many_columns, setup).timeit(number = 1)
timeit.Timer(two_columns, setup).timeit(number = 1)

>>> timeit.Timer(many_columns, setup).timeit(number = 1)
0.14044189453125
>>> timeit.Timer(two_columns, setup).timeit(number = 1)
0.014360189437866211
>>>    

Samy.vilar's answer is excellent. But one key thing to remember is that the number of columns will have an impact on the performance of any database and any ORM. The more columns you have, the more data is being accessed from disk and transferred.

Also, depending on the query and table structures, adding more columns could change a query from being covered by an index to being forced to access the base table, which can add substantial time under certain databases and certain circumstances.

I have only played with SQLAlchemy a little bit, but as a DBA I generally advise the developers I work with to only query the columns that they will need and to avoid use of "select *" in production code, both because it is likely to contain more columns than needed and because it makes the code more brittle in the face of potential columns being added to the table/view.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM