Why is loading SQLAlchemy objects via the ORM 5-8x slower than rows via a raw MySQLdb cursor?

I noticed that SQLAlchemy was slow at fetching (and ORMing) some data that was rather fast to fetch using bare-bones SQL. First off, I created a database with a million records:

mysql> use foo
mysql> describe Foo;
+-------+---------+------+-----+---------+-------+
| Field | Type    | Null | Key | Default | Extra |
+-------+---------+------+-----+---------+-------+
| id    | int(11) | NO   | PRI | NULL    |       |
| A     | int(11) | NO   |     | NULL    |       |
| B     | int(11) | NO   |     | NULL    |       |
| C     | int(11) | NO   |     | NULL    |       |
+-------+---------+------+-----+---------+-------+
mysql> SELECT COUNT(*) FROM Foo;
+----------+
| COUNT(*) |
+----------+
|  1000000 |
+----------+
mysql> 

As a crude test, querying all Foos takes approximately 2 seconds:

herbert@dev0 ~ $ date; echo 'use foo; select * from Foo;' | mysql -uroot -pxxx > /dev/null; date
zo apr 20 18:48:49 CEST 2014
zo apr 20 18:48:51 CEST 2014

If I do this in Python using MySQLdb, it takes approximately 3 seconds, including the construction of Foo objects:

herbert@dev0 ~ $ python BareORM.py 
query execution time:  0:00:02.198986
total time:  0:00:03.403084

Which is the output of:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb
import sys
import time
import datetime

class Foo:
    def __init__(self, a, b, c):
        self.a=a; self.b=b; self.c=c;

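# define con up front so the finally block cannot hit a NameError
# when MySQLdb.connect() itself fails
con = None
start = datetime.datetime.now()
try:
    con = MySQLdb.connect('localhost', 'root', 'xxx', 'foo')
    cur = con.cursor()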

    cur.execute("""SELECT * FROM Foo LIMIT 1000000""")
    print "query execution time: ", datetime.datetime.now()-start
    foos = [];
    for elem in cur:
        foos.append(Foo(elem[1], elem[2], elem[3]))
    con.commit()

except MySQLdb.Error, e:
    print "Error %d: %s" % (e.args[0], e.args[1])
    sys.exit(1)

finally:
    if con: con.close()
    print "total time: ",  datetime.datetime.now()-start

However, when I used SQLAlchemy to reduce boilerplate code, it needed approximately 25 seconds to do the same job:

herbert@dev0 ~ $ python AlchemyORM.py 
total time:  0:00:24.649279

Using this code:

import sqlalchemy
import datetime
import MySQLdb

from sqlalchemy import Column, Integer, create_engine
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker, relationship, backref

Base = declarative_base()

class Foo(Base):
    __tablename__ = 'Foo'
    id = Column(Integer, primary_key=True)
    A  = Column(Integer(unsigned=False), nullable=False)
    B  = Column(Integer(unsigned=False), nullable=False)
    C  = Column(Integer(unsigned=False), nullable=False)

engine  = create_engine('mysql+mysqldb://root:xxx@localhost/foo')
Session = sessionmaker(bind=engine)
session = Session()
start = datetime.datetime.now()
foos  = session.query(Foo).limit(1000000).all()
print "total time: ", datetime.datetime.now()-start

Why does SQLAlchemy operate ~10x slower than the bare SQL solution, assuming that SQLAlchemy should do approximately the same thing? Can I speed things up somehow?

This is a minimal working example of a more complicated query, which joins several tables using eager loading. I was considering just doing simple queries on a single table, and then using dictionaries to create id->object maps and collate one-to-N relations. But before doing so, I want to be sure that SQLAlchemy is unable to perform better, because writing your own ORM is a bad idea from a software design point of view. IMHO, a 2x slowdown would be acceptable (maybe).

If you know about other (faster) Python SQL ORMs, or maybe BigTable-like solutions (that already are the ORM), feel free to mention them in a comment.

EDIT: I also tried this with Peewee, which resulted in ~15 s.

from peewee import *
import datetime;
database = MySQLDatabase("foo", host="localhost", port=3306, user="root", passwd="xxx")

class Foo(Model):
        id = IntegerField()
        A  = IntegerField()
        B  = IntegerField()
        C  = IntegerField()

        class Meta:
                db_table = 'Foo'
                database = database

start = datetime.datetime.now()
foos = Foo.select()
cnt=0;
for i in foos: cnt=cnt+1
print "total time: ", datetime.datetime.now() - start

EDIT: In response to Matthias, I tried to do the same thing in Java with Hibernate. The result is approximately 8 to 10 seconds: not exactly fast, but a lot faster than 25 seconds. The code, starting with some classes and ending with some configuration:

package herbert.hibernateorm;

import java.util.List;

import org.hibernate.Session; 
import org.hibernate.Transaction;
import org.hibernate.SessionFactory;
import org.hibernate.cfg.Configuration;

public class App {
   public static void main(String[] args) throws Exception {
      SessionFactory factory = new Configuration().configure().buildSessionFactory();
      Session session = factory.openSession();
      Transaction tx = session.beginTransaction();
      long start = System.currentTimeMillis();
      List foos = session.createQuery("FROM Foo").list(); 
      System.out.println(foos.size());
      System.out.printf("total time: %d\n", System.currentTimeMillis() - start);
      session.close();
   }
}

package herbert.hibernateorm;

public class Foo {
    private int id, a, b, c;
    public Foo() {}
    public Foo(int A, int B, int C) { this.a=A; this.b=B; this.c=C; }

    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    public int getA() { return a; }
    public void setA(int a) { this.a = a; }
    public int getB() { return b; }
    public void setB(int b) { this.b = b; }
    public int getC() { return c; }
    public void setC(int c) { this.c = c; }
}

The configuration (hibernate.cfg.xml and hibernate.hbm.xml, respectively):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-configuration PUBLIC "-//Hibernate/Hibernate Configuration DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
  <session-factory>
    <property name="hibernate.dialect">org.hibernate.dialect.MySQLDialect</property>
    <property name="hibernate.connection.driver_class">com.mysql.jdbc.Driver</property>
    <property name="hibernate.connection.url">jdbc:mysql://localhost:3306/foo?zeroDateTimeBehavior=convertToNull</property>
    <property name="hibernate.connection.username">root</property>
    <property name="hibernate.connection.password">xxx</property>
    <mapping resource="hibernate.hbm.xml"/>
  </session-factory>
</hibernate-configuration>

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd">
<hibernate-mapping>
    <class name="herbert.hibernateorm.Foo" table="Foo" catalog="foo">
        <id name="id" type="int">
            <column name="id" />
            <generator class="assigned" />
        </id>
        <property name="a" type="int">
            <column name="A" not-null="true" />
        </property>
        <property name="b" type="int">
            <column name="B" not-null="true" />
        </property>
        <property name="c" type="int">
            <column name="C" not-null="true" />
        </property>
    </class>
</hibernate-mapping>

And finally the pom file to run it all in Maven:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>herbert</groupId>
    <artifactId>hibernateORM</artifactId>
    <version>1.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <name>hibernateORM</name>
    <url>http://maven.apache.org</url>
    <repositories>
        <repository>
            <id>unknown-jars-temp-repo</id>
            <name>A temporary repository created by NetBeans for libraries and jars it could not identify. Please replace the dependencies in this repository with correct ones and delete this repository.</name>
            <url>file:${project.basedir}/lib</url>
        </repository>
    </repositories>
    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>3.8.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.21</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-core</artifactId>
            <version>4.0.1.Final</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate</groupId>
            <artifactId>hibernate-entitymanager</artifactId>
            <version>4.0.1.Final</version>
        </dependency>
        <dependency>
            <groupId>org.hibernate.common</groupId>
            <artifactId>hibernate-commons-annotations</artifactId>
            <version>4.0.1.Final</version>
        </dependency>   
        <dependency>
            <groupId>nz.ac.waikato.cms.weka</groupId>
            <artifactId>weka-dev</artifactId>
            <version>3.7.10</version>
        </dependency>
        <dependency>
            <groupId>commons-configuration</groupId>
            <artifactId>commons-configuration</artifactId>
            <version>1.9</version>
        </dependency>
        <dependency>
            <groupId>commons-net</groupId>
            <artifactId>commons-net</artifactId>
            <version>3.1</version>
            <classifier>examples</classifier>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.2.2</version>
        </dependency>
        <dependency>
            <groupId>maven</groupId>
            <artifactId>maven-jetty-plugin</artifactId>
            <version>1.1</version>
            <type>plugin</type>
        </dependency>
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
        <dependency>
                <groupId>com.kenai.nbpwr</groupId>
                <artifactId>org-slf4j-jdk14</artifactId>
                <version>1.6.1-201106101300</version>
                <type>nbm</type>
        </dependency>

    </dependencies>
</project>

Here is the SQLAlchemy version of your MySQL script that performs in four seconds, compared to three for MySQLdb:

from sqlalchemy import Integer, Column, create_engine, MetaData, Table
import datetime

metadata = MetaData()

foo = Table(
    'foo', metadata,
    Column('id', Integer, primary_key=True),
    Column('a', Integer(), nullable=False),
    Column('b', Integer(), nullable=False),
    Column('c', Integer(), nullable=False),
)


class Foo(object):
    def __init__(self, a, b, c):
        self.a = a
        self.b = b
        self.c = c

engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)
start = datetime.datetime.now()

with engine.connect() as conn:
    foos = [
        Foo(row['a'], row['b'], row['c'])
        for row in
        conn.execute(foo.select().limit(1000000)).fetchall()
    ]


print "total time: ", datetime.datetime.now() - start

Runtime:

total time:  0:00:04.706010

Here is a script that uses the ORM to load object rows fully. By using yield_per() to avoid creating a fixed list of all 1M objects at once, this runs in 13 seconds with SQLAlchemy master (18 seconds with rel 0.9):

import time
from sqlalchemy import Integer, Column, create_engine, Table
from sqlalchemy.orm import Session
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()


class Foo(Base):
    __table__ = Table(
        'foo', Base.metadata,
        Column('id', Integer, primary_key=True),
        Column('a', Integer(), nullable=False),
        Column('b', Integer(), nullable=False),
        Column('c', Integer(), nullable=False),
    )


engine = create_engine('mysql+mysqldb://scott:tiger@localhost/test', echo=True)

sess = Session(engine)

now = time.time()

# avoid using all() so that we don't have the overhead of building
# a large list of full objects in memory
for obj in sess.query(Foo).yield_per(100).limit(1000000):
    pass

print("Total time: %d" % (time.time() - now))

We can then split the difference between these two approaches, and load just individual columns with the ORM:

for obj in sess.query(Foo.id, Foo.a, Foo.b, Foo.c).yield_per(100).limit(1000000):
    pass

The above again runs in 4 seconds.

SQLAlchemy Core is the more apt comparison to a raw MySQLdb cursor. If you use the ORM but query for individual columns, it takes about four seconds in the most recent versions.

At the ORM level, the speed issues arise because creating objects in Python is slow, and the SQLAlchemy ORM applies a large amount of bookkeeping to these objects as it fetches them. That bookkeeping is necessary for it to fulfill its usage contract, including unit of work, identity map, eager loading, collections, etc.
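As a rough illustration of the per-row cost described above, the sketch below loads the same rows once as plain objects and once through a tiny identity map. It is not SQLAlchemy's implementation: the `Foo` class, `load_plain`, and `load_with_identity_map` are hypothetical names, the rows are generated in memory rather than fetched from MySQL, and the identity map here covers only one of the several kinds of bookkeeping a real ORM does.

```python
import time


class Foo(object):
    def __init__(self, id, a, b, c):
        self.id = id
        self.a = a
        self.b = b
        self.c = c


def load_plain(rows):
    """Build one object per row with no extra bookkeeping."""
    return [Foo(*row) for row in rows]


def load_with_identity_map(rows, identity_map):
    """Build objects, reusing the existing instance for a known primary key."""
    result = []
    for row in rows:
        key = (Foo, row[0])  # (class, primary key) identifies one instance
        obj = identity_map.get(key)
        if obj is None:
            obj = Foo(*row)
            identity_map[key] = obj
        result.append(obj)
    return result


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.time()
    value = func(*args)
    return value, time.time() - start


# Stand-in for a result set: (id, a, b, c) tuples, as a raw cursor would yield.
rows = [(i, i + 1, i + 2, i + 3) for i in range(100000)]
identity_map = {}

plain, t_plain = timed(load_plain, rows)
mapped, t_mapped = timed(load_with_identity_map, rows, identity_map)

print("plain objects: %.3f s" % t_plain)
print("identity map:  %.3f s" % t_mapped)
```

The timings depend on the machine and interpreter, so none are quoted here. The point is only that every extra dictionary lookup and attribute assignment is paid once per row, a million times over in the question's case.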

To speed up the query dramatically, fetch individual columns instead of full objects. See the techniques at http://docs.sqlalchemy.org/en/latest/faq/performance.html#result-fetching-slowness-orm which describe this.

For your comparison to PeeWee, PW is a much simpler system with far fewer features, including that it doesn't do anything with identity maps. Even with PeeWee, about as simple an ORM as is feasible, it still takes 15 seconds, which is evidence that CPython is really, really slow compared to the raw MySQLdb fetch, which is in straight C.

For comparison to Java, the Java VM is way, way, way faster than CPython. Hibernate is ridiculously complicated, yet the Java VM is extremely fast due to the JIT, and even all that complexity ends up running faster. If you want to compare Python to Java, use PyPy.

SQLAlchemy is complicated. It has to deal with converting types to Python which the underlying database does not support natively, tables with inheritance, JOINs, caching the objects, maintaining consistency, translated rows, partial results, and whatnot. Check out sqlalchemy/orm/loading.py:instance_processor; it's insane.

The solution would be to piece together and compile Python code to process the results of a specific query, like Jinja2 does for templates. So far, nobody has done this work, possibly because the common case is a couple of rows (where this kind of optimization would be pessimal) and people who need to process bulk data do that by hand, like you did.
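To make the idea concrete, here is a minimal sketch of what "compiling a loader for one specific query" could look like. It is an assumption-laden toy, not something SQLAlchemy provides: `compile_loader` and the `Foo` class are hypothetical, the column layout is passed in by hand, and the sketch targets Python 3. It generates Python source with one plain assignment per column, then compiles it with the built-in `compile()` and `exec()`.

```python
import keyword


class Foo(object):
    pass


def compile_loader(attr_to_index):
    """Generate and compile a row loader specialized to one column layout.

    attr_to_index maps an attribute name to its position in each row tuple.
    Returns the compiled function and the generated source text.
    """
    lines = [
        "def load(rows, cls):",
        "    result = []",
        "    for row in rows:",
        "        obj = cls.__new__(cls)",  # skip __init__, assign directly
    ]
    for attr, index in sorted(attr_to_index.items()):
        # Names are spliced into source code, so reject anything unsafe.
        if not attr.isidentifier() or keyword.iskeyword(attr):
            raise ValueError("not a valid attribute name: %r" % attr)
        lines.append("        obj.%s = row[%d]" % (attr, index))
    lines.append("        result.append(obj)")
    lines.append("    return result")
    source = "\n".join(lines)

    namespace = {}
    exec(compile(source, "<generated loader>", "exec"), namespace)
    return namespace["load"], source


# Layout of `SELECT id, A, B, C FROM Foo`: skip the id, keep the rest.
load_foos, source = compile_loader({"a": 1, "b": 2, "c": 3})
foos = load_foos([(1, 10, 20, 30), (2, 11, 21, 31)], Foo)

print(source)
print(foos[1].a, foos[1].b, foos[1].c)
```

The generated function contains no per-column loop or type dispatch at run time, which is where the hoped-for saving would come from. The one-off cost of generating and compiling the source is why this would not pay off for queries that return only a couple of rows.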

This is not an answer to my question, but may help the general public with speed issues on large data sets. I found that selecting a million records can typically be done in about 3 seconds; however, JOINs may slow down the process. In the case where one has approximately 150k Foos with a one-to-many relation to 1M Bars, selecting them using a JOIN may be slow, as each Foo is returned approximately 6.5 times. I found that selecting both tables separately and joining them using dicts in Python is approximately 3 times faster than SQLAlchemy (approx. 25 sec) and 2 times faster than 'bare' Python code using joins (approx. 17 sec). The code took 8 sec in my use case. Selecting 1M records without relations, like the Bar example above, took 3 seconds. I used this code:

#!/usr/bin/python
# -*- coding: utf-8 -*-

import MySQLdb
import sys
import time
import datetime
import inspect
from operator import itemgetter, attrgetter

# fetch all objects of class Class, where the fields are determined as the
# arguments of the __init__ constructor (not flexible, but fairly simple ;))
def fetch(Class, cursor, tablename, ids=("id",), where=None):
    # column names are taken from the constructor arguments, minus `self`
    arguments = inspect.getargspec(Class.__init__).args[1:]
    fields = ", ".join(["`" + tablename + "`.`" + column + "`" for column in arguments])
    sql = "SELECT " + fields + " FROM `" + tablename + "`"
    # `where` is spliced into the SQL verbatim, so it must come from trusted code
    if where is not None:
        sql = sql + " WHERE " + where
    sql = sql + ";"
    # with a single id, getId returns a scalar key; with several, a tuple
    getId = itemgetter(*[arguments.index(x) for x in ids])
    elements = dict()

    cursor.execute(sql)
    for record in cursor:
        elements[getId(record)] = Class(*record)
    return elements

# attach the objects in dict2 to dict1, given a 1-many relation between both
def merge(dict1, fieldname, dict2, ids):
    idExtractor = attrgetter(*ids)
    for d in dict1: setattr(dict1[d], fieldname, list())
    for d in dict2:
        dd = dict2[d]
        getattr(dict1[idExtractor(dd)], fieldname).append(dd)

# attach dict2 objects to dict1 objects, given a 1-1 relation
def attach(dict1, fieldname, dict2, ids):
    idExtractor = attrgetter(*ids)
    for d in dict1: dd=dict1[d]; setattr(dd, fieldname, dict2[idExtractor(dd)])

It helped me speed up my querying; however, I am more than happy to hear from the experts about possible improvements to this approach.
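For readers who want to try the "two queries plus a dict" idea without a MySQL server, here is a self-contained sketch of the same technique. It swaps in the standard-library `sqlite3` module with an in-memory database in place of MySQLdb, and the `Foo`/`Bar` schema, column names, and sample rows are made up for illustration, so it demonstrates the collation step only, not the timings quoted above.

```python
import sqlite3


class Foo(object):
    def __init__(self, id, a):
        self.id = id
        self.a = a
        self.bars = []  # filled in by the collation step below


class Bar(object):
    def __init__(self, id, foo_id, value):
        self.id = id
        self.foo_id = foo_id
        self.value = value


# Hypothetical schema: each Bar row points at one Foo via foo_id.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Foo (id INTEGER PRIMARY KEY, a INTEGER NOT NULL);
    CREATE TABLE Bar (id INTEGER PRIMARY KEY,
                      foo_id INTEGER NOT NULL,
                      value INTEGER NOT NULL);
    INSERT INTO Foo VALUES (1, 10), (2, 20);
    INSERT INTO Bar VALUES (1, 1, 100), (2, 1, 101), (3, 2, 200);
""")

# Query 1: every Foo exactly once, indexed by primary key.
foos = {row[0]: Foo(*row) for row in con.execute("SELECT id, a FROM Foo")}

# Query 2: every Bar exactly once, attached to its parent via the dict.
# A JOIN would instead repeat the Foo columns once per matching Bar.
for row in con.execute("SELECT id, foo_id, value FROM Bar ORDER BY id"):
    bar = Bar(*row)
    foos[bar.foo_id].bars.append(bar)

con.close()

for foo_id in sorted(foos):
    print(foo_id, [bar.value for bar in foos[foo_id].bars])
```

The dict lookup `foos[bar.foo_id]` raises `KeyError` for a Bar whose parent is missing, which is the same behavior as the `merge` helper above; whether to skip or report such rows is a design choice left open here.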
