
SQLAlchemy hangs during insert while querying [INFORMATION_SCHEMA].[TABLES]

I have a Python process that uses SQLAlchemy to insert some data into a MS SQL Server DB. When the Python process runs it hangs during the insert. I turned on SQLAlchemy logging to get some more information. I found that it hangs at this point where SQLAlchemy seems to be requesting table schema info about the entire DB:

2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:12:07 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')

I have other activity going on in the DB at the same time, including some open transactions, and my guess is that for whatever reason the query against [INFORMATION_SCHEMA].[TABLES] is being blocked or is part of a deadlock.

I've also read ( here ) that [INFORMATION_SCHEMA].[TABLES] is a view, and that a view cannot cause a deadlock, which would contradict my guess about the cause of this issue.

My question is: Can I alter the configuration/settings of SQLAlchemy so that it does not make this query in the first place?

UPDATE 1: The Python code for the insert looks like this:

with sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params).connect() as connection:
    # df is a Pandas DataFrame
    df.to_sql(name=my_table, con=connection, if_exists='append', index=False)

Note that the code works without any problems when I run the Python script during other times of the day when I don't have those other DB transactions going on. In those cases, the log continues immediately like this, listing all the tables in the DB:

2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1235) INFO: SELECT [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME] 
FROM [INFORMATION_SCHEMA].[TABLES] 
WHERE [INFORMATION_SCHEMA].[TABLES].[TABLE_SCHEMA] = CAST(? AS NVARCHAR(max)) AND [INFORMATION_SCHEMA].[TABLES].[TABLE_TYPE] = CAST(? AS NVARCHAR(max)) ORDER BY [INFORMATION_SCHEMA].[TABLES].[TABLE_NAME]
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._execute_context(base.py:1240) INFO: ('dbo', 'BASE TABLE')
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine._init_metadata(result.py:810) DEBUG: Col ('TABLE_NAME',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table1',)
2020-10-30 08:13:03 [11444:6368] sqlalchemy.engine.base.Engine.process_rows(result.py:1260) DEBUG: Row ('t_table2',)
...

UPDATE 2: Apparently, when a table or other object is created in an open transaction that has not been committed yet, querying [INFORMATION_SCHEMA].[TABLES] will be blocked ( source ). Is anyone familiar enough with the internals of SQLAlchemy to suggest how to prevent it from making this query in the first place?

UPDATE 3: After posting this issue on the SQLAlchemy GitHub ( issue link ), the SQLAlchemy devs confirmed that the query of [INFORMATION_SCHEMA].[TABLES] is in fact issued by the Pandas function to_sql() .

So, my new question is: does anyone know how to disable this behavior in the Pandas to_sql() function? I looked over the documentation and could not find anything that seemed to help.

I'm not very familiar with SQLAlchemy, but I can tell you about the Pandas side of this issue.

Pandas automatically creates a new table if the target table doesn't exist. It determines whether the table exists by calling SQLAlchemy's has_table() , and has_table() works by querying the information schema. (At least, it works that way in the MySQL and MSSQL dialects.)
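You can see the same check directly through SQLAlchemy's inspection API. This is a minimal sketch: an in-memory SQLite engine stands in for the real MSSQL connection so the snippet runs anywhere, but against mssql+pyodbc this same call is what issues the INFORMATION_SCHEMA.TABLES query ( Inspector.has_table() requires SQLAlchemy 1.4+):

```python
import sqlalchemy as sa

# In-memory SQLite stands in for the real MSSQL engine here; the
# inspection API is the same across dialects.
engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE t_table1 (id INTEGER)"))

insp = sa.inspect(engine)                    # Inspector object
exists = insp.has_table("t_table1")          # the check Pandas relies on
missing = insp.has_table("no_such_table")
print(exists, missing)  # True False
```

On MSSQL this is exactly the point where, if another open transaction holds an uncommitted DDL change, the SELECT against the information schema blocks.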

Implementation details

Here's what I found tracing the logic for this in Pandas and SQLAlchemy. We start in pandas/io/sql.py, inside to_sql() .

        table = SQLTable(
            name,
            self,
            frame=frame,
            index=index,
            if_exists=if_exists,
            index_label=index_label,
            schema=schema,
            dtype=dtype,
        )
        table.create()

SQLTable.create() is defined here:

class SQLTable(PandasObject):
    [...]
    def create(self):
        if self.exists():
            if self.if_exists == "fail":
                raise ValueError(f"Table '{self.name}' already exists.")
            elif self.if_exists == "replace":
                self.pd_sql.drop_table(self.name, self.schema)
                self._execute_create()
            elif self.if_exists == "append":
                pass
            else:
                raise ValueError(f"'{self.if_exists}' is not valid for if_exists")
        else:
            self._execute_create()

Notice that it calls exists() unconditionally. Inside SQLTable.exists() , you'll find this:

    def exists(self):
        return self.pd_sql.has_table(self.name, self.schema)

This eventually calls has_table() in SQLAlchemy: https://docs.sqlalchemy.org/en/13/core/internals.html#sqlalchemy.engine.default.DefaultDialect.has_table

For MSSQL, this is implemented in sqlalchemy/dialects/mssql/base.py in SQLAlchemy:

    @_db_plus_owner
    def has_table(self, connection, tablename, dbname, owner, schema):
        if tablename.startswith("#"):  # temporary table
            [...]
        else:
            tables = ischema.tables

            s = sql.select(tables.c.table_name).where(
                sql.and_(
                    tables.c.table_type == "BASE TABLE",
                    tables.c.table_name == tablename,
                )
            )

            if owner:
                s = s.where(tables.c.table_schema == owner)

            c = connection.execute(s)

            return c.first() is not None

( ischema is shorthand for information_schema, and this code runs a SELECT against the INFORMATION_SCHEMA.TABLES view.)

How to fix this

I don't see a good, simple way to fix this. Pandas assumes that has_table() is a cheap operation. MSSQL doesn't follow that assumption. No matter what if_exists is set to, Pandas will call has_table() during to_sql() .

I can think of a hacky workaround, though. If you monkey-patch pandas.io.sql.SQLTable.create() so that it's a no-op, you can trick Pandas into thinking the table already exists. The downside is that Pandas will no longer create tables automatically.

Execute SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED before calling to_sql() . That way the INFORMATION_SCHEMA query can read past the uncommitted DDL instead of blocking on it.
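One way to apply that isolation level through SQLAlchemy is execution_options() on the connection. A sketch: SQLite stands in for the real "mssql+pyodbc://..." engine so the snippet runs anywhere, and both dialects accept this isolation-level name; note that reading uncommitted data is a deliberate trade-off:

```python
import sqlalchemy as sa

# Assumed stand-in engine; in the question this would be
# sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params).
engine = sa.create_engine("sqlite://")

with engine.connect().execution_options(
    isolation_level="READ UNCOMMITTED"
) as connection:
    # With the real engine, the insert would go here:
    # df.to_sql(name=my_table, con=connection, if_exists='append', index=False)
    ok = connection.execute(sa.text("SELECT 1")).scalar()
print(ok)  # 1
```

Under READ UNCOMMITTED, the has_table() query that to_sql() issues no longer waits on uncommitted transactions, at the cost of potentially seeing in-flight changes.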
