
Loading Dask dataframe with SQLAlchemy fails

I'm trying to load a Dask dataframe with SQLAlchemy using dd.read_sql_query. I define a table where the column balance_date has type DateTime (in the database it is type DATE):

from sqlalchemy import (Column, DateTime, Float, Integer, String,
                        PrimaryKeyConstraint, select)
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class test_loans(Base):
    __tablename__ = 'test_loans'
    annual_income = Column(Float)
    balance = Column(Float)
    balance_date = Column(DateTime)  # the type of the column is DateTime
    cust_segment = Column(String)
    total_amount_paid = Column(Float)
    the_key = Column(Integer)
    __table_args__ = (PrimaryKeyConstraint(the_key),)

The problem is that dd.read_sql_query fails, saying that the index column is of type object rather than numeric or datetime:

import dask.dataframe as dd

stmt = select([test_loans.balance_date, test_loans.total_amount_paid])
ddf = dd.read_sql_query(stmt, con=con, index_col='balance_date', npartitions=3)

I get

TypeError: Provided index column is of type "object".  If divisions is
not provided the index column type must be numeric or datetime.

How can I fix this? Is this a defect?

The problem is solved by casting the column to DateTime in the SQLAlchemy select statement, as in the sketch below.
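
A minimal sketch of that fix, assuming the same test_loans model and connection string con as in the question; the .label() call keeps the column name matching index_col:

from sqlalchemy import DateTime, cast, select
import dask.dataframe as dd

# Cast balance_date to DateTime so the min/max index query returns a
# datetime dtype instead of object.
stmt = select([
    cast(test_loans.balance_date, DateTime).label('balance_date'),
    test_loans.total_amount_paid,
])
ddf = dd.read_sql_query(stmt, con=con, index_col='balance_date', npartitions=3)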

It is a bug in dask.dataframe: when no divisions are given, it fetches the min and max values of the index column with pandas.read_sql, which does not parse dates automatically. The resulting min/max dataframe therefore has object dtype, and that dtype is reused for the divisions, where it is rejected.
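
A small sketch of that mechanism, assuming a database connection con and the test_loans table from the question (the query and alias names lo/hi are illustrative):

import pandas as pd

# Without parse_dates, a DATE column typically comes back as Python
# date objects, i.e. dtype 'object' -- the type dask then rejects.
minmax = pd.read_sql(
    "SELECT min(balance_date) AS lo, max(balance_date) AS hi FROM test_loans",
    con,
)
print(minmax.dtypes)  # lo: object, hi: object

# With parse_dates (or a SQL-side cast, as in the fix above), pandas
# returns datetime64[ns], which dask accepts for computing divisions.
minmax = pd.read_sql(
    "SELECT min(balance_date) AS lo, max(balance_date) AS hi FROM test_loans",
    con,
    parse_dates=["lo", "hi"],
)
print(minmax.dtypes)  # lo: datetime64[ns], hi: datetime64[ns]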

Here is the culprit code: https://github.com/dask/dask/blob/8b95f983c232c1bd628e9cba0695d3ef229d290b/dask/dataframe/io/sql.py#L130

NB. I filed a GitHub issue: https://github.com/dask/dask/issues/9383
