
Loading Dask dataframe with SQLAlchemy fails

I'm trying to load a Dask dataframe with SQLAlchemy using dd.read_sql_query. I define a table where the column balance_date has type DateTime (in the database it is type DATE):

from sqlalchemy import (Column, DateTime, Float, Integer, String,
                        PrimaryKeyConstraint, select)
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class test_loans(Base):
    __tablename__ = 'test_loans'
    annual_income = Column(Float)
    balance = Column(Float)
    balance_date = Column(DateTime)  # the type of the column is DateTime
    cust_segment = Column(String)
    total_amount_paid = Column(Float)
    the_key = Column(Integer)
    __table_args__ = (PrimaryKeyConstraint(the_key),)

The problem is that dd.read_sql_query fails, saying that the index column is of type object rather than numeric or datetime:

import dask.dataframe as dd

stmt = select([test_loans.balance_date, test_loans.total_amount_paid])
ddf = dd.read_sql_query(stmt, con=con, index_col='balance_date', npartitions=3)

I get

TypeError: Provided index column is of type "object".  If divisions is
not provided the index column type must be numeric or datetime.

How can I fix this? Is this a defect?

The problem is solved by casting the column to DateTime in the SQLAlchemy select statement, as in the sketch below.
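
A minimal sketch of that fix, assuming the same test_loans model and connection string con as in the question; the .label() call keeps the column name matching index_col:

from sqlalchemy import DateTime, cast, select
import dask.dataframe as dd

# Cast balance_date to DateTime so the min/max index query returns a
# datetime dtype instead of object.
stmt = select([
    cast(test_loans.balance_date, DateTime).label('balance_date'),
    test_loans.total_amount_paid,
])
ddf = dd.read_sql_query(stmt, con=con, index_col='balance_date', npartitions=3)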

It is a bug in dask.dataframe: when no divisions are given, it fetches the min and max values of the index column with pandas.read_sql, which does not parse dates automatically. The resulting min/max dataframe therefore has object dtype, and that dtype is reused for the divisions, where it is rejected.
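
A small sketch of that mechanism, assuming a database connection con and the test_loans table from the question (the query and alias names lo/hi are illustrative):

import pandas as pd

# Without parse_dates, a DATE column typically comes back as Python
# date objects, i.e. dtype 'object' -- the type dask then rejects.
minmax = pd.read_sql(
    "SELECT min(balance_date) AS lo, max(balance_date) AS hi FROM test_loans",
    con,
)
print(minmax.dtypes)  # lo: object, hi: object

# With parse_dates (or a SQL-side cast, as in the fix above), pandas
# returns datetime64[ns], which dask accepts for computing divisions.
minmax = pd.read_sql(
    "SELECT min(balance_date) AS lo, max(balance_date) AS hi FROM test_loans",
    con,
    parse_dates=["lo", "hi"],
)
print(minmax.dtypes)  # lo: datetime64[ns], hi: datetime64[ns]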

Here is the culprit code: https://github.com/dask/dask/blob/8b95f983c232c1bd628e9cba0695d3ef229d290b/dask/dataframe/io/sql.py#L130

NB. I filed a GitHub issue: https://github.com/dask/dask/issues/9383
