Airflow error with pandas: AttributeError: 'Pendulum' object has no attribute 'nanosecond'

I have a pandas.DataFrame df whose df.index yields something like this:

DatetimeIndex(['2014-10-06 00:55:11.357899904',
               '2014-10-06 00:56:39.046799898',
               '2014-10-06 00:56:39.057499886',
               '2014-10-06 00:56:40.684299946',
               '2014-10-06 00:56:41.115299940',
               '2014-10-06 01:03:52.764300108',
               '2014-10-06 01:21:18.448499918',
               '2014-10-06 01:21:18.457200050',
               '2014-10-06 01:21:18.584199905',
               '2014-10-06 01:21:18.594700098',
               ...
               '2014-11-05 00:25:47.996000051',
               '2014-11-05 00:56:45.081799984',
               '2014-11-05 00:56:45.096899986',
               '2014-11-05 05:50:57.639699936',
               '2014-11-05 06:08:56.365000010',
               '2014-11-05 06:11:20.519099950',
               '2014-11-05 06:15:03.470400095',
               '2014-11-05 06:15:03.981600046',
               '2014-11-05 06:25:31.514300108',
               '2014-11-05 06:25:59.310400009'],
              dtype='datetime64[ns]', name='time', length=1000, freq=None)

I am running a DAG on Airflow, which stops at the line df.loc[start_date:end_date] with this error:

AttributeError: 'Pendulum' object has no attribute 'nanosecond'

I cannot reproduce the error without running the code in Airflow. The same code runs just fine without Airflow.

The start_date is the Airflow macro execution_date and the end_date is the macro next_execution_date.

I guess the issue is that the date-time dtype of the df index is not compatible with the types of start_date and end_date, but I have no idea how to address it.

I tried removing time zones and changing the dtype, but nothing worked.

After some searching, I found the source of the problem and a solution.

the problem

The issue is caused by the two macros passed down from Airflow:

  • start_date, which is the execution_date macro

  • end_date, which is the next_execution_date macro

Both are of type pendulum.datetime, not datetime.datetime as the Airflow documentation says. This is what causes the clash with the pandas.DataFrame.

pandas and pendulum currently don't work well together, and the problem is well described in this StackOverflow answer.

the solution

The solution seems to be to convert start_date and end_date from pendulum.datetime to datetime.datetime.

For this I wrote a simple function that converts the pendulum date to a string before parsing it back into a datetime.datetime. I am sure there are better ways to do it, but this was simple and safe, which is why I used it.

Here is the function itself:

from datetime import datetime


def pendulum_to_datetime(pendulum_date):
    """
    Convert pendulum to datetime format.

    The conversion is done from pendulum -> string -> datetime.

    Args:
        pendulum_date (pendulum): The date you wish to convert.

    Returns:
        (datetime) The converted date.
    """
    # The format keeps the UTC offset (%z) but has no sub-second field.
    fmt = '%Y-%m-%dT%H:%M:%S%z'
    string_date = pendulum_date.strftime(fmt)
    return datetime.strptime(string_date, fmt)
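
As one possible alternative to the string round trip, here is a minimal sketch that rebuilds a plain datetime.datetime field by field. It assumes the Pendulum object exposes the usual datetime attributes (year, month, ..., microsecond, tzinfo). FakePendulum and to_plain_datetime are hypothetical names used only for illustration; FakePendulum is a stand-in so the example runs without Airflow or pendulum installed.

```python
from datetime import datetime, timezone


class FakePendulum(datetime):
    """Hypothetical stand-in for a Pendulum object (a datetime subclass)."""


def to_plain_datetime(dt):
    """Rebuild a plain datetime.datetime from a datetime-like object."""
    return datetime(
        dt.year, dt.month, dt.day,
        dt.hour, dt.minute, dt.second, dt.microsecond,
        tzinfo=dt.tzinfo,  # keep the original time zone, if any
    )


# In a DAG, start_date would be execution_date; here it is a made-up value.
start_date = FakePendulum(2014, 10, 6, 0, 55, 11, 357899, tzinfo=timezone.utc)
plain = to_plain_datetime(start_date)
```

Unlike the format string above, which has no sub-second field, this keeps the microseconds. Whether that matters depends on how precise the slice boundaries need to be.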
