
Map function taking too much time (Pandas DataFrame)

I have a pandas DataFrame with 12,000,000 rows and 2 columns. I need to apply a map function to it, but it takes a very long time even though all it does is compare every date in column 1 to a given date, for example today.

Example of the DataFrame

╔════════════╦══════════╗
║    Col1    ║   Col2   ║
╠════════════╬══════════╣
║ 2019-03-19 ║        1 ║
║ 2019-03-20 ║        2 ║
║ 2019-05-15 ║        3 ║
║ 2019-07-15 ║        4 ║
║ ...        ║          ║
║ 2019-10-20 ║ 12000000 ║
╚════════════╩══════════╝

Example of the code

import pandas as pd
from datetime import datetime

df = pd.read_csv('path_of_file.csv')
today = datetime.now()
df['output'] = df['Col1'].apply(lambda x: 1 if x > today else 0) 

Am I missing something? Could it be improved? Thank you!

EDIT - See wwii solution

wwii solution is the clear winner out of the OP's and mine.

His solution runs 2x faster than my own:

df['output'] = 1 * (df['Col1'] > today)

It's a pretty neat one too: the comparison produces a boolean Series, and multiplying it by 1 turns True into 1 and False into 0, so the output is simply the truth value of comparing the date column with today's date.
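As a small sketch of why the multiplication works, the snippet below builds the boolean mask once and converts it to 1/0 in three equivalent ways. The three-row frame and the fixed `cutoff` date (standing in for "today" so the result is reproducible) are assumptions made for illustration, not part of the original question.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame; the column name matches the question.
df = pd.DataFrame(
    {"Col1": pd.to_datetime(["2019-03-19", "2019-07-15", "2030-01-01"])}
)

# Fixed stand-in for "today" (an assumption, so the output never changes).
cutoff = pd.Timestamp("2020-01-01")

# The comparison is evaluated for the whole column at once.
mask = df["Col1"] > cutoff           # boolean Series: False, False, True

by_multiplication = 1 * mask         # True/False become 1/0
by_astype = mask.astype(int)         # same values, with an explicit cast
by_where = np.where(mask, 1, 0)      # same values, as a NumPy array

print(by_multiplication.tolist())    # [0, 0, 1]
```

Which spelling to use is mostly a matter of taste; `astype(int)` arguably states the intent most directly.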


This was a really interesting question, so I ran some tests on my end.

I created a DataFrame with roughly 1 million rows of dates.

from datetime import datetime, timedelta

import pandas as pd

# One row per day between these two dates (roughly 1 million rows)
starting_date = datetime(200, 1, 1)
end_date = datetime(3000, 1, 1)

def daterange(start_date, end_date):
    # Yield every day from start_date up to, but not including, end_date
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)

date_values = list(daterange(starting_date, end_date))

date_col = {'Col1': date_values}
df = pd.DataFrame(date_col)

We're going into the future boys.

Now, the two tests I ran compared the function run time of the solution the OP provided, and the solution I posted below.

We are assuming the dates are in order

Test 1 - OP's solution

import time

today = datetime.now()

start_time = time.time()

df['output'] = df['Col1'].apply(lambda x: 1 if x > today else 0)

print("--- %s seconds ---" % (time.time() - start_time))

Test 2 - My solution

start_time = time.time()

df['output'] = 1

df.loc[df['Col1'] < today, 'output'] = 0

print("--- %s seconds ---" % (time.time() - start_time))

The results

After running each function 10 times, the second solution won each time. Why?

My best guess is that .apply calls the Python lambda once per row, while the condition in the second solution is evaluated for the whole column in a single vectorized operation. Both still have to look at every row; the second one just does far less work per row.

Solution 1
--- 0.36346006393432617 seconds ---
Solution 2
--- 0.13942289352416992 seconds ---
Solution 1
--- 0.4605379104614258 seconds ---
Solution 2
--- 0.12388873100280762 seconds ---
Solution 1
--- 0.34688305854797363 seconds ---
Solution 2
--- 0.0912778377532959 seconds ---
Solution 1
--- 0.2879600524902344 seconds ---
Solution 2
--- 0.08435988426208496 seconds ---
Solution 1
--- 0.3161609172821045 seconds ---
Solution 2
--- 0.0965569019317627 seconds ---
Solution 1
--- 0.31951212882995605 seconds ---
Solution 2
--- 0.08857107162475586 seconds ---
Solution 1
--- 0.2996959686279297 seconds ---
Solution 2
--- 0.16647815704345703 seconds ---
Solution 1
--- 0.5074219703674316 seconds ---
Solution 2
--- 0.13281011581420898 seconds ---
Solution 1
--- 0.3716299533843994 seconds ---
Solution 2
--- 0.0970299243927002 seconds ---
Solution 1
--- 0.29851794242858887 seconds ---
Solution 2
--- 0.08089780807495117 seconds ---
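For anyone who wants to reproduce this kind of comparison, here is a minimal timing sketch using `timeit` instead of manual `time.time()` calls. It is not the exact benchmark above: the row count, the random dates (kept inside the range pandas can store natively as datetime64), the fixed `cutoff`, and the helper names `with_apply` and `with_mask` are all assumptions chosen for illustration. Absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

n = 200_000  # smaller than the 1 million rows above, to keep the run short
rng = np.random.default_rng(0)

# Random day offsets from 2000-01-01, spanning roughly 55 years.
days = rng.integers(0, 20_000, size=n)
df = pd.DataFrame(
    {"Col1": pd.Timestamp("2000-01-01") + pd.to_timedelta(days, unit="D")}
)

cutoff = pd.Timestamp("2020-01-01")  # fixed stand-in for "today"

def with_apply():
    # Calls the Python lambda once per row.
    return df["Col1"].apply(lambda x: 1 if x > cutoff else 0)

def with_mask():
    # Compares the whole column at once, then casts True/False to 1/0.
    return (df["Col1"] > cutoff).astype(int)

t_apply = timeit.timeit(with_apply, number=3)
t_mask = timeit.timeit(with_mask, number=3)

print("apply: %.4f seconds for 3 runs" % t_apply)
print("mask:  %.4f seconds for 3 runs" % t_mask)
```

Both functions return the same values, so any difference in the printed times comes from how the comparison is executed rather than from what is computed.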

Something to consider: the dates in both tests are in order. What happens if you receive them in completely random order?

We first randomize the dataset:

df = df.sample(frac=1)

Then run the exact same tests.

Solution 1
--- 0.6548967361450195 seconds ---
Solution 2
--- 0.22769808769226074 seconds ---
Solution 1
--- 0.7096188068389893 seconds ---
Solution 2
--- 0.28220510482788086 seconds ---
Solution 1
--- 0.7588798999786377 seconds ---
Solution 2
--- 0.25870585441589355 seconds ---
Solution 1
--- 0.6285257339477539 seconds ---
Solution 2
--- 0.3373727798461914 seconds ---
Solution 1
--- 0.7623891830444336 seconds ---
Solution 2
--- 0.18880391120910645 seconds ---
Solution 1
--- 0.5125689506530762 seconds ---
Solution 2
--- 0.23384499549865723 seconds ---
Solution 1
--- 0.6188468933105469 seconds ---
Solution 2
--- 0.25000977516174316 seconds ---
Solution 1
--- 0.6692302227020264 seconds ---
Solution 2
--- 0.5207180976867676 seconds ---
Solution 1
--- 1.2534172534942627 seconds ---
Solution 2
--- 0.2665679454803467 seconds ---
Solution 1
--- 0.6374101638793945 seconds ---
Solution 2
--- 0.2108619213104248 seconds ---

The solution

Since all you're doing is checking whether the date is earlier than today's date, you can create a new column and fill it with a constant of either 1 or 0.

Let's first add the constant to the column.

df['Output'] = 1

Now, all we have to do is find the point where the date is less than the current date.

First though, we should change the data type of Col1 to datetime, to make sure we can do proper comparisons.

df['Col1'] = pd.to_datetime(df['Col1'], format="%Y-%m-%d")

Then we select every date that's earlier than today and change the output to 0.

df.loc[df['Col1'] < pd.Timestamp(today.date()), 'Output'] = 0
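Putting these steps together, a minimal end-to-end sketch might look like the following. The in-memory CSV text is a hypothetical stand-in for the file in the question, and parsing the dates while reading (via `parse_dates`) is one option among several; calling `pd.to_datetime` afterwards, as above, would work as well.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file from the question.
csv_text = "Col1,Col2\n2019-03-19,1\n2019-03-20,2\n2100-01-01,3\n"

# Parse Col1 as datetimes while reading, instead of leaving it as strings.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Col1"])

# Midnight at the start of today, so the comparison is date against date.
today = pd.Timestamp.today().normalize()

# Start with 1 everywhere, then set 0 for every date earlier than today.
df["Output"] = 1
df.loc[df["Col1"] < today, "Output"] = 0

print(df)
```

The last two assignments could also be collapsed into a single expression, `(df["Col1"] >= today).astype(int)`, which gives the same column.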

While we're still awaiting some more information on the problem, here is what I have so far:

import pandas as pd


df = pd.DataFrame(
    data={
        "col_1": ["2019-03-19", "2019-03-20", "2030-01-01", "2019-05-15", "2019-07-15"],
        "col_2": [1, 2, 3, 4, 5],
    }
)

df["col_1"] = pd.to_datetime(df["col_1"], utc=True)

print(df, end='\n\n')

curr_time = pd.Timestamp.now(tz="UTC")

print(curr_time, end='\n\n')

df["col_3"] = df["col_1"] > curr_time

print(df)

Output:

                      col_1  col_2
0 2019-03-19 00:00:00+00:00      1
1 2019-03-20 00:00:00+00:00      2
2 2030-01-01 00:00:00+00:00      3
3 2019-05-15 00:00:00+00:00      4
4 2019-07-15 00:00:00+00:00      5

2020-02-12 02:11:37.212849+00:00

                      col_1  col_2  col_3
0 2019-03-19 00:00:00+00:00      1  False
1 2019-03-20 00:00:00+00:00      2  False
2 2030-01-01 00:00:00+00:00      3   True
3 2019-05-15 00:00:00+00:00      4  False
4 2019-07-15 00:00:00+00:00      5  False

