Map function is taking too much time (Pandas DataFrame)

Question

I have a pandas DataFrame with the following shape: 12,000,000 x 2 (rows x columns). I need to apply a map function, but it takes a very long time, even though all it has to do is compare every date in column 1 to a given date, for example today.

Example of the DataFrame

╔════════════╦══════════╗
║    Col1    ║   Col2   ║
╠════════════╬══════════╣
║ 2019-03-19 ║        1 ║
║ 2019-03-20 ║        2 ║
║ 2019-05-15 ║        3 ║
║ 2019-07-15 ║        4 ║
║ ...        ║          ║
║ 2019-10-20 ║ 12000000 ║
╚════════════╩══════════╝

Example of the code

import pandas as pd
from datetime import datetime

df = pd.read_csv('path_of_file.csv')
today = datetime.now()
df['output'] = df['Col1'].apply(lambda x: 1 if x > today else 0)

Am I missing something? Could it be improved? Thank you!

Answer 1

EDIT - See wwii's solution

wwii's solution is the clear winner over both the OP's and mine.

His solution runs 2x faster than my own:

df['output'] = 1 * (df['Col1'] > today)

It's a pretty neat one too: comparing the date column with today's date gives a True/False value for every row, and multiplying that by 1 turns it into 1 or 0.
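To make the mechanics concrete, here is a minimal sketch on a four-row frame. The sample dates and the fixed `today` value are made up for illustration and are not from the original post. It also shows `.astype(int)`, which should give the same result as multiplying by 1 while stating the intent more explicitly.

```python
import pandas as pd

# Illustrative data only; a fixed "today" keeps the example reproducible.
df = pd.DataFrame(
    {"Col1": pd.to_datetime(["2019-03-19", "2019-03-20", "2030-01-01", "2019-07-15"])}
)
today = pd.Timestamp("2020-02-12")

mask = df["Col1"] > today            # boolean Series: one True/False per row
df["output"] = 1 * mask              # True -> 1, False -> 0
df["output_alt"] = mask.astype(int)  # same values, written as an explicit cast

print(df)
```

Either form avoids calling a Python function once per row, which is where the original `apply` version spends most of its time.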

This was a really interesting question, so I ran some tests on my end.

I created a dataframe with roughly 1 million rows of dates, one per day from the year 200 to the year 3000.

import time
from datetime import datetime, timedelta

import pandas as pd

starting_date = datetime(200, 1, 1, 0, 0)
end_date = datetime(3000, 1, 1, 0, 0)
today = datetime.now()  # reference date used by both tests below

def daterange(start_date, end_date):
    # Yield one datetime per day, from start_date up to (not including) end_date.
    for n in range((end_date - start_date).days):
        yield start_date + timedelta(n)

date_values = list(daterange(starting_date, end_date))

date_col = {'Col1': date_values}
df = pd.DataFrame(date_col)

We're going into the future, boys.

Now, the two tests I ran compare the run time of the solution the OP provided with the run time of the solution I post below.

We are assuming the dates are in order

Test 1 - OP's solution

start_time = time.time()

df['output'] = df['Col1'].apply(lambda x: 1 if x > today else 0) 

print("--- %s seconds ---" % (time.time() - start_time))

Test 2 - My solution

start_time = time.time()

df['output'] = 1

df.loc[df['Col1'] < today, 'output'] = 0

print("--- %s seconds ---" % (time.time() - start_time))

The results

After running each function 10 times, the second solution won each time. Why? Honestly I have no idea.

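My best guess is that the difference comes from vectorization. The first solution calls the Python lambda once per row through apply, whereas the comparison and the conditional assignment in the second solution are each carried out as a single operation over the whole column. Both approaches still have to look at every row, but the second avoids most of the per-row Python overhead.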
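A rough way to check this on your own machine is sketched below. It uses a smaller synthetic column than the tests above (50,000 daily dates and a fixed reference date, both chosen only for illustration) and `timeit` instead of manual `time.time()` calls. Absolute timings will vary, so none are shown here.

```python
import timeit

import pandas as pd

# Synthetic, illustrative data: 50,000 consecutive days starting in 2000.
df = pd.DataFrame({"Col1": pd.date_range("2000-01-01", periods=50_000, freq="D")})
today = pd.Timestamp("2050-01-01")  # fixed so the run is reproducible


def with_apply():
    # One Python-level lambda call per row.
    return df["Col1"].apply(lambda x: 1 if x > today else 0)


def vectorized():
    # One comparison over the whole column, then a cast to integers.
    return (df["Col1"] > today).astype(int)


# Both approaches must agree before their timings are worth comparing.
assert with_apply().tolist() == vectorized().tolist()

print("apply:     ", timeit.timeit(with_apply, number=3))
print("vectorized:", timeit.timeit(vectorized, number=3))
```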

Solution 1
--- 0.36346006393432617 seconds ---
Solution 2
--- 0.13942289352416992 seconds ---
Solution 1
--- 0.4605379104614258 seconds ---
Solution 2
--- 0.12388873100280762 seconds ---
Solution 1
--- 0.34688305854797363 seconds ---
Solution 2
--- 0.0912778377532959 seconds ---
Solution 1
--- 0.2879600524902344 seconds ---
Solution 2
--- 0.08435988426208496 seconds ---
Solution 1
--- 0.3161609172821045 seconds ---
Solution 2
--- 0.0965569019317627 seconds ---
Solution 1
--- 0.31951212882995605 seconds ---
Solution 2
--- 0.08857107162475586 seconds ---
Solution 1
--- 0.2996959686279297 seconds ---
Solution 2
--- 0.16647815704345703 seconds ---
Solution 1
--- 0.5074219703674316 seconds ---
Solution 2
--- 0.13281011581420898 seconds ---
Solution 1
--- 0.3716299533843994 seconds ---
Solution 2
--- 0.0970299243927002 seconds ---
Solution 1
--- 0.29851794242858887 seconds ---
Solution 2
--- 0.08089780807495117 seconds ---

Something to consider: the dates in both tests are in order. What happens if you receive them in completely random order?

We first shuffle the dataset:

df = df.sample(frac=1)

Then run the exact same tests.

Solution 1
--- 0.6548967361450195 seconds ---
Solution 2
--- 0.22769808769226074 seconds ---
Solution 1
--- 0.7096188068389893 seconds ---
Solution 2
--- 0.28220510482788086 seconds ---
Solution 1
--- 0.7588798999786377 seconds ---
Solution 2
--- 0.25870585441589355 seconds ---
Solution 1
--- 0.6285257339477539 seconds ---
Solution 2
--- 0.3373727798461914 seconds ---
Solution 1
--- 0.7623891830444336 seconds ---
Solution 2
--- 0.18880391120910645 seconds ---
Solution 1
--- 0.5125689506530762 seconds ---
Solution 2
--- 0.23384499549865723 seconds ---
Solution 1
--- 0.6188468933105469 seconds ---
Solution 2
--- 0.25000977516174316 seconds ---
Solution 1
--- 0.6692302227020264 seconds ---
Solution 2
--- 0.5207180976867676 seconds ---
Solution 1
--- 1.2534172534942627 seconds ---
Solution 2
--- 0.2665679454803467 seconds ---
Solution 1
--- 0.6374101638793945 seconds ---
Solution 2
--- 0.2108619213104248 seconds ---
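Shuffling changes the timings, but it should not change the computed flags. The small sketch below checks that on synthetic data. The 1,000-row frame, the fixed reference date and the `random_state` seed are assumptions made only so the example is reproducible.

```python
import pandas as pd

# Illustrative data: 1,000 consecutive days starting in 2019.
df = pd.DataFrame({"Col1": pd.date_range("2019-01-01", periods=1000, freq="D")})
today = pd.Timestamp("2020-02-12")

ordered = (df["Col1"] > today).astype(int)

# Same shuffle as in the test above, with a fixed seed for reproducibility.
shuffled_df = df.sample(frac=1, random_state=0)
shuffled = (shuffled_df["Col1"] > today).astype(int)

# sample() keeps the original index labels, so sorting by index undoes the shuffle.
same = shuffled.sort_index().equals(ordered)
print(same)
```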

The solution

Since all you're doing is checking whether the date is less than today's date, create a new column and fill it with a constant of either 1 or 0.

Let's first add the constant to the column.

df['Output'] = 1

Now, all we have to do is find the rows where the date is less than the current date.

First though, we should convert Col1 to a datetime dtype, to make sure we can do proper comparisons.

df['Col1'] = pd.to_datetime(df['Col1'], format="%Y-%m-%d")

Then, we select every date that's less than today, and change the output to 0.

df.loc[df['Col1'] < pd.Timestamp(today.date()), 'Output'] = 0
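Put together, the steps above might look like the following sketch. The in-memory CSV stands in for 'path_of_file.csv', its rows are illustrative, and the reference date is fixed so the result is reproducible; in real use you would take it from `pd.Timestamp.now()`. It uses `<=` rather than `<` so that the result matches the OP's "1 if x > today else 0" exactly.

```python
import io

import pandas as pd

# Stand-in for the real file; the rows are illustrative only.
csv_text = (
    "Col1,Col2\n"
    "2019-03-19,1\n"
    "2019-03-20,2\n"
    "2019-05-15,3\n"
    "2019-07-15,4\n"
    "2030-01-01,5\n"
)

df = pd.read_csv(io.StringIO(csv_text))

# read_csv leaves Col1 as strings here, so convert it before comparing.
df["Col1"] = pd.to_datetime(df["Col1"], format="%Y-%m-%d")

# Fixed reference date for the example (an assumption, not the real "today").
today = pd.Timestamp("2020-02-12")

df["Output"] = 1                            # start with the constant
df.loc[df["Col1"] <= today, "Output"] = 0   # overwrite rows that are not in the future

print(df)
```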

Answer 2

While we're still awaiting some more information on the problem, here is what I have so far:

import pandas as pd


df = pd.DataFrame(
    data={
        "col_1": ["2019-03-19", "2019-03-20", "2030-01-01", "2019-05-15", "2019-07-15"],
        "col_2": [1, 2, 3, 4, 5],
    }
)

df["col_1"] = pd.to_datetime(df["col_1"], utc=True)

print(df, end='\n\n')

curr_time = pd.Timestamp.utcnow()

print(curr_time, end='\n\n')

df["col_3"] = df["col_1"] > curr_time

print(df)

Output:

                      col_1  col_2
0 2019-03-19 00:00:00+00:00      1
1 2019-03-20 00:00:00+00:00      2
2 2030-01-01 00:00:00+00:00      3
3 2019-05-15 00:00:00+00:00      4
4 2019-07-15 00:00:00+00:00      5

2020-02-12 02:11:37.212849+00:00

                      col_1  col_2  col_3
0 2019-03-19 00:00:00+00:00      1  False
1 2019-03-20 00:00:00+00:00      2  False
2 2030-01-01 00:00:00+00:00      3   True
3 2019-05-15 00:00:00+00:00      4  False
4 2019-07-15 00:00:00+00:00      5  False
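One detail worth noting about this approach: because the column is parsed with utc=True, the value it is compared against has to be timezone-aware as well. The sketch below uses two made-up dates and a fixed timestamp in place of `pd.Timestamp.utcnow()` to illustrate this, and also casts the boolean result to the 1/0 integers the question asked for. As far as I know, pandas raises a TypeError when an ordering comparison mixes tz-aware and tz-naive values, which is what the try/except demonstrates.

```python
import pandas as pd

# Illustrative values only.
col = pd.to_datetime(pd.Series(["2019-03-19", "2030-01-01"]), utc=True)

aware_now = pd.Timestamp("2020-02-12", tz="UTC")  # fixed stand-in for utcnow()
naive_now = pd.Timestamp("2020-02-12")            # same instant label, no timezone

# tz-aware column vs tz-aware timestamp: works, then cast True/False to 1/0.
flags = (col > aware_now).astype(int)
print(flags.tolist())

# tz-aware column vs tz-naive timestamp: expected to be rejected.
try:
    col > naive_now
    mixed_ok = True
except TypeError:
    mixed_ok = False
print(mixed_ok)
```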
