Faster way to create a pandas dataframe from another dataframe

Question

I have a dataframe with over 41500 records and 3 fields: ID, start_date and end_date.

I want to create a separate dataframe from it with just 2 fields, ID and active_years, containing one record per identifier for every year in the range from the start year to the end year (end year inclusive).

This is what I'm doing right now, but for 41500 rows it takes more than 2 hours to finish.

import pandas as pd

df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0

for _, row in raw_dataset.iterrows():

    st_yr = int(row['start_date'].split('-')[0]) # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])

    for year in range(st_yr, end_yr+1):

        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1

So is there any faster way to achieve this?

[EDIT] Some example data to work with:

raw_dataset = pd.DataFrame({'ID':['a121','b142','cd3'],'start_date':['2019-10-09','2017-02-06','2012-12-05'],'end_date':['2020-01-30','2019-08-23','2016-06-18']})

print(raw_dataset)
     ID  start_date    end_date
0  a121  2019-10-09  2020-01-30
1  b142  2017-02-06  2019-08-23
2   cd3  2012-12-05  2016-06-18

# the desired dataframe should look like this
print(desired_df)
     id  active_years
0  a121  2019
1  a121  2020
2  b142  2017
3  b142  2018
4  b142  2019
5   cd3  2012
6   cd3  2013
7   cd3  2014
8   cd3  2015
9   cd3  2016

Answer 1

Dynamically growing Python lists is much faster than dynamically growing NumPy arrays (which are the underlying data structure of pandas dataframes). See here for a brief explanation. With that in mind:

import pandas as pd

# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID':['a121','b142','cd3'],
    'start_date':['2019-10-09','2017-02-06','2012-12-05'],
    'end_date':['2020-01-30','2019-08-23','2016-06-18'],
})

# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year

# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)

# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})

print(desired_df)
# Output:
#     id  active_years
# 0  a121          2019
# 1  a121          2020
# 2  b142          2017
# 3  b142          2018
# 4  b142          2019
# 5   cd3          2012
# 6   cd3          2013
# 7   cd3          2014
# 8   cd3          2015
# 9   cd3          2016
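
If the Python-level loop itself becomes the bottleneck, the same expansion can probably be done without iterating over rows at all. The following is only a sketch, not part of the original answer: it assumes every date is a well-formed yyyy-mm-dd string and that no end year precedes its start year (np.repeat rejects negative counts). The names counts, offsets and vectorized_df are made up for illustration.

```python
import numpy as np
import pandas as pd

raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Years as integers, taken from the first four characters of each date string
# (assumption: all dates are formatted yyyy-mm-dd)
start_year = raw_dataset['start_date'].str[:4].astype(int).to_numpy()
end_year = raw_dataset['end_date'].str[:4].astype(int).to_numpy()

# Number of active years per input row, end year inclusive
counts = end_year - start_year + 1

# Repeat each ID once per active year
ids = np.repeat(raw_dataset['ID'].to_numpy(), counts)

# Position of each output row within its own ID's run: 0, 1, ..., counts[i] - 1
run_starts = np.cumsum(counts) - counts
offsets = np.arange(counts.sum()) - np.repeat(run_starts, counts)

# Start year of the run plus the offset gives the active year
years = np.repeat(start_year, counts) + offsets

vectorized_df = pd.DataFrame({'id': ids, 'active_years': years})
print(vectorized_df)
```

Whether this beats the list-based loop on 41500 rows would need to be measured; it trades readability for fewer Python-level iterations.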

Answer 2

The string-splitting method from the question (str.split) might be even faster when combined with Xukrao's approach:

import timeit

setup = """
import pandas as pd
raw_dataset = pd.DataFrame({
    'ID':['a121','b142','cd3'],
    'start_date':['2019-10-09','2017-02-06','2012-12-05'],
    'end_date':['2020-01-30','2019-08-23','2016-06-18'],
})
"""

code0 = """
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
"""

code1 = """
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    y0, y1 = int(row.start_date.split('-')[0]), int(row.end_date.split('-')[0])
    for year in range(y0, y1 + 1):  # +1 so the end year is included, as in code0
        id_list.append(row.ID)
        active_years_list.append(year)
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
"""

n = 1000
t0 = timeit.timeit(stmt=code0, setup=setup, number=n)
t1 = timeit.timeit(stmt=code1, setup=setup, number=n)

print(f"w/ datetime conv.: {t0/n:.3E}, stringsplit: {t1/n:.3E}, ratio: {t0/t1:.2f}")
# w/ datetime conv.: 2.340E-03, stringsplit: 8.047E-04, ratio: 2.91
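
A timing comparison is only meaningful if both variants produce the same rows, so it may be worth checking that first. The sketch below is an illustration rather than part of the original answer: the helper names years_via_datetime and years_via_split are hypothetical, and both treat the end year as inclusive, as the question requires.

```python
import pandas as pd

raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

def years_via_datetime(df):
    """(id, year) pairs, with years extracted through pd.to_datetime."""
    start = pd.to_datetime(df['start_date']).dt.year
    end = pd.to_datetime(df['end_date']).dt.year
    return [
        (id_, year)
        for id_, s, e in zip(df['ID'], start, end)
        for year in range(s, e + 1)  # end year inclusive
    ]

def years_via_split(df):
    """(id, year) pairs, with years extracted by splitting the date strings."""
    return [
        (id_, year)
        for id_, s, e in zip(df['ID'], df['start_date'], df['end_date'])
        for year in range(int(s.split('-')[0]), int(e.split('-')[0]) + 1)
    ]

pairs_datetime = years_via_datetime(raw_dataset)
pairs_split = years_via_split(raw_dataset)
same_result = pairs_datetime == pairs_split
print(same_result, len(pairs_split))
```

Note that the three-row example frame is very small, so the ratio measured above may not carry over to the full 41500-row dataset.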
