Faster way of creating a pandas dataframe from another dataframe
I have a dataframe with over 41500 records and 3 fields: ID, start_date and end_date.
I want to create a separate dataframe from it with just 2 fields, ID and active_years, containing one record per identifier for every year between its start year and end year (end year inclusive).
This is what I'm doing right now, but for 41500 rows it takes more than 2 hours to finish.
df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0
for _, row in raw_dataset.iterrows():
    st_yr = int(row['start_date'].split('-')[0])  # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])
    for year in range(st_yr, end_yr+1):
        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1
So is there any faster way to achieve this?
[EDIT] Some example data to work with:
raw_dataset = pd.DataFrame({'ID':['a121','b142','cd3'],'start_date':['2019-10-09','2017-02-06','2012-12-05'],'end_date':['2020-01-30','2019-08-23','2016-06-18']})
print(raw_dataset)
ID start_date end_date
0 a121 2019-10-09 2020-01-30
1 b142 2017-02-06 2019-08-23
2 cd3 2012-12-05 2016-06-18
# the desired dataframe should look like this
print(desired_df)
id active_years
0 a121 2019
1 a121 2020
2 b142 2017
3 b142 2018
4 b142 2019
5 cd3 2012
6 cd3 2013
7 cd3 2014
8 cd3 2015
9 cd3 2016
Dynamically growing Python lists is much faster than dynamically growing numpy arrays (which are the underlying data structure of pandas dataframes). See here for a brief explanation. With that in mind:
import pandas as pd
# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})
# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year
# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)
# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
print(desired_df)
# Output:
# id active_years
# 0 a121 2019
# 1 a121 2020
# 2 b142 2017
# 3 b142 2018
# 4 b142 2019
# 5 cd3 2012
# 6 cd3 2013
# 7 cd3 2014
# 8 cd3 2015
# 9 cd3 2016
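As a possible alternative to the explicit loops, the same expansion can be sketched with `DataFrame.explode` (available from pandas 0.25). This is not part of the answer above, only an illustration; it assumes every date string starts with a four-digit year and that each end year is not earlier than its start year.

```python
import pandas as pd

raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Assumption: dates are yyyy-mm-dd strings, so the first 4 characters are the year
start_year = raw_dataset['start_date'].str[:4].astype(int)
end_year = raw_dataset['end_date'].str[:4].astype(int)

# One list of years per input row (end year inclusive)
desired_df = pd.DataFrame({'id': raw_dataset['ID']})
desired_df['active_years'] = [
    list(range(s, e + 1)) for s, e in zip(start_year, end_year)
]

# explode() turns each list element into its own row
desired_df = desired_df.explode('active_years').reset_index(drop=True)
# explode() leaves an object column, so convert back to integers
desired_df['active_years'] = desired_df['active_years'].astype(int)
print(desired_df)
```

Whether this beats the list-append loop on 41500 rows would need to be measured on the real data.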
The originally intended method using string.split might be even faster when combined with Xukrao's approach:
import timeit
setup = """
import pandas as pd
raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})
"""
code0 = """
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
"""
code1 = """
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    y0, y1 = int(row.start_date.split('-')[0]), int(row.end_date.split('-')[0])
    for year in range(y0, y1+1):  # +1 so the end year is included, as in code0
        id_list.append(row.ID)
        active_years_list.append(year)
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
"""
n = 1000
t0 = timeit.timeit(stmt=code0, setup=setup, number=n)
t1 = timeit.timeit(stmt=code1, setup=setup, number=n)
print(f"/w datetime conv.: {t0/n:.3E}, stringsplit: {t1/n:.3E}, ratio: {t0/t1:.2f}")
# /w datetime conv.: 2.340E-03, stringsplit: 8.047E-04, ratio: 2.91
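A timing comparison like this only means something if both variants build the same dataframe, so it may be worth checking that first. The sketch below does so on the sample data; `expand` is a hypothetical helper introduced here for illustration, not something from the answers above.

```python
import pandas as pd

raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

def expand(ids, start_years, end_years):
    """Hypothetical helper: build (id, year) rows with plain lists, end year inclusive."""
    id_list, active_years_list = [], []
    for id_, y0, y1 in zip(ids, start_years, end_years):
        for year in range(y0, y1 + 1):
            id_list.append(id_)
            active_years_list.append(year)
    return pd.DataFrame({'id': id_list, 'active_years': active_years_list})

# Variant 1: years extracted via datetime conversion
via_datetime = expand(
    raw_dataset['ID'],
    pd.to_datetime(raw_dataset['start_date']).dt.year,
    pd.to_datetime(raw_dataset['end_date']).dt.year,
)

# Variant 2: years extracted via string split
via_split = expand(
    raw_dataset['ID'],
    [int(d.split('-')[0]) for d in raw_dataset['start_date']],
    [int(d.split('-')[0]) for d in raw_dataset['end_date']],
)

# Both variants should produce identical rows before their timings are compared
same = via_datetime.equals(via_split)
print(same, len(via_split))
```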