Faster way of creating a pandas dataframe from another dataframe

I have a dataframe with over 41,500 records and 3 fields: ID, start_date and end_date.

I want to create a separate dataframe from it with just 2 fields: ID and active_years. It should contain one record per identifier for every year between that identifier's start year and end year (end year inclusive).

This is what I'm doing right now, but for 41,500 rows it takes more than 2 hours to finish.

df = pd.DataFrame(columns=['id', 'active_years'])
ix = 0

for _, row in raw_dataset.iterrows():

    st_yr = int(row['start_date'].split('-')[0]) # because dates are in the format yyyy-mm-dd
    end_yr = int(row['end_date'].split('-')[0])

    for year in range(st_yr, end_yr+1):

        df.loc[ix, 'id'] = row['ID']
        df.loc[ix, 'active_years'] = year
        ix = ix + 1

So is there a faster way to achieve this?

[EDIT] Some example data to work with:

raw_dataset = pd.DataFrame({'ID':['a121','b142','cd3'],'start_date':['2019-10-09','2017-02-06','2012-12-05'],'end_date':['2020-01-30','2019-08-23','2016-06-18']})

print(raw_dataset)
     ID  start_date    end_date
0  a121  2019-10-09  2020-01-30
1  b142  2017-02-06  2019-08-23
2   cd3  2012-12-05  2016-06-18

# the desired dataframe should look like this
print(desired_df)
     id  active_years
0  a121  2019
1  a121  2020
2  b142  2017
3  b142  2018
4  b142  2019
5   cd3  2012
6   cd3  2013
7   cd3  2014
8   cd3  2015
9   cd3  2016

Dynamically growing Python lists is much faster than dynamically growing numpy arrays (which are the underlying data structure of pandas dataframes). See here for a brief explanation. With that in mind:

import pandas as pd

# Initialize input dataframe
raw_dataset = pd.DataFrame({
    'ID':['a121','b142','cd3'],
    'start_date':['2019-10-09','2017-02-06','2012-12-05'],
    'end_date':['2020-01-30','2019-08-23','2016-06-18'],
})

# Create integer columns for start year and end year
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year

# Iterate over input dataframe rows and individual years
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)

# Create result dataframe from lists
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})

print(desired_df)
# Output:
#     id  active_years
# 0  a121          2019
# 1  a121          2020
# 2  b142          2017
# 3  b142          2018
# 4  b142          2019
# 5   cd3          2012
# 6   cd3          2013
# 7   cd3          2014
# 8   cd3          2015
# 9   cd3          2016
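
If the Python-level loop is still too slow, the expansion can in principle be done without any explicit loop. The following is only a sketch of one possible alternative, not part of the answer above: it assumes every date string is in yyyy-mm-dd format and that each end year is greater than or equal to its start year. The names `counts`, `offsets` and `vectorized_df` are made up for illustration.

```python
import numpy as np
import pandas as pd

raw_dataset = pd.DataFrame({
    'ID': ['a121', 'b142', 'cd3'],
    'start_date': ['2019-10-09', '2017-02-06', '2012-12-05'],
    'end_date': ['2020-01-30', '2019-08-23', '2016-06-18'],
})

# Assumption: dates are yyyy-mm-dd strings, so the year is the first 4 characters
start_year = raw_dataset['start_date'].str[:4].astype(int).to_numpy()
end_year = raw_dataset['end_date'].str[:4].astype(int).to_numpy()

# Number of active years per input row (end year inclusive);
# assumes end_year >= start_year for every row
counts = end_year - start_year + 1

# Repeat each ID and each start year once per active year
ids = np.repeat(raw_dataset['ID'].to_numpy(), counts)
base = np.repeat(start_year, counts)

# Position of each output row within its group: 0, 1, ..., counts[i] - 1
group_starts = np.cumsum(counts) - counts
offsets = np.arange(counts.sum()) - np.repeat(group_starts, counts)

vectorized_df = pd.DataFrame({'id': ids, 'active_years': base + offsets})
print(vectorized_df)
```

Whether this beats the list-based loop on the real data would have to be measured; it mainly moves the per-year work from Python into numpy.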

Building on Xukrao's approach, the string-splitting method from the question (str.split) might be even faster than converting to datetime:

import timeit

setup = """
import pandas as pd
raw_dataset = pd.DataFrame({
    'ID':['a121','b142','cd3'],
    'start_date':['2019-10-09','2017-02-06','2012-12-05'],
    'end_date':['2020-01-30','2019-08-23','2016-06-18'],
})
"""

code0 = """
raw_dataset['start_year'] = pd.to_datetime(raw_dataset['start_date']).dt.year
raw_dataset['end_year'] = pd.to_datetime(raw_dataset['end_date']).dt.year
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    for year in range(row.start_year, row.end_year+1):
        id_list.append(row.ID)
        active_years_list.append(year)
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
"""

code1 = """
id_list = []
active_years_list = []
for row in raw_dataset.itertuples():
    y0, y1 = int(row.start_date.split('-')[0]), int(row.end_date.split('-')[0])
    for year in range(y0, y1 + 1):  # + 1 so the end year is included, as in code0
        id_list.append(row.ID)
        active_years_list.append(year)
desired_df = pd.DataFrame({
    'id': id_list,
    'active_years': active_years_list,
})
"""

n = 1000
t0 = timeit.timeit(stmt=code0, setup=setup, number=n)
t1 = timeit.timeit(stmt=code1, setup=setup, number=n)

print(f"/w datetime conv.: {t0/n:.3E}, stringsplit: {t1/n:.3E}, ratio: {t0/t1:.2f}")
# /w datetime conv.: 2.340E-03, stringsplit: 8.047E-04, ratio: 2.91
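
The timing above uses only three rows, so fixed overhead may dominate and the ratio could look different at the question's scale. Below is a rough sketch for rerunning the comparison on about 41,500 rows. The synthetic data (`big_dataset`, generated with a fixed seed) and the function names `with_datetime` and `with_split` are assumptions made for illustration, not the asker's real data. No timings are quoted here because they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical synthetic data at roughly the question's scale (41,500 rows)
rng = np.random.default_rng(0)
n_rows = 41_500
first_years = rng.integers(1990, 2015, size=n_rows)
spans = rng.integers(0, 10, size=n_rows)  # 0 to 9 extra years per row
big_dataset = pd.DataFrame({
    'ID': [f'id{i}' for i in range(n_rows)],
    'start_date': [f'{y}-01-15' for y in first_years],
    'end_date': [f'{y + s}-06-30' for y, s in zip(first_years, spans)],
})


def with_datetime(df):
    """Extract years via pd.to_datetime, then expand using Python lists."""
    df = df.assign(
        start_year=pd.to_datetime(df['start_date']).dt.year,
        end_year=pd.to_datetime(df['end_date']).dt.year,
    )
    id_list, active_years_list = [], []
    for row in df.itertuples():
        for year in range(row.start_year, row.end_year + 1):
            id_list.append(row.ID)
            active_years_list.append(year)
    return pd.DataFrame({'id': id_list, 'active_years': active_years_list})


def with_split(df):
    """Extract years via str.split, then expand using Python lists."""
    id_list, active_years_list = [], []
    for row in df.itertuples():
        y0 = int(row.start_date.split('-')[0])
        y1 = int(row.end_date.split('-')[0])
        for year in range(y0, y1 + 1):
            id_list.append(row.ID)
            active_years_list.append(year)
    return pd.DataFrame({'id': id_list, 'active_years': active_years_list})


n = 3  # few repetitions, since each run processes the full dataset
t0 = timeit.timeit(lambda: with_datetime(big_dataset), number=n)
t1 = timeit.timeit(lambda: with_split(big_dataset), number=n)
print(f"w/ datetime conv.: {t0/n:.3E} s, string split: {t1/n:.3E} s")
```

Both functions should return identical dataframes, which is worth checking (for example with `DataFrame.equals`) before comparing their speed.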
