
Insert first row to each group in pandas dataframe

I have a large CSV file containing historical stock prices. This is a small sample of it:

import pandas as pd

data = pd.DataFrame({'sym': {0: 'msft', 1: 'msft', 2: 'tsla', 3: 'tsla', 4: 'bac', 5: 'bac'}, 'date': {0: '12/7/2021', 1: '12/6/2021', 2: '12/7/2021', 3: '12/6/2021', 4: '12/7/2021', 5: '12/6/2021'}, 'high': {0: 11, 1: 13, 2: 898, 3: 900, 4: 12, 5: 13}})

Each day there will be an update to this data, and I want to append it to the data above. The updates look like this:

update = pd.DataFrame({'sym': {0: 'msft', 1: 'tsla', 2: 'bac'}, 'date': {0: '12/8/2021', 1: '12/8/2021', 2: '12/8/2021'}, 'high': {0: 16, 1: 1000, 2: 14}})

What I want is the dataframe below:

result = pd.DataFrame({'sym': {0: 'msft', 1: 'msft', 2: 'msft', 3: 'tsla', 4: 'tsla', 5: 'tsla', 6: 'bac', 7: 'bac', 8: 'bac'}, 'date': {0: '12/8/2021', 1: '12/7/2021', 2: '12/6/2021', 3: '12/8/2021', 4: '12/7/2021', 5: '12/6/2021', 6: '12/8/2021', 7: '12/7/2021', 8: '12/6/2021'}, 'high': {0: 16, 1: 11, 2: 13, 3: 1000, 4: 898, 5: 900, 6: 14, 7: 12, 8: 13}})

My current approach is using this code:

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
data = pd.concat([data, update])
data = data.sort_values(by=['sym', 'date'])

By tweaking the above approach I can get what I want, but since I have millions of rows in my database, I was wondering whether there is a faster way than using sort_values.

result = pd.merge_ordered(data, update, on=['date', 'high'], left_by='sym', fill_method='ffill').drop(['sym_x', 'sym_y'], axis=1)

IIUC, you want to keep the order of sym as it appears in data while sorting date in descending order within each symbol. You can do that by converting the sym column to a categorical and setting its category order to the order in which the symbols appear in data. Then simply sort_values by ['sym', 'date']:

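# Symbols in the order they first appear in the original data
sorter = data['sym'].drop_duplicates()
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
out = pd.concat([data, update])
out['sym'] = out['sym'].astype("category").cat.set_categories(sorter)
# sym ascending (by category order), date descending
out = out.sort_values(by=['sym', 'date'], ascending=[True, False]).reset_index(drop=True)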

Output:

    sym       date  high
0  msft  12/8/2021    16
1  msft  12/7/2021    11
2  msft  12/6/2021    13
3  tsla  12/8/2021  1000
4  tsla  12/7/2021   898
5  tsla  12/6/2021   900
6   bac  12/8/2021    14
7   bac  12/7/2021    12
8   bac  12/6/2021    13
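
One caveat with the sort above: the date column in the sample holds strings, so sort_values compares them character by character. That gives the right order for 12/6 through 12/8, but a string such as '12/10/2021' would sort before '12/6/2021'. Below is a minimal sketch of the same categorical sort with the dates parsed first. It assumes the dates are month/day/year strings; the helper name append_update and the 12/10/2021 update rows are made up for illustration and are not part of the question.

```python
import pandas as pd

data = pd.DataFrame({
    'sym': ['msft', 'msft', 'tsla', 'tsla', 'bac', 'bac'],
    'date': ['12/7/2021', '12/6/2021'] * 3,
    'high': [11, 13, 898, 900, 12, 13],
})
# Hypothetical update with a two-digit day, which would break string sorting
update = pd.DataFrame({
    'sym': ['msft', 'tsla', 'bac'],
    'date': ['12/10/2021'] * 3,
    'high': [16, 1000, 14],
})


def append_update(data, update):
    """Append update to data; keep sym in first-seen order, newest date first."""
    # Symbols in the order they first appear in the existing data
    sym_order = data['sym'].drop_duplicates().tolist()
    out = pd.concat([data, update], ignore_index=True)
    # Assumption: dates are month/day/year strings
    out['date'] = pd.to_datetime(out['date'], format='%m/%d/%Y')
    out['sym'] = pd.Categorical(out['sym'], categories=sym_order, ordered=True)
    # sym ascending (by category order), date descending
    return (
        out.sort_values(by=['sym', 'date'], ascending=[True, False])
        .reset_index(drop=True)
    )


out = append_update(data, update)
print(out)
```

Note that a symbol present only in update is not in sym_order, so it would become NaN in the sym column; extend the category list first if new symbols can appear.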


 