Insert first row to each group in pandas dataframe
I have a large CSV file containing the historical prices of stocks. This is a small sample of it:
data = pd.DataFrame({'sym': {0: 'msft', 1: 'msft', 2: 'tsla', 3: 'tsla', 4: 'bac', 5: 'bac'}, 'date': {0: '12/7/2021', 1: '12/6/2021', 2: '12/7/2021', 3: '12/6/2021', 4: '12/7/2021', 5: '12/6/2021'}, 'high': {0: 11, 1: 13, 2: 898, 3: 900, 4: 12, 5: 13}})
Every day there will be an update to this data, and I want to append it to the data above. The updates look like this:
update = pd.DataFrame({'sym': {0: 'msft', 1: 'tsla', 2: 'bac'}, 'date': {0: '12/8/2021', 1: '12/8/2021', 2: '12/8/2021'}, 'high': {0: 16, 1: 1000, 2: 14}})
What I want is the dataframe below:
result = pd.DataFrame({'sym': {0: 'msft', 1: 'msft', 2: 'msft', 3: 'tsla', 4: 'tsla', 5: 'tsla', 6: 'bac', 7: 'bac', 8: 'bac'}, 'date': {0: '12/8/2021', 1: '12/7/2021', 2: '12/6/2021', 3: '12/8/2021', 4: '12/7/2021', 5: '12/6/2021', 6: '12/8/2021', 7: '12/7/2021', 8: '12/6/2021'}, 'high': {0: 16, 1: 11, 2: 13, 3: 1000, 4: 898, 5: 900, 6: 14, 7: 12, 8: 13}})
My current approach is using this code:
data = data.append(update)
data = data.sort_values(by=['sym', 'date'])
By tweaking the above approach I can achieve what I want, but since I have millions of rows in my database, I was wondering if there is a faster way than using sort_values.
result=pd.merge_ordered(data,update,on=['date','high'],left_by='sym',fill_method='ffill').drop(['sym_x','sym_y'],axis=1)
IIUC, you want to keep the order of sym as it appears in data while sorting date in descending order. You can do that by converting the sym column to a category dtype and setting its category order to the order in which the symbols first appear in data. Then simply sort_values by ['sym','date']:
sorter = data['sym'].drop_duplicates()  # symbols in order of first appearance
out = pd.concat([data, update])  # DataFrame.append was removed in pandas 2.0
out['sym'] = out['sym'].astype("category").cat.set_categories(sorter)
out = out.sort_values(by=['sym','date'], ascending=[True, False]).reset_index(drop=True)
Output:
    sym       date  high
0  msft  12/8/2021    16
1  msft  12/7/2021    11
2  msft  12/6/2021    13
3  tsla  12/8/2021  1000
4  tsla  12/7/2021   898
5  tsla  12/6/2021   900
6   bac  12/8/2021    14
7   bac  12/7/2021    12
8   bac  12/6/2021    13
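A possible variation on the same categorical idea, sketched under two assumptions that the question does not state outright: data is already ordered newest-first within each sym, and every row in update is newer than anything in data. In that case the dates never need to be compared at all. Placing update ahead of data and doing a stable sort on sym alone keeps each update row at the top of its group. This also sidesteps the fact that date is stored as strings, which compare lexicographically (as strings, '12/10/2021' compares less than '12/6/2021'). The helper name prepend_updates is hypothetical, and whether this is actually faster than the two-key sort is worth timing on the real data.

```python
import pandas as pd

data = pd.DataFrame({
    "sym": ["msft", "msft", "tsla", "tsla", "bac", "bac"],
    "date": ["12/7/2021", "12/6/2021"] * 3,
    "high": [11, 13, 898, 900, 12, 13],
})
update = pd.DataFrame({
    "sym": ["msft", "tsla", "bac"],
    "date": ["12/8/2021"] * 3,
    "high": [16, 1000, 14],
})


def prepend_updates(data, update):
    """Hypothetical helper: put each symbol's update rows ahead of its existing rows.

    Assumes `data` is already newest-first within each symbol and that
    `update` only contains newer rows for symbols already present in `data`
    (an unseen symbol would become NaN in the categorical column).
    """
    # Symbols in order of first appearance define the group order.
    order = data["sym"].drop_duplicates().tolist()
    # Update rows go first so a stable sort leaves them at the top of each group.
    out = pd.concat([update, data], ignore_index=True)
    out["sym"] = pd.Categorical(out["sym"], categories=order)
    # Sorting a categorical column follows the category order, not alphabetical order.
    return out.sort_values("sym", kind="stable", ignore_index=True)


result = prepend_updates(data, update)
print(result)
```

Note that sym comes back as a category column; result['sym'].astype(str) converts it back if plain strings are needed downstream.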