Pandas - call user defined function on subset dataframe
I am creating a count function on subsets of a Pandas DataFrame and intend to export dictionary/spreadsheet data that consists only of the groupby criteria and the counting results.
In [1]: df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                           ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                           ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                           ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                           ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                           ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                           ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                           ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                           ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am']],
                          columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])
Out[1]:
   side sender   id     type  quantity         receive_time
0   Buy      A  123      NEW       500  20190101-09:00:00am
1   Buy      A  124      CXL       500  20190101-09:00:01am
2   Buy      A  125      NEW       500  20190101-09:00:03am
3   Buy      A  126  REPLACE       300  20190101-09:00:10am
4   Buy      B  210      NEW      1000  20190101-09:10:00am
5   Buy      B  345      NEW       200  20190101-09:00:00am
6  Sell      C  412      NEW       100  20190101-09:00:00am
7  Sell      C  413      NEW       200  20190101-09:01:00am
8  Sell      C  414      CXL        50  20190101-09:02:00am
The count function is shown below (mydf is passed in as a subset of the dataframe):
def ordercount(mydf):
    num = 0.0
    if mydf.type == 'NEW':
        num = num + mydf.qty
    elif mydf.type == 'REPLACE':
        num = mydf.qty
    elif mydf.type == 'CXL':
        num = num - mydf.qty
    else:
        pass
    orderdict = dict.fromkeys([mydf.side, mydf.sender, mydf.id], num)
    return orderdict
After reading the data from csv, I group it by some criteria and also sort by time:
df = pd.read_csv('xxxxxxxxx.csv', sep='|', header=0, engine='python', names=col_names)
sorted_df = df.groupby(['side', 'sender', 'id']).apply(lambda _df: _df.sort_values(by=['time']))
Then I call the previously defined function on the sorted data:
print(sorted_df.agg(ordercount))
But a ValueError keeps popping up, saying there are too many lines to call.
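The full traceback is not shown, so the exact cause is an assumption, but one common source of a ValueError in code like this is using a whole-column comparison such as `mydf.type == 'NEW'` as an `if` condition. A minimal sketch with a hypothetical three-row subset:

```python
import pandas as pd

# Hypothetical three-row subset; only the 'type' column matters here.
subset = pd.DataFrame({'type': ['NEW', 'CXL', 'NEW'],
                       'quantity': [500, 500, 500]})

# Comparing a column to a string is element-wise and yields a boolean Series.
mask = subset['type'] == 'NEW'

error_message = None
try:
    if mask:  # a multi-row Series has no single truth value
        pass
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# Using the mask to select rows works where a plain `if` does not.
new_quantity = subset.loc[mask, 'quantity'].sum()
print(new_quantity)
```

An `if` needs one True/False value, while the comparison returns one value per row, which is why the logic has to be applied row by row or through boolean masks.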
Counting with a function may not be efficient, but it is the most straightforward way I can think of to match order types and count quantities accordingly. I expect the program to output a table where only side, sender, id and counted quantity are shown. Is there any way to achieve this? Thanks.
Expected output:
   side sender  total_order_num  trade_date
0   Buy      A              300    20190101
1   Buy      B             1200    20190101
2  Sell      C              250    20190101
I believe your function is not easy to apply at once because you are doing different operations depending on the rows. This would be OK if you only had + and - as your operations, but you replace the value at some point and then continue on with the other operations. Because of that, a loop might just be simpler, or you can spend some time writing a nice function to accomplish the task.
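To make the point about order dependence concrete, here is a small sketch in plain Python. The helper names `running_total` and `signed_sum` are hypothetical; the event lists are the Buy/A and Sell/C rows from the question, already in time order.

```python
def running_total(events):
    """Replay (type, quantity) events in time order and return the final quantity."""
    num = 0
    for type_, quantity in events:
        if type_ == 'NEW':
            num += quantity
        elif type_ == 'REPLACE':
            num = quantity      # overwrites everything accumulated so far
        elif type_ == 'CXL':
            num -= quantity
    return num


def signed_sum(events):
    """Naive alternative: treat every event as a plain + or - contribution."""
    signs = {'NEW': 1, 'REPLACE': 1, 'CXL': -1}
    return sum(signs.get(type_, 0) * quantity for type_, quantity in events)


buy_a = [('NEW', 500), ('CXL', 500), ('NEW', 500), ('REPLACE', 300)]
sell_c = [('NEW', 100), ('NEW', 200), ('CXL', 50)]

print(running_total(buy_a), signed_sum(buy_a))    # 300 800
print(running_total(sell_c), signed_sum(sell_c))  # 250 250
```

The two approaches agree for Sell/C, which has only NEW and CXL rows, and disagree for Buy/A, where the REPLACE row resets the total.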
This is what I have. All I really did was change your ordercount so that it operates directly on a subset, which you can get by simply grouping. You can either sort by time before grouping or do it in the ordercount function. Hopefully this helps a bit.
import pandas as pd

df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                   ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                   ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                   ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                   ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                   ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                   ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                   ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                   ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am']],
                  columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])

# An explicit format avoids relying on pandas guessing this timestamp layout.
df['receive_time'] = pd.to_datetime(df['receive_time'], format='%Y%m%d-%I:%M:%S%p')
df['receive_date'] = df['receive_time'].dt.date  # you do not need the time stamps


def ordercount(mydf):
    # Sort the subset by time and keep only the two columns the loop needs.
    mydf_ = mydf.sort_values('receive_time')[['type', 'quantity']].copy()
    num = 0
    for val in mydf_.values:
        # val is a pair like ['NEW', 500]; unpack it into two variables.
        type_, quantity = val
        if type_ == 'NEW':
            num += quantity
        elif type_ == 'REPLACE':
            num = quantity
        elif type_ == 'CXL':
            num -= quantity
        else:
            pass
    return num


mydf = df.groupby(['side', 'sender', 'receive_date']).apply(ordercount).reset_index()
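If the loop becomes a bottleneck, the same rule can be expressed without per-row Python code by noting that only the last REPLACE in a group and the rows after it affect the result. The following is a sketch of that idea rather than part of the original answer. The function name `order_totals` and the helper columns `is_replace` and `signed` are hypothetical, and the `id` column is left out because it is not used.

```python
import pandas as pd

df = pd.DataFrame(
    [['Buy', 'A', 'NEW', 500, '20190101-09:00:00am'],
     ['Buy', 'A', 'CXL', 500, '20190101-09:00:01am'],
     ['Buy', 'A', 'NEW', 500, '20190101-09:00:03am'],
     ['Buy', 'A', 'REPLACE', 300, '20190101-09:00:10am'],
     ['Buy', 'B', 'NEW', 1000, '20190101-09:10:00am'],
     ['Buy', 'B', 'NEW', 200, '20190101-09:00:00am'],
     ['Sell', 'C', 'NEW', 100, '20190101-09:00:00am'],
     ['Sell', 'C', 'NEW', 200, '20190101-09:01:00am'],
     ['Sell', 'C', 'CXL', 50, '20190101-09:02:00am']],
    columns=['side', 'sender', 'type', 'quantity', 'receive_time'])
df['receive_time'] = pd.to_datetime(df['receive_time'], format='%Y%m%d-%I:%M:%S%p')
df['receive_date'] = df['receive_time'].dt.date


def order_totals(frame, keys):
    """Sum signed quantities from each group's last REPLACE row onward."""
    frame = frame.sort_values('receive_time', kind='stable').copy()
    frame['is_replace'] = frame['type'].eq('REPLACE').astype(int)
    grouped = frame.groupby(keys)['is_replace']
    seen = grouped.cumsum()            # REPLACE rows up to and including this row
    total = grouped.transform('sum')   # REPLACE rows in the whole group
    # NEW and REPLACE add, CXL subtracts, any other type contributes nothing.
    sign = frame['type'].map({'NEW': 1, 'REPLACE': 1, 'CXL': -1}).fillna(0).astype(int)
    frame['signed'] = frame['quantity'] * sign
    # Keep only rows that have no later REPLACE in their group.
    live = frame[seen == total]
    return live.groupby(keys)['signed'].sum().reset_index(name='total_order_num')


totals = order_totals(df, ['side', 'sender', 'receive_date'])
print(totals)
```

Passing `name='total_order_num'` to `reset_index` also avoids the unnamed `0` column that the `apply` version produces.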
Output:
|----|--------|----------|---------------------|------|
| | side | sender | receive_date | 0 |
|----|--------|----------|---------------------|------|
| 0 | Buy | A | 2019-01-01 00:00:00 | 300 |
|----|--------|----------|---------------------|------|
| 1 | Buy | B | 2019-01-01 00:00:00 | 1200 |
|----|--------|----------|---------------------|------|
| 2 | Sell | C | 2019-01-01 00:00:00 | 250 |
|----|--------|----------|---------------------|------|
You can easily rename the column '0' as you wish. I am still not sure how your trade_date is defined. Will your data only have one date? What happens when you have more than one date? Are you taking the min?...
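A minimal sketch of the renaming step, assuming trade_date is simply the receive date (the question does not confirm this). The `result` frame below is a hypothetical stand-in for the grouped output shown above.

```python
import datetime

import pandas as pd

# Hypothetical stand-in for the grouped result; the unnamed column's label
# is the integer 0, not the string '0'.
result = pd.DataFrame({
    'side': ['Buy', 'Buy', 'Sell'],
    'sender': ['A', 'B', 'C'],
    'receive_date': [datetime.date(2019, 1, 1)] * 3,
    0: [300, 1200, 250],
})

# Assumption: trade_date is the receive date.
report = result.rename(columns={0: 'total_order_num', 'receive_date': 'trade_date'})

# Format dates as YYYYMMDD strings to match the layout in the question.
report['trade_date'] = pd.to_datetime(report['trade_date']).dt.strftime('%Y%m%d')
report = report[['side', 'sender', 'total_order_num', 'trade_date']]

# One dictionary per row, for the dictionary-style export the question mentions.
records = report.to_dict(orient='records')
print(report)
```

For a spreadsheet-style export, `report.to_csv(...)` with a path of your choice would write the same table to disk.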
Edit: If you try it with this dataframe, you can see the groups with the dates working as expected.
df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am'],
['Buy', 'A', 123, 'NEW', 500, '20190102-09:00:00am'],
['Buy', 'A', 124, 'CXL', 500, '20190102-09:00:01am'],
['Buy', 'A', 125, 'NEW', 500, '20190102-09:00:03am'],
['Buy', 'A', 126, 'REPLACE', 300, '20190102-09:00:10am'],
['Buy', 'B', 210, 'NEW', 1000, '20190102-09:10:00am'],
['Buy', 'B', 345, 'NEW', 200, '20190102-09:00:00am'],
['Sell', 'C', 412, 'NEW', 100, '20190102-09:00:00am'],
['Sell', 'C', 413, 'NEW', 200, '20190102-09:01:00am'],
['Sell', 'C', 414, 'CXL', 50, '20190102-09:02:00am']],
columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])