Pandas: calling a user-defined function on subsets of a DataFrame

Question

I am writing a count function that runs on subsets of a Pandas DataFrame, and I intend to export a dictionary/spreadsheet that contains only the groupby criteria and the counting results.

In [1]: df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                           ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                           ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                           ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                           ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                           ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                           ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                           ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                           ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am']],
                          columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])

Out[1]: 
   side  sender  id    type     quantity  receive_time 
0  Buy   A       123   NEW      500       20190101-09:00:00am
1  Buy   A       124   CXL      500       20190101-09:00:01am
2  Buy   A       125   NEW      500       20190101-09:00:03am
3  Buy   A       126   REPLACE  300       20190101-09:00:10am
4  Buy   B       210   NEW      1000      20190101-09:10:00am
5  Buy   B       345   NEW      200       20190101-09:00:00am
6  Sell  C       412   NEW      100       20190101-09:00:00am
7  Sell  C       413   NEW      200       20190101-09:01:00am
8  Sell  C       414   CXL      50        20190101-09:02:00am

The count function is shown below (mydf is passed in as a subset of the dataframe):

def ordercount(mydf):
   num = 0.0
   if mydf.type == 'NEW':
      num = num + mydf.qty
   elif mydf.type == 'REPLACE':
      num = mydf.qty
   elif mydf.type == 'CXL':
      num = num - mydf.qty
   else: 
      pass
   orderdict = dict.fromkeys([mydf.side, mydf.sender, mydf.id], num)
   return orderdict

After reading the data from a csv file, I group it by some criteria and also sort it by time:

df = pd.read_csv('xxxxxxxxx.csv', sep='|', header=0, engine='python', names=col_names)
sorted_df = df.groupby(['side', 'sender', 'id']).apply(lambda _df: _df.sort_values(by=['time']))

Then I call the previously defined function on the sorted data:

print(sorted_df.agg(ordercount))

But I keep getting a ValueError saying there are too many lines to call.

Counting with a function may not be efficient, but it is the most straightforward way I can think of to match order types and count quantities accordingly. I expect the program to output a table showing only side, sender, id and the counted quantity. Is there any way to achieve this? Thanks.

Expected output:

   side   sender   total_order_num   trade_date 
0  Buy    A        300               20190101
1  Buy    B        1200              20190101
2  Sell   C        250               20190101

Answer 1

I believe your function is not easy to apply in one shot because it performs a different operation depending on the row. That would be fine if the only operations were + and -, but at some point you replace the value and then continue with the other operations. Because of that, a loop may simply be easier, or you can spend some time writing a nicer function to accomplish the task.
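To illustrate the point about + and - versus replace, here is a minimal sketch that assumes a cut-down version of the question's data (no id or time columns, rows already in time order). Giving each row a sign and summing per group works for NEW and CXL, but REPLACE has no fixed sign, so the shortcut gives the wrong total for any group that contains one.

import pandas as pd

# Cut-down sample: rows are assumed to be in time order already.
df = pd.DataFrame(
    [['Buy', 'A', 'NEW', 500],
     ['Buy', 'A', 'CXL', 500],
     ['Buy', 'A', 'NEW', 500],
     ['Buy', 'A', 'REPLACE', 300],
     ['Sell', 'C', 'NEW', 100],
     ['Sell', 'C', 'NEW', 200],
     ['Sell', 'C', 'CXL', 50]],
    columns=['side', 'sender', 'type', 'quantity'])

# NEW adds and CXL subtracts, so each row can be given a sign up front.
# REPLACE overwrites the running total instead, so no sign fits it;
# it is mapped to 0 here, which is exactly where the shortcut breaks down.
sign = df['type'].map({'NEW': 1, 'CXL': -1, 'REPLACE': 0})
signed_sum = (df['quantity'] * sign).groupby([df['side'], df['sender']]).sum()
print(signed_sum)

Sell/C has only NEW and CXL rows, so its signed sum of 250 matches the expected output. Buy/A comes out as 500 rather than the expected 300, because the REPLACE row should have reset the count.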

This is what I have. All I really did was change your ordercount so that it operates directly on a subset, which you get by simply grouping. You can either sort by time before grouping or do it inside the ordercount function. Hopefully this helps a bit.

import pandas as pd
df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                   ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                   ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                   ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                   ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                   ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                   ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                   ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                   ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am']],
columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])

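# An explicit format avoids relying on pandas to guess this non-standard layout.
df['receive_time'] = pd.to_datetime(df['receive_time'], format='%Y%m%d-%I:%M:%S%p')
df['receive_date'] = df['receive_time'].dt.date  # grouping only needs the date, not the full timestamp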


def ordercount(mydf):
    mydf_ = mydf.sort_values('receive_time')[['type', 'quantity']].copy()
    num = 0
    for val in mydf_.values:
        type_, quantity = val
        # val is going to be a list like ['NEW', 500]. All I am doing above is unpack the list into two variables.
        # You can find many resources on unpacking iterables
        if type_ == 'NEW':
            num += quantity
        elif type_ == 'REPLACE':
            num = quantity
        elif type_ == 'CXL':
            num -= quantity
        else:
            pass
    return num

mydf = df.groupby(['side', 'sender', 'receive_date']).apply(ordercount).reset_index()

Output:

|----|--------|----------|---------------------|------|
|    | side   | sender   | receive_date        |    0 |
|----|--------|----------|---------------------|------|
|  0 | Buy    | A        | 2019-01-01 00:00:00 |  300 |
|----|--------|----------|---------------------|------|
|  1 | Buy    | B        | 2019-01-01 00:00:00 | 1200 |
|----|--------|----------|---------------------|------|
|  2 | Sell   | C        | 2019-01-01 00:00:00 |  250 |
|----|--------|----------|---------------------|------|

You can easily rename the column 0 as you wish. I am still not sure how your trade_date is defined. Will your data only ever have one date? What happens when there is more than one date? Are you taking the min?
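As a sketch of the renaming step, the snippet below uses a hypothetical stand-in Series called counts in place of the grouped result, and borrows the name total_order_num from the expected output in the question. Note that the default column label is the integer 0, not the string '0'.

import pandas as pd

# Hypothetical stand-in for the result of groupby(...).apply(ordercount):
# an unnamed Series indexed by the group keys.
counts = pd.Series(
    [300, 1200, 250],
    index=pd.MultiIndex.from_tuples(
        [('Buy', 'A'), ('Buy', 'B'), ('Sell', 'C')],
        names=['side', 'sender']))

# Option 1: name the value column while resetting the index.
named = counts.reset_index(name='total_order_num')

# Option 2: rename afterwards; the key is the integer 0.
renamed = counts.reset_index().rename(columns={0: 'total_order_num'})

print(named)

Both options produce the same frame; the first just saves a step.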

Edit: If you try it with this dataframe, you can see the grouping by date working as expected.

df = pd.DataFrame([['Buy', 'A', 123, 'NEW', 500, '20190101-09:00:00am'],
                   ['Buy', 'A', 124, 'CXL', 500, '20190101-09:00:01am'],
                   ['Buy', 'A', 125, 'NEW', 500, '20190101-09:00:03am'],
                   ['Buy', 'A', 126, 'REPLACE', 300, '20190101-09:00:10am'],
                   ['Buy', 'B', 210, 'NEW', 1000, '20190101-09:10:00am'],
                   ['Buy', 'B', 345, 'NEW', 200, '20190101-09:00:00am'],
                   ['Sell', 'C', 412, 'NEW', 100, '20190101-09:00:00am'],
                   ['Sell', 'C', 413, 'NEW', 200, '20190101-09:01:00am'],
                   ['Sell', 'C', 414, 'CXL', 50, '20190101-09:02:00am'],
                   ['Buy', 'A', 123, 'NEW', 500, '20190102-09:00:00am'],
                   ['Buy', 'A', 124, 'CXL', 500, '20190102-09:00:01am'],
                   ['Buy', 'A', 125, 'NEW', 500, '20190102-09:00:03am'],
                   ['Buy', 'A', 126, 'REPLACE', 300, '20190102-09:00:10am'],
                   ['Buy', 'B', 210, 'NEW', 1000, '20190102-09:10:00am'],
                   ['Buy', 'B', 345, 'NEW', 200, '20190102-09:00:00am'],
                   ['Sell', 'C', 412, 'NEW', 100, '20190102-09:00:00am'],
                   ['Sell', 'C', 413, 'NEW', 200, '20190102-09:01:00am'],
                   ['Sell', 'C', 414, 'CXL', 50, '20190102-09:02:00am']],
columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])
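For reference, here is a self-contained sketch of running the grouped count on two days of data. It rebuilds the same rows programmatically, assumes the timestamp format '%Y%m%d-%I:%M:%S%p', and selects the needed columns before apply so the function never sees the grouping columns. The result column name total_order_num is taken from the question's expected output.

import pandas as pd

# The same nine orders, repeated for two trading days.
one_day = [('Buy', 'A', 123, 'NEW', 500, '09:00:00am'),
           ('Buy', 'A', 124, 'CXL', 500, '09:00:01am'),
           ('Buy', 'A', 125, 'NEW', 500, '09:00:03am'),
           ('Buy', 'A', 126, 'REPLACE', 300, '09:00:10am'),
           ('Buy', 'B', 210, 'NEW', 1000, '09:10:00am'),
           ('Buy', 'B', 345, 'NEW', 200, '09:00:00am'),
           ('Sell', 'C', 412, 'NEW', 100, '09:00:00am'),
           ('Sell', 'C', 413, 'NEW', 200, '09:01:00am'),
           ('Sell', 'C', 414, 'CXL', 50, '09:02:00am')]
rows = [(side, sender, id_, type_, qty, f'{day}-{clock}')
        for day in ('20190101', '20190102')
        for side, sender, id_, type_, qty, clock in one_day]
df = pd.DataFrame(rows, columns=['side', 'sender', 'id', 'type', 'quantity', 'receive_time'])
df['receive_time'] = pd.to_datetime(df['receive_time'], format='%Y%m%d-%I:%M:%S%p')
df['receive_date'] = df['receive_time'].dt.date


def ordercount(group):
    """Walk one group's orders in time order and return the running quantity."""
    total = 0
    ordered = group.sort_values('receive_time')
    for type_, quantity in zip(ordered['type'], ordered['quantity']):
        if type_ == 'NEW':
            total += quantity
        elif type_ == 'REPLACE':
            total = quantity   # REPLACE resets the running total
        elif type_ == 'CXL':
            total -= quantity
    return total


result = (df.groupby(['side', 'sender', 'receive_date'])[['type', 'quantity', 'receive_time']]
            .apply(ordercount)
            .reset_index(name='total_order_num'))
print(result)

With the two-day frame, each (side, sender) pair appears once per date, with the same totals on both days.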

Answer 1: accepted, 2019-07-15 15:44:15
