Creating a loop to find out the number of sales within the first 20 days

I am new to Python and cannot figure out how to find the number of sales calls made within 20 days of the FIRST sale. The question asks me to work out the percentage of salespeople who made at least 10 sales calls in their first 20 days. Each row is a sales call; the salespeople are identified by the id column, and the time of the call is recorded in call_starttime.

The df is fairly simple and looks like this:

      id   call_starttime  level
0  66547  7/28/2015 23:18      1
1  66272  8/10/2015 20:48      0
2  66547  8/20/2015 17:32      2
3  66272  8/31/2015 18:21      0
4  66272  8/31/2015 20:25      0

I have already counted the number of calls per id and can filter out anyone who has not made at least 10 sales calls.

The code I am currently using is:

df_withcount=df.groupby(['cc_user_id','cc_cohort']).size().reset_index(name='count')
df_20andmore=df_withcount.loc[(df_withcount['count'] >= 20)]

I expect the output to give me the number of ids (salespeople) who made at least 10 calls in their first 20 days. As of now I can only figure out how to count those who made at least 10 calls over all time.

I assume that the call_starttime column is of datetime type.
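If the data was read from a text file, the column may still hold strings. A minimal sketch of the conversion follows, assuming the month/day/year layout shown in the question's sample; the format string is a guess based on that sample, not something the question states.

```python
import pandas as pd

# Sample rows from the question; call_starttime starts out as plain strings
df = pd.DataFrame({
    'id': [66547, 66272, 66547, 66272, 66272],
    'call_starttime': ['7/28/2015 23:18', '8/10/2015 20:48', '8/20/2015 17:32',
                       '8/31/2015 18:21', '8/31/2015 20:25'],
    'level': [1, 0, 2, 0, 0],
})

# Parse the strings into datetime values.
# The format is an assumption inferred from the sample rows.
df['call_starttime'] = pd.to_datetime(df['call_starttime'],
                                      format='%m/%d/%Y %H:%M')

print(df.dtypes)
```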

Let's start from a simplified solution, checking only the second call (not 10 subsequent calls).

I slightly changed your test data so that the person with id = 66272 has a second call within 20 days of the first (August 10 and 19):

      id      call_starttime  level
0  66547 2015-07-28 23:18:00      1
1  66272 2015-08-10 20:48:00      0
2  66547 2015-08-20 17:32:00      2
3  66272 2015-08-19 18:21:00      0
4  66272 2015-08-31 20:25:00      0

The first step is to define a function stating whether the current person is "active" (made the second call within 20 days of the first):

def active(grp):
    if grp.shape[0] < 2:
        return False  # Single call
    d0 = grp.call_starttime.iloc[0]
    d1 = grp.call_starttime.iloc[1]
    return (d1 - d0).days < 20

This function will be applied to each group of rows (one group per person).

To get detailed information on the activity of each person, you can run:

df.groupby('id').apply(active)

For my sample data the result is:

id
66272     True
66547    False
dtype: bool

But if you are interested only in the number of active people, use np.count_nonzero on the above result:

np.count_nonzero(df.groupby('id').apply(active))

For my sample data the result is 1.

If you want the percentage of active people, divide this number by df.id.unique().size and multiply by 100 to express the result as a percentage.
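The count and the percentage can be sketched together as follows. This is an adaptation rather than the exact code above: the function receives only the call_starttime values of each person (an assumption made here to keep the example independent of how groupby passes the grouping column), and the variable names n_active and pct are illustrative.

```python
import numpy as np
import pandas as pd

# Modified test data from this answer
df = pd.DataFrame({
    'id': [66547, 66272, 66547, 66272, 66272],
    'call_starttime': pd.to_datetime(['2015-07-28 23:18', '2015-08-10 20:48',
                                      '2015-08-20 17:32', '2015-08-19 18:21',
                                      '2015-08-31 20:25']),
    'level': [1, 0, 2, 0, 0],
})

def active(times):
    """True if the second call came fewer than 20 days after the first."""
    if len(times) < 2:
        return False  # Single call
    return (times.iloc[1] - times.iloc[0]).days < 20

is_active = df.groupby('id')['call_starttime'].apply(active)

n_active = np.count_nonzero(is_active)        # number of active people
pct = n_active * 100 / df.id.unique().size    # share of all people, in percent
print(n_active, pct)
```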

And now, how to change this solution to check whether a person has made at least 10 calls in the initial 20 days:

The only difference is that the active function should compare the dates of calls No 0 and 9.

So this function should be changed to:

def active(grp):
    if grp.shape[0] < 10:
        return False  # Too few calls
    d0 = grp.call_starttime.iloc[0]
    d1 = grp.call_starttime.iloc[9]
    return (d1 - d0).days < 20

I assume that the source rows are ordered by call_starttime. If this is not the case, call sort_values(by='call_starttime') on the DataFrame first.
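To illustrate the 10-call check on unsorted input, here is a small sketch under stated assumptions: the data is synthetic (three made-up ids, not from the question), the function sorts each person's calls itself, and the n_calls and n_days parameters are illustrative additions.

```python
import pandas as pd

def active(times, n_calls=10, n_days=20):
    """True if call number n_calls came fewer than n_days days after the first."""
    times = times.sort_values()          # do not rely on the input order
    if len(times) < n_calls:
        return False                     # Too few calls
    return (times.iloc[n_calls - 1] - times.iloc[0]).days < n_days

start = pd.Timestamp('2015-08-01 09:00')
df = pd.DataFrame({
    'id': [1] * 10 + [2] * 10 + [3] * 3,
    'call_starttime': (
        [start + pd.Timedelta(days=i) for i in range(10)]        # 10 calls in 10 days
        + [start + pd.Timedelta(days=3 * i) for i in range(10)]  # 10 calls over 28 days
        + [start + pd.Timedelta(days=i) for i in range(3)]       # only 3 calls
    ),
})
df = df.sample(frac=1, random_state=0)   # shuffle to simulate unsorted rows

result = df.groupby('id')['call_starttime'].apply(active)
pct = result.mean() * 100                # percentage of people meeting the criterion
print(result)
print(round(pct, 1))
```

Only id 1 meets the criterion here, so one person out of three is reported as active.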

Edit following your comment

I came up with another solution that groups by the level column, places no requirement on the sort order of the source data, and is easy to parametrize in terms of the number of initial days and the number of calls in that period.

Test DataFrame:

      id      call_starttime  level
0  66547 2015-07-28 23:18:00      1
1  66272 2015-08-10 19:48:00      0
2  66547 2015-08-20 17:32:00      1
3  66272 2015-08-19 18:21:00      0
4  66272 2015-08-29 20:25:00      0
5  66777 2015-08-30 20:00:00      0

Level 0 contains one person with 3 calls within the first 20 days (August 10, 19 and 29). Note however that the last call is at a later hour than the first, so these 2 timestamps are actually more than 19 days apart, but since my solution clears the time component, this last call will be counted.

Start by defining a function:

def activity(grp, dayNo):
    stDates = grp.dt.floor('D')  # Delete time component
    # Leave dates from starting "dayNo" days
    stDates = stDates[stDates < stDates.min() + pd.offsets.Day(dayNo)]
    return stDates.size

This gives the number of calls made by a particular person (a group of call_starttime values) within the first dayNo days.

The next function to define is:

def percentage(s, callNo):
    return s[s >= callNo].size * 100 / s.size

This computes the percentage of values in s (a Series for the current level) which are >= callNo.

The first processing step is to compute a Series holding the number of calls within the defined "starting period" for each level / id:

calls = df.groupby(['level', 'id']).call_starttime.apply(activity, dayNo=20)

The result (for my data) is:

level  id   
0      66272    3
       66777    1
1      66547    1
Name: call_starttime, dtype: int64

To get the final result (percentages for each level, assuming the requirement is to make 3 calls), run:

calls.groupby(level=0).apply(percentage, callNo=3)

Note that level=0 above is a reference to the MultiIndex level, not to the column name.
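The distinction can be checked on a small stand-in for the calls Series. In this sketch the Series is rebuilt by hand from the output shown earlier, and max() is used only as a simple aggregation for the comparison; grouping by position 0 and grouping by the index level named 'level' give the same result.

```python
import pandas as pd

# Hand-built copy of the `calls` Series shown above
idx = pd.MultiIndex.from_tuples(
    [(0, 66272), (0, 66777), (1, 66547)], names=['level', 'id'])
calls = pd.Series([3, 1, 1], index=idx, name='call_starttime')

# level=0 means "the first level of the MultiIndex" (a position) ...
by_position = calls.groupby(level=0).max()
# ... which here happens to be the index level named 'level'
by_name = calls.groupby(level='level').max()

print(by_position.equals(by_name))
```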

The result (again for my data) is:

level
0    50.0
1     0.0
Name: call_starttime, dtype: float64

Level 0 has one person meeting the criterion (out of 2 people at this level), so the percentage is 50, and at level 1 nobody meets the criterion, so the percentage is 0.

Note that the dayNo and callNo parameters make it easy to adjust the length of the "initial period" for each person and the number of calls to be made in this period.

The computation described above is for 3 calls, but in your case change callNo to your value, i.e. 10.

As you can see, this solution is quite short (only 8 lines of code), much shorter and much more "Pandasonic" than the other solution.

If you prefer a "terse" coding style, you can also do the whole computation in a single (although heavily chained) instruction:

df.groupby(['level', 'id']).call_starttime\
    .apply(activity, dayNo=20).rename('Percentage')\
    .groupby(level=0).apply(percentage, callNo=3)

I added .rename('Percentage') to change the name of the result Series.
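For convenience, the pieces of this second solution can be assembled into one runnable script. This is a sketch that rebuilds the test DataFrame by hand from the listing above and uses the uppercase 'D' frequency alias; everything else follows the functions as given.

```python
import pandas as pd

# Test DataFrame from this answer, rebuilt by hand
df = pd.DataFrame({
    'id': [66547, 66272, 66547, 66272, 66272, 66777],
    'call_starttime': pd.to_datetime([
        '2015-07-28 23:18', '2015-08-10 19:48', '2015-08-20 17:32',
        '2015-08-19 18:21', '2015-08-29 20:25', '2015-08-30 20:00']),
    'level': [1, 0, 1, 0, 0, 0],
})

def activity(grp, dayNo):
    stDates = grp.dt.floor('D')  # Delete time component
    # Keep only dates from the first "dayNo" days
    stDates = stDates[stDates < stDates.min() + pd.offsets.Day(dayNo)]
    return stDates.size

def percentage(s, callNo):
    # Share of people (in percent) with at least callNo calls
    return s[s >= callNo].size * 100 / s.size

# Number of calls in the starting period for each level / id
calls = df.groupby(['level', 'id']).call_starttime.apply(activity, dayNo=20)
# Percentage of people per level who made at least 3 such calls
result = calls.groupby(level=0).apply(percentage, callNo=3)
print(result)
```

For the question itself, callNo would be set to 10 instead of 3.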

I used a Person class to help solve this problem.

  1. Created a dataframe
  2. Converted call_start_time from a string to a date
  3. Computed the date 20 days after the FIRST call_start_time
  4. Created a Person class to keep track of days_count and id
  5. Created a list to hold Person objects and populated the objects with data from the dataframe
  6. Printed the Person objects that reached 10+ sales calls within the 20-day time frame from start_date to end_date

I have tested my code and it works well. There can be improvements, but my main focus is achieving a good working solution. Let me know if you have any questions.

import pandas as pd
from datetime import timedelta

# prep data for dataframe
lst = {'call_start_time':['7/28/2015','8/10/2015','7/28/2015','7/28/2015'],
        'level':['1','0','1','1'],
        'id':['66547', '66272', '66547','66547']}

# create dataframe
df = pd.DataFrame(lst)

# convert the date strings to datetime values so that days can be added
# (assigning to `row` inside iterrows() would not modify df itself)
df['call_start_time'] = pd.to_datetime(df['call_start_time'], format='%m/%d/%Y')

# get the end date by adding 20 days to start day
df["end_of_20_days"] = df["call_start_time"] + timedelta(days=20)

# used below comment for testing might need it later
# df['Difference'] = (df['end_of_20_days'] - df['call_start_time']).dt.days

# created person class to keep track of days_count and id
class Person(object):
    def __init__(self, id, start_date, end_date):
        self.id = id
        self.start_date = start_date
        self.end_date = end_date
        self.days_count = 1

# create list to hold objects of person class
person_list = []

# populate person_list with Person objects and their attributes
for index, row in df.iterrows():
    # check whether a Person with this id already exists
    result_id = any(x.id == row['id'] for x in person_list)

    # first call seen for this id: create the Person with data from dataframe
    if not result_id:
        person_list.append(Person(row['id'], row['call_start_time'], row['end_of_20_days']))
    else:
        for x in person_list:
            # days remaining until the end of this person's 20 day time frame
            diff = (x.end_date - row['call_start_time']).days
            # count the call only if it falls inside the time frame
            # (not before the first call and not after end_date)
            if x.id == row['id'] and 0 <= diff <= 20:
                x.days_count += 1
                break

# flag to check if nobody hit the sales mark
flag = 0

# print out only person_list ids who have hit the sales mark
for person in person_list:
    if person.days_count >= 10:
        flag = 1
        print("person id:{} has made {} calls within 20 days of the first call date".format(person.id, person.days_count))

if flag == 0:
    print("No one has hit the sales mark")
