
creating a loop to find out number of sales within the first 20 days

I am new to Python and cannot figure out how to count the sales calls made in the 20 days after each salesperson's FIRST call. The question asks for the percentage of salespeople who made at least 10 sales calls in their first 20 days. Each row is a sales call, the salespeople are identified by the column id, and the call time is recorded in call_starttime.

The df is fairly simple and looks like this:

    id      call_starttime  level
0   66547   7/28/2015 23:18 1
1   66272   8/10/2015 20:48 0
2   66547   8/20/2015 17:32 2
3   66272   8/31/2015 18:21 0
4   66272   8/31/2015 20:25 0

I have already counted the number of calls per id and can filter out anyone who has not made at least 10 sales calls.

The code I am currently using is:

df_withcount=df.groupby(['cc_user_id','cc_cohort']).size().reset_index(name='count')
df_20andmore=df_withcount.loc[(df_withcount['count'] >= 20)]

I expect the output to give me the number of ids (salespeople) who made at least 10 calls in their first 20 days. As of now I can only figure out how to find those who made at least 10 calls over all time.

I assume that the call_starttime column is of datetime type.
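If the column still holds strings like the ones shown in the question, it has to be parsed first. The following is a minimal sketch, assuming the month/day/year hour:minute layout visible in the question's sample; the format string is my assumption and should be adjusted to the real data.

```python
import pandas as pd

# Sample rows copied from the question; the times are plain strings at this point
df = pd.DataFrame({
    "id": [66547, 66272, 66547, 66272, 66272],
    "call_starttime": ["7/28/2015 23:18", "8/10/2015 20:48", "8/20/2015 17:32",
                       "8/31/2015 18:21", "8/31/2015 20:25"],
    "level": [1, 0, 2, 0, 0],
})

# Parse the strings into datetime64 values (format assumed from the sample rows)
df["call_starttime"] = pd.to_datetime(df["call_starttime"], format="%m/%d/%Y %H:%M")
print(df.dtypes)
```

After this step, subtracting two values of the column yields a Timedelta, which is what the functions below rely on.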

Let's start with a simplified solution that checks only the second call (not the tenth).

I slightly changed your test data, so that the person with id = 66272 makes the second call within 20 days of the first (August 10 and 19):

      id      call_starttime  level
0  66547 2015-07-28 23:18:00      1
1  66272 2015-08-10 20:48:00      0
2  66547 2015-08-20 17:32:00      2
3  66272 2015-08-19 18:21:00      0
4  66272 2015-08-31 20:25:00      0

The first step is to define a function stating whether the current person is "active" (made the second call within 20 days of the first):

def active(grp):
    if grp.shape[0] < 2:
        return False  # Single call
    d0 = grp.call_starttime.iloc[0]
    d1 = grp.call_starttime.iloc[1]
    return (d1 - d0).days < 20

This function will be applied to each group of rows (for each person).

To get detailed information on activity of each person, you can run:

df.groupby('id').apply(active)

For my sample data the result is:

id
66272     True
66547    False
dtype: bool

But if you are interested only in the number of active people, use np.count_nonzero (with NumPy imported as np) on the above result:

np.count_nonzero(df.groupby('id').apply(active))

For my sample data the result is 1.

If you want the percentage of active people, divide this number by df.id.unique().size (multiplied by 100, to express the result in percent).
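The percentage step can be sketched as follows. This is only an illustration under a few assumptions: the sample data is rebuilt by hand, and active is adapted to receive one person's call_starttime values directly rather than the whole group. Taking the mean of the boolean result is equivalent to dividing the count of active people by the number of ids.

```python
import pandas as pd

# Modified sample data from this answer (id 66272 calls on August 10 and 19)
df = pd.DataFrame({
    "id": [66547, 66272, 66547, 66272, 66272],
    "call_starttime": pd.to_datetime([
        "2015-07-28 23:18", "2015-08-10 20:48", "2015-08-20 17:32",
        "2015-08-19 18:21", "2015-08-31 20:25"]),
    "level": [1, 0, 2, 0, 0],
})

def active(times):
    # times: the call_starttime values of one person, in row order
    if times.size < 2:
        return False  # Single call
    return (times.iloc[1] - times.iloc[0]).days < 20

flags = df.groupby("id")["call_starttime"].apply(active).astype(bool)
pct_active = flags.mean() * 100  # mean of booleans = share of True values
print(pct_active)
```

For the sample data one of the two people is active, so this prints 50.0.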

Now, how to change this solution to check whether a person has made at least 10 calls in the initial 20 days:

The only difference is that the active function should compare the dates of calls No. 0 and 9 (the first and the tenth call).

So this function should be changed to:

def active(grp):
    if grp.shape[0] < 10:
        return False  # Too few calls
    d0 = grp.call_starttime.iloc[0]
    d1 = grp.call_starttime.iloc[9]
    return (d1 - d0).days < 20

I assume that the source rows are ordered by call_starttime. If this is not the case, call sort_values(by='call_starttime') on the DataFrame first.
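Here is a hedged sketch of sorting before applying the 10-call check. The data is hypothetical (two made-up ids with deliberately shuffled rows), and the n_calls / n_days parameters are names I introduced for illustration.

```python
import pandas as pd

# Hypothetical data: id 1 calls on 10 consecutive days, id 2 calls every 3rd day
calls_1 = pd.date_range("2015-08-01 09:00", periods=10, freq="D")
calls_2 = pd.date_range("2015-08-01 09:00", periods=10, freq="3D")
df = pd.DataFrame({
    "id": [1] * 10 + [2] * 10,
    "call_starttime": list(calls_1) + list(calls_2),
}).sample(frac=1, random_state=0)  # shuffle the rows on purpose

def active(times, n_calls=10, n_days=20):
    # times: one person's call times, already sorted in ascending order
    if times.size < n_calls:
        return False  # Too few calls
    return (times.iloc[n_calls - 1] - times.iloc[0]).days < n_days

result = (df.sort_values(by="call_starttime")  # groupby keeps this order within each group
            .groupby("id")["call_starttime"]
            .apply(active))
print(result)
```

In this made-up data id 1 reaches the tenth call after 9 days and id 2 only after 27 days, so only id 1 is flagged as active.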

Edit following your comment

I came up with another solution that groups by the level column, does not require the source data to be sorted, and makes it easy to parametrize the number of initial days and the number of calls required in that period.

Test DataFrame:

      id      call_starttime  level
0  66547 2015-07-28 23:18:00      1
1  66272 2015-08-10 19:48:00      0
2  66547 2015-08-20 17:32:00      1
3  66272 2015-08-19 18:21:00      0
4  66272 2015-08-29 20:25:00      0
5  66777 2015-08-30 20:00:00      0

Level 0 contains one person (id 66272) with 3 calls within the first 20 days (August 10, 19 and 29). Note, however, that the last call is at a later hour than the first, so these two timestamps are actually more than 19 days apart. Since my solution clears the time component, only the dates are compared and this last call is counted.
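To make the effect of dropping the time component concrete, here is a small sketch using the first and last call of id 66272 from the test data. The variable names are only for illustration.

```python
import pandas as pd

first = pd.Timestamp("2015-08-10 19:48")
last = pd.Timestamp("2015-08-29 20:25")

raw_gap = last - first                           # full timestamps
date_gap = last.floor("D") - first.floor("D")    # dates only, time set to midnight

print(raw_gap)    # 19 days 00:37:00
print(date_gap)   # 19 days 00:00:00
```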

Start from defining a function:

def activity(grp, dayNo):
    stDates = grp.dt.floor('D')  # Drop the time component
    # Keep only the dates within the first "dayNo" days
    stDates = stDates[stDates < stDates.min() + pd.offsets.Day(dayNo)]
    return stDates.size

which returns the number of calls made by a particular person (a group of call_starttime values) within the first dayNo days.

The next function to define is:

def percentage(s, callNo):
    return s[s >= callNo].size * 100 / s.size

which computes the percentage of values in s (a Series for the current level) that are >= callNo.

The first processing step is to compute a Series holding the number of calls within the defined "starting period" for each level / id pair:

calls = df.groupby(['level', 'id']).call_starttime.apply(activity, dayNo=20)

The result (for my data) is:

level  id   
0      66272    3
       66777    1
1      66547    1
Name: call_starttime, dtype: int64

To get the final result (percentages for each level, assuming the requirement to make 3 calls), run:

calls.groupby(level=0).apply(percentage, callNo=3)

Note that level=0 above refers to the first level of the MultiIndex (by position), not to the column name.
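A small sketch of that distinction, with the per-person counts rebuilt by hand: grouping by position 0 and grouping by the index level's name ("level") select the same MultiIndex level here, because the first level happens to be named after the column.

```python
import pandas as pd

# The per-person counts shown above, rebuilt with a (level, id) MultiIndex
idx = pd.MultiIndex.from_tuples(
    [(0, 66272), (0, 66777), (1, 66547)], names=["level", "id"])
calls = pd.Series([3, 1, 1], index=idx, name="call_starttime")

by_position = calls.groupby(level=0).sum()     # first index level, by position
by_name = calls.groupby(level="level").sum()   # the same index level, by name

print(by_position.equals(by_name))
```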

The result (again for my data) is:

level
0    50.0
1     0.0
Name: call_starttime, dtype: float64

Level 0 has one person meeting the criterion (out of 2 people at this level), so the percentage is 50; at level 1 nobody meets the criterion, so the percentage is 0.

Note that the dayNo and callNo parameters make it easy to change the length of the "initial period" for each person and the number of calls required in this period.

The computation described above is for 3 calls, but in your case change callNo to your value, i.e. 10.
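For reference, the whole computation can be run end to end as in the sketch below. The test DataFrame is rebuilt by hand from the table above, and pct_by_level is a hypothetical wrapper I added so that both thresholds can be tried; it is not part of the solution itself.

```python
import pandas as pd

def activity(grp, dayNo):
    stDates = grp.dt.floor("D")  # Drop the time component
    # Keep only the dates within the first "dayNo" days
    stDates = stDates[stDates < stDates.min() + pd.offsets.Day(dayNo)]
    return stDates.size

def percentage(s, callNo):
    return s[s >= callNo].size * 100 / s.size

def pct_by_level(df, dayNo=20, callNo=10):
    # Hypothetical wrapper combining the two processing steps
    calls = df.groupby(["level", "id"])["call_starttime"].apply(activity, dayNo=dayNo)
    return calls.groupby(level=0).apply(percentage, callNo=callNo)

df = pd.DataFrame({
    "id": [66547, 66272, 66547, 66272, 66272, 66777],
    "call_starttime": pd.to_datetime([
        "2015-07-28 23:18", "2015-08-10 19:48", "2015-08-20 17:32",
        "2015-08-19 18:21", "2015-08-29 20:25", "2015-08-30 20:00"]),
    "level": [1, 0, 1, 0, 0, 0],
})

three_calls = pct_by_level(df, dayNo=20, callNo=3)   # the check used in this answer
ten_calls = pct_by_level(df, dayNo=20, callNo=10)    # the requirement from the question
print(three_calls)
print(ten_calls)
```

With the tiny test data nobody reaches 10 calls, so the second result is 0 for both levels.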

As you can see this solution is quite short (only 8 lines of code), much shorter and much more "Pandasonic" than the other solution.

If you prefer a "terse" coding style, you can also do the whole computation in a single (although significantly chained) instruction:

df.groupby(['level', 'id']).call_starttime\
    .apply(activity, dayNo=20).rename('Percentage')\
    .groupby(level=0).apply(percentage, callNo=3)

I added .rename('Percentage') to change the name of the result Series .

I used a Person Class to help solve this problem.

  1. Created a dataframe
  2. Converted call_start_time from a string to a date type
  3. Computed the date 20 days after the FIRST call_start_time
  4. Created a Person class to keep track of days_count and id
  5. Created a list to hold Person objects and populated the objects with data from the dataframe
  6. Printed the Person objects that hit 10+ sales calls within the 20-day time frame from start_date to end_date

I have tested my code and it works well. It could be improved, but my main focus was on getting a working solution. Let me know if you have any questions.

import pandas as pd
from datetime import timedelta
import datetime
import numpy as np

# prep data for dataframe
lst = {'call_start_time':['7/28/2015','8/10/2015','7/28/2015','7/28/2015'],
        'level':['1','0','1','1'],
        'id':['66547', '66272', '66547','66547']}

# create dataframe
df = pd.DataFrame(lst)

# convert the date strings to datetime values so that days can be added and subtracted
# (assigning to row inside iterrows() would only change a copy, not df)
df['call_start_time'] = pd.to_datetime(df['call_start_time'], format="%m/%d/%Y")

# get the end date by adding 20 days to start day
df["end_of_20_days"] = df["call_start_time"] + timedelta(days=20)

# used below comment for testing might need it later
# df['Difference'] = (df['end_of_20_days'] - df['call_start_time']).dt.days

# created person class to keep track of days_count and id
class Person(object):
    def __init__(self, id, start_date, end_date):
        self.id = id
        self.start_date = start_date
        self.end_date = end_date
        self.days_count = 1

# create list to hold objects of person class
person_list = []

# populate person_list with Person objects and their attributes
for index, row in df.iterrows():
    # get result_id to use as conditional for populating Person objects
    result_id = any(x.id == row['id'] for x in person_list)

    # initialize Person objects and inject with data from dataframe
    if len(person_list) == 0:
        person_list.append(Person(row['id'], row['call_start_time'], row['end_of_20_days']))
    elif not(result_id):
        person_list.append(Person(row['id'], row['call_start_time'], row['end_of_20_days']))
    else:
        for x in person_list:
            # count the call only if it falls inside this person's 20-day window
            # (a negative diff means the call came after end_date)
            diff = (x.end_date - row['call_start_time']).days
            if x.id == row['id'] and 0 <= diff <= 20:
                x.days_count += 1
                break

# flag to check if nobody hit the sales mark
flag = 0

# print out only person_list ids who have hit the sales mark
for person in person_list:
    if person.days_count >= 10:
        flag = 1
        print("person id:{} has made {} calls within 20 days of the first call date".format(person.id, person.days_count))

if flag == 0:
    print("No one has hit the sales mark")
