
Python function to count occurrences of each day of the month within a dataframe

Edited to add a payload example and the complete script. Edited again to modify the script and format my problem better.

I am creating a script to analyse payment cycles from a bank statement containing multiple payments. I work out the most frequent day of the week and the most frequent date of the month, then select whichever is higher: either a day of the week (along with its position and the frequency of payments within a given month) or a specific date.

Where it's a specific date, I expect the second and third most frequent dates to fall either side of it, in the months where the top date lands on a weekend or a public holiday.

I created a function to do a count of the days in a week without needing to sort the dataframe and to allow me to add that value as a column in the dataframe rather than it ending up as a list.

What do I need help with?

That was fine for working through 7 days and counting based on filters but, when doing it by the date of the month, my function for it has 31 if statements. How can I make this more concise and gain the same outcome?

Also, I have a problem with this warning, which I simply can't get to go away: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. Copy vs view: I really don't mind which one it uses, I just want to get rid of the warning one way or the other.

Script and Example Data

Below is the part of the script I need to make tidier:

# function to filter by date of month to perform a count of each

import pandas as pd
from datetime import datetime

df = pd.read_csv(r"C:\Users\mattl\OneDrive\Desktop\netflix - only.csv")

# Convert to a Date format here
df['new_date']=df['date'].apply(lambda x: datetime.strptime(x, '%d/%m/%Y'))

# Extend data frame with month, day of month and week day
df['month'] = df['new_date'].apply(lambda x: x.month)
df['dom'] = df['new_date'].apply(lambda x: x.day)
df['dow']=df['new_date'].apply(lambda x: x.strftime("%A"))

# function to filter by weekday to perform a count of each
def totalForWeekDay(weekDay):
    filter = df.dow.value_counts();
    #print(filter);
    if weekDay == 'Sunday':
        return filter['Sunday'];
    if weekDay == 'Monday':
        return filter['Monday'];
    if weekDay == 'Tuesday':
        return filter['Tuesday'];
    if weekDay == 'Wednesday':
        return filter['Wednesday'];
    if weekDay == 'Thursday':
        return filter['Thursday'];
    if weekDay == 'Friday':
        return filter['Friday'];
    if weekDay == 'Saturday':
        return filter['Saturday'];

# function to filter by date of month to perform a count of each
def totalForMonthDate(monthDay):
    filter = df.dom.value_counts();
    #print(filter);
    if monthDay == 31:
        return filter[31];
    if monthDay == 30:
        return filter[30];
    if monthDay == 29:
        return filter[29];
    if monthDay == 28:
        return filter[28];
    if monthDay == 27:
        return filter[27];
    if monthDay == 26:
        return filter[26];
    if monthDay == 25:
        return filter[25];
    if monthDay == 24:
        return filter[24];
    if monthDay == 23:
        return filter[23];
    if monthDay == 22:
        return filter[22];
    if monthDay == 21:
        return filter[21];
    if monthDay == 20:
        return filter[20];
    if monthDay == 19:
        return filter[19];
    if monthDay == 18:
        return filter[18];
    if monthDay == 17:
        return filter[17];
    if monthDay == 16:
        return filter[16];
    if monthDay == 15:
        return filter[15];
    if monthDay == 14:
        return filter[14];
    if monthDay == 13:
        return filter[13];
    if monthDay == 12:
        return filter[12];
    if monthDay == 11:
        return filter[11];
    if monthDay == 10:
        return filter[10];
    if monthDay == 9:
        return filter[9];
    if monthDay == 8:
        return filter[8];
    if monthDay == 7:
        return filter[7];
    if monthDay == 6:
        return filter[6];
    if monthDay == 5:
        return filter[5];
    if monthDay == 4:
        return filter[4];
    if monthDay == 3:
        return filter[3];
    if monthDay == 2:
        return filter[2];
    if monthDay == 1:
        return filter[1];

# Add column which calls the function resulting it total count of week_day
df['dow_total'] = df['dow'].apply(lambda row: totalForWeekDay(row));

# Add formula and column to dataframe which counts the month number
df['dom_total'] = df['dom'].apply(lambda row: totalForMonthDate(row));

# Show results
print(df)

if df["dom_total"].max() >= df["dow_total"].max():
    # Determine the top day of month result
    top_dom_tot = df.loc[df['dom_total'] == df['dom_total'].max()]

    # isolate the top day of month
    top_day_of_month = (top_dom_tot['dom'][0])
    print('Top day of month is:')
    print(top_day_of_month)

    # find dates in list where the date of month is NOT the highest number
    dfa = df.loc[df['dom'] != top_day_of_month]

    # Determine number of days forwards (positive) or back (negative)
    dfa['days_diff'] = df['dom'] - top_day_of_month
    print('Payments that are not related to the top day per month')
    print(dfa)

Now for the example csv payload:

type,party,date, debit , credit 
payment,Netflix,22/01/2021,-$19.99, $-   
payment,Netflix,22/02/2021,-$19.99, $-   
payment,Netflix,22/03/2021,-$19.99, $-   
payment,Netflix,22/04/2021,-$19.99, $-   
payment,Netflix,24/05/2021,-$19.99, $-   
payment,Netflix,22/06/2021,-$19.99, $-   
payment,Netflix,22/07/2021,-$19.99, $-   
payment,Netflix,23/08/2021,-$19.99, $-   
payment,Netflix,22/09/2021,-$19.99, $-   
payment,Netflix,22/10/2021,-$19.99, $-   

Thanks for your advice!

  1. In totalForMonthDate(), you can replace this series of if statements with two lines:

     def totalForMonthDate(monthDay):
         filter = df.dom.value_counts()
         return filter[monthDay]

    Of course, you're also running value_counts() once for every row in your dataframe, when it's the same for the whole dataframe. That's inefficient. You can replace this by doing value_counts() once and using map to translate the values:

     df['dom_total'] = df['dom'].map(df['dom'].value_counts())

    Not only is this shorter (1 line vs 4 lines) but it's faster too.

  2. You're getting a SettingWithCopyWarning because you're using .loc to filter down the dataframe, then modifying that filtered subset. The simplest way to fix this is to throw in a copy when you're subsetting the dataframe.

     dfa = df.loc[df['dom'] != top_day_of_month].copy()

    Note: because dfa is now an independent copy, the code afterward which adds a new column won't affect the original dataframe.
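Both points can be tried out on a small frame first. This is only a minimal sketch: the five `dom` values are made-up stand-ins for the real statement, and `counts` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'dom' column of the statement
df = pd.DataFrame({"dom": [22, 22, 24, 22, 23]})

# Count each day of month once, then look up every row's value in those counts
counts = df["dom"].value_counts()
df["dom_total"] = df["dom"].map(counts)

# Take an explicit copy of the filtered rows before adding a column to them
top_day_of_month = counts.idxmax()
dfa = df.loc[df["dom"] != top_day_of_month].copy()
dfa["days_diff"] = dfa["dom"] - top_day_of_month

print(df)
print(dfa)
```

Since `dfa` is a copy, `df` should come out of this without a `days_diff` column, and pandas has no reason to emit the warning.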

Here is the full source code:

import pandas as pd
import io
from datetime import datetime

s = """type,party,date, debit , credit 
payment,Netflix,22/01/2021,-$19.99, $-   
payment,Netflix,22/02/2021,-$19.99, $-   
payment,Netflix,22/03/2021,-$19.99, $-   
payment,Netflix,22/04/2021,-$19.99, $-   
payment,Netflix,24/05/2021,-$19.99, $-   
payment,Netflix,22/06/2021,-$19.99, $-   
payment,Netflix,22/07/2021,-$19.99, $-   
payment,Netflix,23/08/2021,-$19.99, $-   
payment,Netflix,22/09/2021,-$19.99, $-   
payment,Netflix,22/10/2021,-$19.99, $-   """

df = pd.read_csv(io.StringIO(s))

# Convert to a Date format here
df['new_date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

# Extend data frame with month, day of month and week day
df['month'] = df['new_date'].dt.month
df['dom'] = df['new_date'].dt.day
df['dow'] = df['new_date'].dt.strftime("%A")

# Add a column holding the total count for each row's week day
df['dow_total'] = df['dow'].map(df['dow'].value_counts())

# Add a column holding the total count for each row's day of month
df['dom_total'] = df['dom'].map(df['dom'].value_counts())

# Show results
print(df)

if df["dom_total"].max() >= df["dow_total"].max():
    # Determine the top day of month result
    top_dom_tot = df.loc[df['dom_total'] == df['dom_total'].max()]

    # isolate the top day of month
    top_day_of_month = top_dom_tot['dom'].iloc[0]
    print('Top day of month is:')
    print(top_day_of_month)

    # find dates in list where the date of month is NOT the highest number
    dfa = df.loc[df['dom'] != top_day_of_month].copy()

    # Determine number of days forwards (positive) or back (negative)
    dfa['days_diff'] = dfa['dom'] - top_day_of_month
    print('Payments that are not related to the top day per month')
    print(dfa)
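If only the most frequent day is needed, the per-row total columns could probably be skipped altogether. The sketch below is one possible variation, not part of the script above: it reuses the sample dates from the question, and the names `most_common_dom`, `most_common_dow`, `off_cycle` and `nominal` are introduced purely for illustration. It also checks the expectation from the question that the off-cycle payments belong to months where the usual date fell on a weekend.

```python
import pandas as pd

# Sample payment dates copied from the question's CSV payload
s = pd.Series(pd.to_datetime(
    ["22/01/2021", "22/02/2021", "22/03/2021", "22/04/2021", "24/05/2021",
     "22/06/2021", "22/07/2021", "23/08/2021", "22/09/2021", "22/10/2021"],
    format="%d/%m/%Y",
))

# Frequency tables for day of month and day of week
dom_counts = s.dt.day.value_counts()
dow_counts = s.dt.day_name().value_counts()

# idxmax returns the label (the day) that has the highest count
most_common_dom = dom_counts.idxmax()
most_common_dow = dow_counts.idxmax()
print(most_common_dom, dom_counts.max())
print(most_common_dow, dow_counts.max())

# Payments that did not land on the usual day of month
off_cycle = s[s.dt.day != most_common_dom]

# Rebuild the date each of them would normally have had in that month.
# Assumption: the usual day exists in every month (true for the 22nd;
# a day such as the 31st would need extra handling).
nominal = off_cycle.apply(lambda d: d.replace(day=int(most_common_dom)))
print(nominal.dt.day_name().tolist())
```

Public holidays are not covered here; they would need a holiday calendar for the relevant country.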

Anyways, hope that helped. Cool project!
