简体   繁体   中英

combine two complete rows if certain criteria is met

I've been able to extract data from two separate xlsx and combine them into a single xlsx sheet using pandas.

I know have a table that looks like this.

Home                Start Date     Gross Earning Tax Gross Rental Commission Net Rental
3157    2020-03-26 00:00:00 -268.8  -28.8   -383.8  -36 -338.66
3157    2020-03-26 00:00:00 268.8   28.8    153.8   36  108.66
3157    2020-03-24 00:00:00 264.32  28.32   149.32  35.4    104.93
3157    2020-03-13 00:00:00 625.46  67.01   510.46  83.7675 405.4225
3157    2020-03-13 00:00:00 558.45  0   443.45  83.7675 342.9325
3157    2020-03-11 00:00:00 142.5   0   27.5    21.375  1.855
3157    2020-03-11 00:00:00 159.6   17.1    44.6    21.375  17.805
3157    2020-03-03 00:00:00 349.52  0   234.52  52.428  171.612
3157    2020-03-03 00:00:00 391.46  41.94   276.46  52.428  210.722

在此处输入图片说明

So if you take a look at the first two rows, the name in the Home column is the same (In this example, 3157 Tocoa) but they are also the same for the next few rows. But in the Start date column, only the first two items in that column are the same (In this case 3/26/2020 12:00:00 AM) So what i need to do is the following

If the dates are the same, and the Home is the same, then I need the sum of all of the following columns. (In this case, I would need the sum of -268.8 and 268.8, the sum of -28.8 and 28.8 and so on) It is also important to mention there are instances where there is a total of more than two matching start dates.

I will include the code I have used to get to where I am now, I would like to mention I am fairly new to python so I'm sure there is a way to do this super simple but I am just not familiar. I am also new to stackoverflow so if I am missing something or added something I should have please forgive me

import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
import numpy as np
import matplotlib.pyplot as plt
import os

# class airbnb:

#Gets the location path for the reports that come raw from the channel
airbnb_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with 
python\Data_to_read\Bnb_feb_report.xlsx')
empty_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with 
python\Data_to_read\empty.xlsx')

#Defines the data frame
df_airbnb = pd.read_excel(airbnb_excel_file)
df_empty = pd.read_excel(empty_excel_file)

gross_earnings = df_airbnb['Gross Earnings']
tax_amount = df_airbnb['Gross Earnings'] * 0.06
gross_rental = df_airbnb['Gross Earnings'] - df_airbnb['Cleaning Fee']
com = ((gross_rental - tax_amount) + df_airbnb['Cleaning Fee']) * 0.15
net_rental = (gross_rental - (com + df_airbnb['Host Fee']))
house = df_airbnb['Listing']
start_date = df_airbnb['Start Date']

# df = pd.DataFrame(df_empty)
# df_empty.replace('nan', '')
#
# print(net_rental)

df_report = pd.DataFrame(
    {'Home': house, 'Start Date': start_date, 'Gross Earning': gross_earnings, 'Tax': tax_amount,
     'Gross Rental': gross_rental, 'Commission': com, 'Net Rental': net_rental})

df_report.loc[(df_report.Home == 'New house, Minutes from Disney & Attraction'), 'Home'] = '3161 
Tocoa'
df_report.loc[(df_report.Home == 'Brand-New House, located minutes from Disney 5151'), 'Home'] = 
'5151 Adelaide'
df_report.loc[(df_report.Home == 'Luxury House, Located Minutes from Disney-World 57'), 'Home'] = 
'3157 Tocoa'
df_report.loc[(df_report.Home == 'Big house, Located Minutes from Disney-World 55'), 'Home'] = '3155 
Tocoa'

df_report.sort_values(by=['Home'], inplace=True)

# writer = ExcelWriter('Final_Report.xlsx')
# df_report.to_excel(writer, 'sheet1', index=False)
# writer.save()



# class homeaway:
homeaway_excel_file = (r'C:\Users\Christopher\PycharmProjects\Reporting with 
python\Data_to_read\PayoutSummaryReport2020-03-01_2020-03-29.xlsx')
df_homeaway = pd.read_excel(homeaway_excel_file)

cleaning = int(115)

house = df_homeaway['Address']
start_date = df_homeaway['Check-in']
gross_earnings = df_homeaway['Gross booking amount']
taxed_amount = df_homeaway['Lodging Tax Owner Remits']
gross_rental = (gross_earnings - cleaning)
com = ((gross_rental-taxed_amount) + cleaning) * 0.15
net_rental = (gross_rental - (com + df_homeaway['Deductions']))

df_report2 = pd.DataFrame(
    {'Home': house, 'Start Date': start_date, 'Gross Earning': gross_earnings, 'Tax': taxed_amount,
     'Gross Rental': gross_rental, 'Commission': com, 'Net Rental': net_rental})

# writer = ExcelWriter('Final_Report2.xlsx')
# df_report2.to_excel(writer, 'sheet1', index=False)
# writer.save()


df_combined = pd.concat([df_report, df_report2])

writer = ExcelWriter('Final_Report_combined.xlsx')
df_report2.to_excel(writer, 'sheet1', index=False)
writer.save()

One of possible approaches is to group by Home and Start Date and then compute sum of rows involved:

df.groupby(['Home', 'Start Date']).sum()

Fortunately, all "other" columns are numeric, so no column specification is needed.

But if there are more than 2 rows with same Home and Start Date and you want to:

  • break them into pairs of consecutive rows,
  • and then compute their sums (for each pair separately),

you should apply a "2-tier" grouping:

  • first tier - group by Home and Start Date (as before),
  • second tier - group into pairs,

and compute sums for each second-level group.

In this case the code should be:

df.groupby(['Home', 'Start Date']).apply(
    lambda grp: grp.groupby(np.arange(len(grp.index)) // 2).sum())\
    .reset_index(level=-1, drop=True)

Additional operation required here is to drop the last level of the index ( reset_index ).

To test this approach, eg add the following row to your DataFrame:

1234 Bogus Street,2020-03-26 00:00:00,20.0,2.0,15.0,3,10.0

so that 1234 Bogus Street / 2020-03-26 00:00:00 group now contains three rows.

When you run the above code, you will get:

                                       Gross Earning    Tax  Gross Rental  Commission  Net Rental
Home              Start Date                                                                     
1234 Bogus Street 2020-03-03 00:00:00         740.98  41.94        510.98     104.856     382.334
                  2020-03-11 00:00:00         302.10  17.10         72.10      42.750      19.660
                  2020-03-13 00:00:00        1183.91  67.01        953.91     167.535     748.355
                  2020-03-24 00:00:00         264.32  28.32        149.32      35.400     104.930
                  2020-03-26 00:00:00           0.00   0.00       -230.00       0.000    -230.000
                  2020-03-26 00:00:00          20.00   2.00         15.00       3.000      10.000

Note the last row. It contains:

  • repeated Start Date (from the previous row),
  • values from the added row.

And the last but one row contains sums for only two first rows with respective Home / Start Date .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM