Comparing two columns using Pandas (or numpy) and calculate percentage difference

Disclaimer: I'm learning to develop in Python and I know my code is probably rough, but I plan to keep improving as I build programs.

I'm building a scraper that checks the prices of specific flights daily with Selenium, and that part of the code is already done. Origin, destination, first flight date, second flight date and price are saved every day. I save that data to a file and then want to check whether any price has changed.

My aim is to detect when a price changes by more than X percent and print a message for every compared flight where that happens.

import pandas as pd
import os.path
import numpy as np

#These are just sample data before integrating Selenium values
price = 230
departuredate = '20/02/2020'
returndate = '20/02/2020'
fromm = 'BOS'
to = 'JFK'

price2 = 630
departuredate2 = '20/02/2020'
returndate2 = '20/02/2020'
fromm2= 'CDG'
to2= 'JFK'
#End of sample data


flightdata = {'From': [fromm, fromm2], 'To': [to,to2], 'Departure date': [departuredate,departuredate2], 'Return date': [returndate,returndate2], 'Price': [price,price2]}

df = pd.DataFrame(flightdata, columns= ['From', 'To', 'Departure date', 'Return date', 'Price'])


#Check if the script is running for the first time
if os.path.exists('flightstoday.xls'):
    #os.replace overwrites an existing flightsyesterday.xls, so no separate os.remove is needed
    #(os.remove raised FileNotFoundError on the second run, before that file existed)
    os.replace('flightstoday.xls', 'flightsyesterday.xls') #Keep the flights scraped on the previous run
#Write today's flights; index=False avoids an extra 'Unnamed: 0' column when the file is read back
df.to_csv('flightstoday.xls', index=False, sep='\t')

#Work with two dataframes
flightsyesterday = pd.read_csv("flightsyesterday.xls",sep='\t') 
flightstoday = pd.read_csv("flightstoday.xls",sep='\t')

What I'm missing is how to compare the 'Price' column and print a message saying that the flight in a given row (identified by 'From', 'To', 'Departure date' and 'Return date') has changed by X percent.

I have tried the code below, but it only adds a column with the price difference to flightstoday, not the percentage, and of course it doesn't print anything when a price changes.

flightstoday['PriceDiff'] = np.where(flightstoday['Price'] == flightsyesterday['Price'], 0, flightstoday['Price'] - flightsyesterday['Price'])

Any help for this newbie will be greatly appreciated. Thank you!

From what I've gathered, I think this is what you're intending to do.

import pandas as pd
import os.path
import numpy as np

# These are just sample data before integrating Selenium values
price = 230
departuredate = '20/02/2020'
returndate = '20/02/2020'
fromm = 'BOS'
to = 'JFK'

price2 = 630
departuredate2 = '20/02/2020'
returndate2 = '20/02/2020'
fromm2 = 'CDG'
to2 = 'JFK'

# Create second set of prices
price3 = 250
price4 = 600

# Generate data to construct DataFrames
today_flightdata = {'From': [fromm, fromm2], 'To': [to, to2], 'Departure date': [
    departuredate, departuredate2], 'Return date': [returndate, returndate2], 'Price': [price, price2]}
yesterday_flightdata = {'From': [fromm, fromm2], 'To': [to, to2], 'Departure date': [
    departuredate, departuredate2], 'Return date': [returndate, returndate2], 'Price': [price3, price4]}

# Create dataframes for yesterday and today
today = pd.DataFrame(today_flightdata, columns=[
                     'From', 'To', 'Departure date', 'Return date', 'Price'])
yesterday = pd.DataFrame(yesterday_flightdata, columns=[
                         'From', 'To', 'Departure date', 'Return date', 'Price'])

# Determine changes
today['price_change'] = (
    today['Price'] - yesterday['Price']) / yesterday['Price'] * 100.

# Determine index labels of all rows where |price_change| >= threshold (in percent)
threshold = 1.0
today['exceeds_threshold'] = abs(today['price_change']) >= threshold
exceed_indices = today['exceeds_threshold'][today['exceeds_threshold']].index

# Print out those entries that exceed threshold
for idx in exceed_indices:
    row = today.loc[idx]  # exceed_indices holds index labels, so look them up with .loc
    print('Flight from {} to {} leaving on {} and returning on {} has changed by {}%'.format(
        row['From'], row['To'], row['Departure date'], row['Return date'], row['price_change']))

Output:

Flight from BOS to JFK leaving on 20/02/2020 and returning on 20/02/2020 has changed by -8.0%
Flight from CDG to JFK leaving on 20/02/2020 and returning on 20/02/2020 has changed by 5.0%
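
One caveat: the subtraction above pairs rows by index, so it assumes yesterday's and today's data list the same flights in the same order. If that might not hold for the scraped files, a possible variation is to merge on the four identifying columns first. This is only a sketch; the helper name price_changes, the KEYS list and the sample frames are made up for illustration and are not part of the original code.

```python
import pandas as pd

# Columns that identify a flight (taken from the question's DataFrame)
KEYS = ['From', 'To', 'Departure date', 'Return date']


def price_changes(yesterday, today, threshold=1.0):
    """Hypothetical helper: flights whose price moved by >= threshold percent."""
    # Inner merge keeps only flights present on both days and pairs them by key,
    # regardless of row order
    merged = today.merge(yesterday, on=KEYS, suffixes=('_today', '_yesterday'))
    merged['price_change'] = (
        (merged['Price_today'] - merged['Price_yesterday'])
        / merged['Price_yesterday'] * 100.0
    )
    return merged[merged['price_change'].abs() >= threshold]


# Sample data with the same prices as above
yesterday = pd.DataFrame({
    'From': ['BOS', 'CDG'], 'To': ['JFK', 'JFK'],
    'Departure date': ['20/02/2020', '20/02/2020'],
    'Return date': ['20/02/2020', '20/02/2020'],
    'Price': [250, 600]})
# Same flights, deliberately listed in the opposite order
today = pd.DataFrame({
    'From': ['CDG', 'BOS'], 'To': ['JFK', 'JFK'],
    'Departure date': ['20/02/2020', '20/02/2020'],
    'Return date': ['20/02/2020', '20/02/2020'],
    'Price': [630, 230]})

changed = price_changes(yesterday, today)
for _, row in changed.iterrows():
    print('Flight from {} to {} leaving on {} and returning on {} has changed by {:+.1f}%'.format(
        row['From'], row['To'], row['Departure date'], row['Return date'], row['price_change']))
```

With the files from the question, the two frames returned by pd.read_csv(..., sep='\t') could be passed in place of the sample frames.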

I learned the syntax to calculate exceed_indices from this post
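
For what it's worth, the index lookup is not strictly needed: a boolean mask can filter the rows directly. A small sketch with made-up data (the LHR row and the variable names changes, mask and exceeding are only for illustration):

```python
import pandas as pd

# Made-up percentage changes; the LHR row is hypothetical and stays below the threshold
changes = pd.DataFrame({
    'From': ['BOS', 'CDG', 'LHR'],
    'To': ['JFK', 'JFK', 'JFK'],
    'price_change': [-8.0, 5.0, 0.5],
})
threshold = 1.0

# Boolean mask: True where the absolute change reaches the threshold
mask = changes['price_change'].abs() >= threshold
exceeding = changes[mask]

for _, row in exceeding.iterrows():
    print('{} to {} changed by {:+.1f}%'.format(
        row['From'], row['To'], row['price_change']))
```

Using Series.abs() rather than the built-in abs() is a style choice; both work on a pandas Series.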
