Comparing two columns using Pandas (or numpy) and calculating the percentage difference

Disclaimer: I'm learning to develop in Python, and I know my way of coding is probably rough, but I plan to keep improving as I build programs.

I'm building a scraper with Selenium that checks the prices of specific flights daily, and that part of the code is already done. The origin, destination, first flight date, second flight date and price are saved every day. I save that data to a file and then check whether any price has changed.

My aim is to detect whether a price has changed by more than X percent and, if so, to print a message for every compared flight.

import pandas as pd
import os.path
import numpy as np

# These are just sample data before integrating Selenium values
price = 230
departuredate = '20/02/2020'
returndate = '20/02/2020'
fromm = 'BOS'
to = 'JFK'

price2 = 630
departuredate2 = '20/02/2020'
returndate2 = '20/02/2020'
fromm2= 'CDG'
to2= 'JFK'
#End of sample data


flightdata = {'From': [fromm, fromm2], 'To': [to,to2], 'Departure date': [departuredate,departuredate2], 'Return date': [returndate,returndate2], 'Price': [price,price2]}

df = pd.DataFrame(flightdata, columns= ['From', 'To', 'Departure date', 'Return date', 'Price'])


# If a file from a previous run exists, it becomes yesterday's file
if os.path.exists('flightstoday.xls'):
    # os.replace overwrites any older flightsyesterday.xls and does not fail if there is none
    os.replace('flightstoday.xls', 'flightsyesterday.xls')
# Write today's scraped flights to a fresh file
df.to_csv('flightstoday.xls', mode='w', header=True, sep='\t')

#Work with two dataframes
flightsyesterday = pd.read_csv("flightsyesterday.xls",sep='\t') 
flightstoday = pd.read_csv("flightstoday.xls",sep='\t')

What I'm missing is how to compare the 'Price' column and print a message saying that, for the row with a given 'From', 'To', 'Departure date' and 'Return date', the flight price has changed by X percent.

I have tried this code, but it only adds a column to the flightstoday data, not the percentage, and of course it doesn't print whether there was any change in price.

flightstoday['PriceDiff'] = np.where(flightstoday['Price'] == flightsyesterday['Price'], 0, flightstoday['Price'] - flightsyesterday['Price'])

Any help for this newbie will be greatly appreciated. Thank you!

From what I've gathered, I think this is what you're intending to do.

import pandas as pd
import os.path
import numpy as np

# These are just sample data before integrating Selenium values
price = 230
departuredate = '20/02/2020'
returndate = '20/02/2020'
fromm = 'BOS'
to = 'JFK'

price2 = 630
departuredate2 = '20/02/2020'
returndate2 = '20/02/2020'
fromm2 = 'CDG'
to2 = 'JFK'

# Create second set of prices
price3 = 250
price4 = 600

# Generate data to construct DataFrames
today_flightdata = {'From': [fromm, fromm2], 'To': [to, to2], 'Departure date': [
    departuredate, departuredate2], 'Return date': [returndate, returndate2], 'Price': [price, price2]}
yesterday_flightdata = {'From': [fromm, fromm2], 'To': [to, to2], 'Departure date': [
    departuredate, departuredate2], 'Return date': [returndate, returndate2], 'Price': [price3, price4]}

# Create dataframes for yesterday and today
today = pd.DataFrame(today_flightdata, columns=[
                     'From', 'To', 'Departure date', 'Return date', 'Price'])
yesterday = pd.DataFrame(yesterday_flightdata, columns=[
                         'From', 'To', 'Departure date', 'Return date', 'Price'])

# Determine changes
today['price_change'] = (
    today['Price'] - yesterday['Price']) / yesterday['Price'] * 100.

# Determine indices of all rows where |price_change| >= threshold (threshold is in percent)
threshold = 1.0
today['exceeds_threshold'] = abs(today['price_change']) >= threshold
exceed_indices = today['exceeds_threshold'][today['exceeds_threshold']].index

# Print out those entries that exceed threshold
for idx in exceed_indices:
    # exceed_indices holds index labels, so look rows up with .loc rather than .iloc
    row = today.loc[idx]
    print('Flight from {} to {} leaving on {} and returning on {} has changed by {}%'.format(
        row['From'], row['To'], row['Departure date'], row['Return date'], row['price_change']))

Output:

Flight from BOS to JFK leaving on 20/02/2020 and returning on 20/02/2020 has changed by -8.0%
Flight from CDG to JFK leaving on 20/02/2020 and returning on 20/02/2020 has changed by 5.0%
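The subtraction above lines the two DataFrames up by row position, which only works if yesterday's and today's files list the same flights in the same order. If that can't be guaranteed, one option is to merge on the columns that identify a flight first. The following is a minimal sketch, not part of the original answer: `KEY_COLUMNS`, `price_changes` and the 6.0 threshold are names and values made up for illustration, and it assumes the four key columns together identify a flight uniquely.

```python
import pandas as pd

# Assumption: these four columns together identify one flight
KEY_COLUMNS = ['From', 'To', 'Departure date', 'Return date']


def price_changes(yesterday, today, threshold=1.0):
    """Return the flights whose price moved by at least `threshold` percent."""
    # Inner merge keeps only flights present on both days, matched by key, not by row order
    merged = today.merge(yesterday, on=KEY_COLUMNS,
                         suffixes=('_today', '_yesterday'))
    merged['price_change'] = (
        (merged['Price_today'] - merged['Price_yesterday'])
        / merged['Price_yesterday'] * 100.0
    )
    return merged[merged['price_change'].abs() >= threshold]


# Sample data: same flights as above, but yesterday's rows are in a different order
today = pd.DataFrame({
    'From': ['BOS', 'CDG'], 'To': ['JFK', 'JFK'],
    'Departure date': ['20/02/2020', '20/02/2020'],
    'Return date': ['20/02/2020', '20/02/2020'],
    'Price': [230, 630],
})
yesterday = pd.DataFrame({
    'From': ['CDG', 'BOS'], 'To': ['JFK', 'JFK'],
    'Departure date': ['20/02/2020', '20/02/2020'],
    'Return date': ['20/02/2020', '20/02/2020'],
    'Price': [600, 250],
})

changed = price_changes(yesterday, today, threshold=6.0)
for _, row in changed.iterrows():
    print('Flight from {} to {} leaving on {} and returning on {} has changed by {:.1f}%'.format(
        row['From'], row['To'], row['Departure date'], row['Return date'], row['price_change']))
```

The same function should work on the two DataFrames read back with `pd.read_csv(..., sep='\t')`, since the merge only looks at the key columns and 'Price'.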

I learned the syntax for calculating exceed_indices from this post.
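For anyone puzzled by that line: indexing a boolean Series with itself keeps only the entries that are True, so its `.index` holds the labels of the rows that passed the test. Here is a small sketch with made-up numbers (the `price_change` values and the variable names are illustrative only), showing that idiom next to what I believe is an equivalent, slightly more direct form.

```python
import pandas as pd

# Hypothetical percentage changes for three flights
price_change = pd.Series([-8.0, 5.0, 0.4], name='price_change')
threshold = 1.0

# Boolean Series: True where the absolute change reaches the threshold
mask = price_change.abs() >= threshold

# Idiom used in the answer: filter the mask by itself, then take the surviving labels
via_self_index = mask[mask].index

# Alternative: filter the index directly with the boolean values
via_index = price_change.index[mask.to_numpy()]

print(list(via_self_index))
print(list(via_index))
```

Both expressions give the index labels of the rows that reach the threshold, which can then be passed to `.loc`.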
