Compare two columns and compute the percentage difference using Pandas (or numpy)

Question

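Disclaimer: I'm learning to develop in Python, and I know this way of coding may well be garbage, but I plan to keep improving it as I build the program.

So I'm trying to build a scraper that checks specific flight prices every day with Selenium, and that part of the code is already done. The origin, destination, departure date, return date, and price will be saved each day. I save this data to a file and then compare it to see whether any price has changed.

My goal is to detect when a price has changed by more than X percent and then print a message for each compared flight.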

import pandas as pd
import os.path
import numpy as np

#These are just sample data before integrating the Selenium values
price = 230
departuredate = '20/02/2020'
returndate = '20/02/2020'
fromm = 'BOS'
to = 'JFK'

price2 = 630
departuredate2 = '20/02/2020'
returndate2 = '20/02/2020'
fromm2= 'CDG'
to2= 'JFK'
#End of sample data


flightdata = {'From': [fromm, fromm2], 'To': [to,to2], 'Departure date': [departuredate,departuredate2], 'Return date': [returndate,returndate2], 'Price': [price,price2]}

df = pd.DataFrame(flightdata, columns= ['From', 'To', 'Departure date', 'Return date', 'Price'])


#Check whether the script has run before
if os.path.exists('flightstoday.xls'):
    #Keep the previous scrape as yesterday's data; os.replace overwrites any older copy
    os.replace('flightstoday.xls', 'flightsyesterday.xls')
#Write today's scrape (tab-separated text, despite the .xls extension)
df.to_csv('flightstoday.xls', mode='w', header=True, sep='\t')

#Work with two dataframes
flightsyesterday = pd.read_csv("flightsyesterday.xls",sep='\t') 
flightstoday = pd.read_csv("flightstoday.xls",sep='\t')

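What I'm missing is how to compare the 'Price' column and print a message saying that, for the row with a given 'From', 'To', 'Departure date', and 'Return date', the flight price has changed by X percent.

I tried the code below, but it only adds a column to the flightstoday data with the absolute difference rather than the percentage, and of course it doesn't print anything when a price changes.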

flightstoday['PriceDiff'] = np.where(flightstoday['Price'] == flightsyesterday['Price'], 0, flightstoday['Price'] - flightsyesterday['Price'])

Any help for this newbie would be greatly appreciated. Thanks!

Answer 1

From what I gather, I think this is what you intend to do.

import pandas as pd

# These are just sample data before integrating the Selenium values
price = 230
departuredate = '20/02/2020'
returndate = '20/02/2020'
fromm = 'BOS'
to = 'JFK'

price2 = 630
departuredate2 = '20/02/2020'
returndate2 = '20/02/2020'
fromm2 = 'CDG'
to2 = 'JFK'

# Create second set of prices
price3 = 250
price4 = 600

# Generate data to construct DataFrames
today_flightdata = {'From': [fromm, fromm2], 'To': [to, to2], 'Departure date': [
    departuredate, departuredate2], 'Return date': [returndate, returndate2], 'Price': [price, price2]}
yesterday_flightdata = {'From': [fromm, fromm2], 'To': [to, to2], 'Departure date': [
    departuredate, departuredate2], 'Return date': [returndate, returndate2], 'Price': [price3, price4]}

# Create dataframes for yesterday and today
today = pd.DataFrame(today_flightdata, columns=[
                     'From', 'To', 'Departure date', 'Return date', 'Price'])
yesterday = pd.DataFrame(yesterday_flightdata, columns=[
                         'From', 'To', 'Departure date', 'Return date', 'Price'])

# Determine changes
today['price_change'] = (
    today['Price'] - yesterday['Price']) / yesterday['Price'] * 100.

# Determine indices of all rows where abs(price_change) >= threshold (in percent)
threshold = 1.0
today['exceeds_threshold'] = today['price_change'].abs() >= threshold
exceed_indices = today['exceeds_threshold'][today['exceeds_threshold']].index

# Print out those entries that exceed threshold
for idx in exceed_indices:
    # exceed_indices holds index labels, so look rows up with .loc rather than .iloc
    row = today.loc[idx]
    print('Flight from {} to {} leaving on {} and returning on {} has changed by {}%'.format(
        row['From'], row['To'], row['Departure date'], row['Return date'], row['price_change']))

Output:

Flight from BOS to JFK leaving on 20/02/2020 and returning on 20/02/2020 has changed by -8.0%
Flight from CDG to JFK leaving on 20/02/2020 and returning on 20/02/2020 has changed by 5.0%
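
Note that the subtraction above pairs rows by position, so it only works while both DataFrames list the same flights in the same order. Here is a minimal sketch of a key-based comparison instead, assuming the four descriptive columns together identify a flight. The helper name `price_changes` and the `_today`/`_yesterday` suffixes are made up for this example and are not part of the original code:

```python
import pandas as pd

# Assumption: these four columns together uniquely identify a flight
KEYS = ['From', 'To', 'Departure date', 'Return date']

def price_changes(today, yesterday, threshold=1.0):
    """Return flights whose price moved by at least `threshold` percent."""
    # Inner merge keeps only flights that appear on both days
    merged = today.merge(yesterday, on=KEYS, suffixes=('_today', '_yesterday'))
    merged['price_change'] = (
        (merged['Price_today'] - merged['Price_yesterday'])
        / merged['Price_yesterday'] * 100.0
    )
    return merged[merged['price_change'].abs() >= threshold]

today = pd.DataFrame({
    'From': ['BOS', 'CDG'], 'To': ['JFK', 'JFK'],
    'Departure date': ['20/02/2020', '20/02/2020'],
    'Return date': ['20/02/2020', '20/02/2020'],
    'Price': [230, 630],
})
# Same flights, deliberately listed in a different order
yesterday = pd.DataFrame({
    'From': ['CDG', 'BOS'], 'To': ['JFK', 'JFK'],
    'Departure date': ['20/02/2020', '20/02/2020'],
    'Return date': ['20/02/2020', '20/02/2020'],
    'Price': [600, 250],
})

changed = price_changes(today, yesterday)
for _, row in changed.iterrows():
    print('Flight from {} to {} has changed by {:.1f}%'.format(
        row['From'], row['To'], row['price_change']))
```

Flights that exist on only one of the two days are dropped by the inner merge; passing `how='outer'` to `merge` would keep them, with missing prices as NaN.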

I learned the syntax for computing exceed_indices from this post.
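
The `exceed_indices` lookup can also be skipped by filtering with the boolean Series directly. A short sketch on made-up numbers; the column names follow the answer's code, but the data here is purely illustrative:

```python
import pandas as pd

# Illustrative data only: one change above the threshold, one below
today = pd.DataFrame({
    'From': ['BOS', 'CDG'],
    'To': ['JFK', 'JFK'],
    'price_change': [-8.0, 0.5],
})
threshold = 1.0

# A boolean Series used as a mask selects the matching rows in one step
exceeding = today[today['price_change'].abs() >= threshold]

messages = [
    'Flight from {} to {} has changed by {:.1f}%'.format(
        row['From'], row['To'], row['price_change'])
    for _, row in exceeding.iterrows()
]
print('\n'.join(messages))
```

This way there is no separate index to look up, so it does not matter whether the DataFrame uses the default integer index.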


Answer 1: accepted, 2020-06-19 20:57:21
