Pandas dataframe, organization and calculation
I need help. I've spent all day (14+ hours) trying to set up a pandas DataFrame for a test file that I made. My real file is a CSV that is several million lines long, so I am trying to find the fastest and most effective way of handling the data. What I need to do is calculate year-over-year changes in prices for a list of items.
The data I have looks like this after I drop the unneeded columns:
Item      Price       As of Date
Item 1    1.08908     4/13/2016
Item 2    2.03281     4/13/2016
Item 3    3.02619     4/13/2016
Item 1    1.56743     12/21/2015
Item 3    12.31867    12/21/2015
Item 2    0.98066     12/21/2015
Item 4    0.31701     12/21/2015
Item 3    0.6251      3/31/2015
Item 1    6.87538     3/31/2015
Item 2    0.3113      3/31/2015
Item 4    0.18724     3/31/2015
First, I need to get the data into a shape that lets me make the year-over-year calculation. It is arranged with columns for the Item, Price, and the As of Date. For each date given, I need to calculate the year-over-year percentage change in price for each item listed, and then find the average of those changes per date.
Below is what I have tried to do to arrange the data, but I am having trouble figuring out which way is best and then how to calculate the y/y change.
import pandas as pd

df = pd.read_csv('...python test file.csv')
#parse the dates so they sort chronologically rather than as strings
df['As of Date'] = pd.to_datetime(df['As of Date'], format='%m/%d/%Y')
#sorting the dataframe chronologically by As of Date
#(sort_values returns a new frame, so the result has to be assigned)
df = df.sort_values('As of Date')
#unique As of Dates in chronological order
asofdate = sorted(df['As of Date'].dt.date.unique())
#attempt 1
df1 = df.set_index('As of Date')[['Item', 'Price_Per_Unit']]
#attempt 2
df2 = df.pivot_table(values='Price_Per_Unit', index='Item', columns='As of Date')
#prices as of the last update
lastupdated = df2.iloc[:, -1]
To deal with the dates not being exactly one year apart, I have the function below (found on Stack Exchange), which finds the nearest available date:
def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))
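As an illustration, here is roughly how that function behaves on the three sample dates above. This is only a sketch: the names latest and one_year_back are made up for the example, and stepping back a year with replace() is an assumption that would need special handling for Feb 29.

```python
import datetime as dt


def nearest(items, pivot):
    # Return the element of items that is closest to pivot.
    return min(items, key=lambda x: abs(x - pivot))


# The three As of Dates from the sample data.
asofdate = [dt.date(2016, 4, 13), dt.date(2015, 12, 21), dt.date(2015, 3, 31)]

latest = max(asofdate)
# Same calendar day one year earlier (assumption: not Feb 29).
one_year_back = latest.replace(year=latest.year - 1)

match = nearest(asofdate, one_year_back)
print(match)  # 2015-03-31
```

So for 4/13/2016 the comparison date would be 3/31/2015, which is 13 days away from the exact one-year mark.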
I know this is a pretty in-depth question, but I would really appreciate any help or guidance anyone can provide. I've been reading tons of other posts, but please feel free to share some if you think they'd be helpful. Thanks for any help!
I'm not sure whether I have understood your problem correctly, but have a look at the snippet below.
import pandas as pd
import numpy as np
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

def get_prev_year_price(x, df):
    #look up the price one year earlier; df must be indexed by date
    try:
        return df.loc[x['prev_year_date'], 'price']
    except KeyError: #no data one year back, fall back to the current price (change = 0)
        return x['price']
#comma-separated so the sample survives copy and paste (tabs are easily lost)
TESTDATA=StringIO("""Item,price,date
Item 1,1.08908,4/13/2016
Item 2,2.03281,4/13/2016
Item 3,3.02619,4/13/2016
Item 1,1.56743,12/21/2015
Item 3,12.31867,12/21/2015
Item 2,0.98066,12/21/2015
Item 4,0.31701,12/21/2015
Item 3,0.6251,3/31/2015
Item 1,6.87538,3/31/2015
Item 2,0.3113,3/31/2015
Item 4,0.18724,3/31/2015""")
df = pd.read_csv(TESTDATA, sep=",")
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
data = []
for item in df['Item'].unique():
    item_df = df[df['Item'] == item] #select based on items
    select_dates = item_df['date'].unique()
    #set date as index and keep only the numeric price column, so mean() never sees the text column
    item_df = item_df.set_index('date')[['price']]
    item_df = item_df.resample('D').mean().reset_index() #fill in missing dates
    item_df['price'] = item_df['price'].interpolate('nearest') #fill in price with nearest price available (needs scipy)
    item_df['prev_year_date'] = item_df['date'] - pd.DateOffset(years=1) #calculate 1 year ago date
    #keep only the dates that were in the original data; copy to avoid SettingWithCopyWarning
    date_df = item_df[item_df['date'].isin(select_dates)].copy()
    item_df = item_df.set_index('date')
    date_df['prev_year_price'] = date_df.apply(lambda x: get_prev_year_price(x, item_df), axis=1)
    date_df['change'] = date_df['price'] / date_df['prev_year_price'] - 1
    date_df['Item'] = item
    data.append(date_df)
summary = pd.concat(data).sort_values('date')
print(summary)
Result:
           date     price  prev_year_date  prev_year_price     change    Item
0    2015-03-31   6.87538      2014-03-31          6.87538   0.000000  Item 1
0    2015-03-31   0.31130      2014-03-31          0.31130   0.000000  Item 2
0    2015-03-31   0.62510      2014-03-31          0.62510   0.000000  Item 3
0    2015-03-31   0.18724      2014-03-31          0.18724   0.000000  Item 4
265  2015-12-21   1.56743      2014-12-21          1.56743   0.000000  Item 1
265  2015-12-21   0.98066      2014-12-21          0.98066   0.000000  Item 2
265  2015-12-21  12.31867      2014-12-21         12.31867   0.000000  Item 3
265  2015-12-21   0.31701      2014-12-21          0.31701   0.000000  Item 4
379  2016-04-13   1.08908      2015-04-13          6.87538  -0.841597  Item 1
379  2016-04-13   2.03281      2015-04-13          0.31130   5.530067  Item 2
379  2016-04-13   3.02619      2015-04-13          0.62510   3.841129  Item 3
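Since the real file has several million rows, a per-item loop with daily resampling may get slow. One possible vectorized alternative, not what the snippet above does, is pd.merge_asof, which matches each row to the nearest earlier-year row per item in one pass. The sketch below makes two assumptions of its own: a 45-day tolerance for what counts as "about a year ago", and leaving the change as NaN (instead of 0) when no such price exists. It also adds the per-date average the question asks for.

```python
from io import StringIO

import pandas as pd

# Sample rows from the question, comma-separated for this sketch.
CSV = StringIO("""Item,price,date
Item 1,1.08908,4/13/2016
Item 2,2.03281,4/13/2016
Item 3,3.02619,4/13/2016
Item 1,1.56743,12/21/2015
Item 3,12.31867,12/21/2015
Item 2,0.98066,12/21/2015
Item 4,0.31701,12/21/2015
Item 3,0.6251,3/31/2015
Item 1,6.87538,3/31/2015
Item 2,0.3113,3/31/2015
Item 4,0.18724,3/31/2015
""")

df = pd.read_csv(CSV)
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")

# Left side: every observation plus the date one year earlier as lookup key.
left = df.assign(target=df["date"] - pd.DateOffset(years=1)).sort_values("target")

# Right side: the same observations, renamed to act as the earlier prices.
right = (
    df.rename(columns={"date": "prev_date", "price": "prev_price"})
    .sort_values("prev_date")
)

# Per item, take the observation nearest to the target date, but only if it
# lies within the assumed 45-day tolerance; otherwise prev_price stays NaN.
yoy = pd.merge_asof(
    left,
    right,
    left_on="target",
    right_on="prev_date",
    by="Item",
    direction="nearest",
    tolerance=pd.Timedelta(days=45),
)
yoy["change"] = yoy["price"] / yoy["prev_price"] - 1

# Average change per date; rows without a usable earlier price are skipped.
avg_change = yoy.groupby("date")["change"].mean()

print(yoy[["Item", "date", "prev_date", "change"]])
print(avg_change)
```

On the sample data only the 2016-04-13 rows find a match (3/31/2015, 13 days from the target), and their changes agree with the last three rows of the result above. Whether NaN or 0 is the better placeholder for dates without history depends on how the averages are going to be used.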
By the way, you could make your code shorter and more efficient by using functionality that is already built into pandas, e.g. for getting the unique dates and for converting date strings.
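For instance, a minimal sketch of both steps, using a made-up four-row frame with the column name from the question:

```python
import pandas as pd

# Hypothetical mini-frame; only the column name comes from the question.
df = pd.DataFrame(
    {"As of Date": ["4/13/2016", "12/21/2015", "4/13/2016", "3/31/2015"]}
)

# Convert the whole column at once instead of calling strptime per value.
df["As of Date"] = pd.to_datetime(df["As of Date"], format="%m/%d/%Y")

# Unique dates in chronological order, as a list of Timestamps.
unique_dates = df["As of Date"].drop_duplicates().sort_values().tolist()
print(unique_dates)
```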