Pandas DataFrame: organizing and calculating

Question

I need help. I've spent all day (14+ hours) trying to set up a pandas DataFrame for a test file that I made. My real file is a CSV that is several million lines long, so I am trying to find the fastest and most effective way of handling the data. What I need to do is calculate year-over-year changes in prices for a list of items.

The data I have looks like this after I drop the unneeded columns:

Item    Price   As of Date
Item 1  1.08908 4/13/2016
Item 2  2.03281 4/13/2016
Item 3  3.02619 4/13/2016
Item 1  1.56743 12/21/2015
Item 3  12.31867    12/21/2015
Item 2  0.98066 12/21/2015
Item 4  0.31701 12/21/2015
Item 3  0.6251  3/31/2015
Item 1  6.87538 3/31/2015
Item 2  0.3113  3/31/2015
Item 4  0.18724 3/31/2015

First, I need to get the data into a shape that lets me make the year-over-year calculation. It is arranged with columns for the Item, Price, and the As of Date. I need to somehow arrange the data and calculate, for each date that is given, the year-over-year percentage change in price for each item listed, and then find the average of the changes per date.

Below is what I have tried in order to arrange the data, but I am having trouble figuring out which way is best and then how to calculate the y/y change.

import pandas as pd
import datetime as dt
import numpy as np

df = pd.read_csv('...python test file.csv')
asofdate = set ()

#sorting the dataframe chronologically by As of Date
df.sort_values(df.columns[11])

asofdate = list(df.apply(set)[11])
asofdate = [dt.datetime.strptime(date, '%m/%d/%Y').date() for date in asofdate]

#attempt 1
df = df.set_index("As of Date")
df = df[['Item','Price_Per_Unit']]

#attempt 2
df2 = df.pivot_table('Price_Per_Unit',['Item'], 'As of Date')

#date of lastupdate
lastupdated = df2.iloc[:,-1]

To deal with the dates not being exactly one year apart, I have the function below (found on Stack Exchange), which finds the nearest date:

def nearest(items, pivot):
    return min(items, key=lambda x: abs(x - pivot))

I know this is a pretty in-depth question, but I would really appreciate any help or guidance anyone can provide. I've been reading tons of other posts, but please feel free to share some if you think they'd be helpful. Thanks for any help!

Answer 1

I'm not sure I've understood your problem correctly, but have a look at the snippet below.

import pandas as pd
from io import StringIO


def get_prev_year_price(row, prices):
    """Look up the price on row['prev_year_date'] in a date-indexed frame."""
    try:
        return prices.loc[row['prev_year_date'], 'price']
    except KeyError:
        # No data that far back: fall back to the current price (change = 0).
        return row['price']


# Same sample data as in the question, comma-separated so it survives copy-paste.
TESTDATA = StringIO("""Item,price,date
Item 1,1.08908,4/13/2016
Item 2,2.03281,4/13/2016
Item 3,3.02619,4/13/2016
Item 1,1.56743,12/21/2015
Item 3,12.31867,12/21/2015
Item 2,0.98066,12/21/2015
Item 4,0.31701,12/21/2015
Item 3,0.6251,3/31/2015
Item 1,6.87538,3/31/2015
Item 2,0.3113,3/31/2015
Item 4,0.18724,3/31/2015""")

df = pd.read_csv(TESTDATA)
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

data = []
for item in df['Item'].unique():
    # Prices for this item only, indexed by date.
    item_df = df.loc[df['Item'] == item, ['date', 'price']].set_index('date')
    select_dates = item_df.index.unique()
    # Fill in the missing calendar days, then take the nearest available price.
    # Note: interpolate('nearest') requires SciPy.
    item_df = item_df.resample('D').mean().reset_index()
    item_df['price'] = item_df['price'].interpolate('nearest')
    item_df['prev_year_date'] = item_df['date'] - pd.DateOffset(years=1)  # date one year earlier
    # Keep only the dates that were actually in the data.
    date_df = item_df[item_df['date'].isin(select_dates)].copy()
    lookup = item_df.set_index('date')
    date_df['prev_year_price'] = date_df.apply(
        lambda row: get_prev_year_price(row, lookup), axis=1)
    date_df['change'] = date_df['price'] / date_df['prev_year_price'] - 1
    date_df['Item'] = item
    data.append(date_df)

summary = pd.concat(data).sort_values('date')
print(summary)

Result:

          date     price prev_year_date  prev_year_price    change    Item
0   2015-03-31   6.87538     2014-03-31          6.87538  0.000000  Item 1
0   2015-03-31   0.31130     2014-03-31          0.31130  0.000000  Item 2
0   2015-03-31   0.62510     2014-03-31          0.62510  0.000000  Item 3
0   2015-03-31   0.18724     2014-03-31          0.18724  0.000000  Item 4
265 2015-12-21   1.56743     2014-12-21          1.56743  0.000000  Item 1
265 2015-12-21   0.98066     2014-12-21          0.98066  0.000000  Item 2
265 2015-12-21  12.31867     2014-12-21         12.31867  0.000000  Item 3
265 2015-12-21   0.31701     2014-12-21          0.31701  0.000000  Item 4
379 2016-04-13   1.08908     2015-04-13          6.87538 -0.841597  Item 1
379 2016-04-13   2.03281     2015-04-13          0.31130  5.530067  Item 2
379 2016-04-13   3.02619     2015-04-13          0.62510  3.841129  Item 3
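
The question also asks for the average change per date. A minimal sketch of that last step, assuming a frame with the same `date` and `change` columns as the result above (the small `summary` frame here is rebuilt by hand from that output purely for illustration):

```python
import pandas as pd

# Illustrative stand-in for the summary frame: only the 'date' and 'change'
# columns, with values copied from the printed result.
summary = pd.DataFrame({
    "date": pd.to_datetime(
        ["2015-03-31"] * 4 + ["2015-12-21"] * 4 + ["2016-04-13"] * 3
    ),
    "change": [0.0] * 8 + [-0.841597, 5.530067, 3.841129],
})

# Average year-over-year change across items, one value per date.
avg_change = summary.groupby("date")["change"].mean()
print(avg_change)
```

For 2016-04-13 this gives roughly 2.8432. The zeros on the two earlier dates only reflect the fallback used when no price from a year earlier exists, so you may prefer to treat those rows as missing before averaging.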

By the way, you could make your code shorter and more efficient by using pandas built-in functions, e.g. to get the unique dates or to convert date strings.
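
As a rough sketch of what those built-ins look like: `pd.to_datetime` converts the date strings, `Series.unique` returns the distinct dates, and `pd.merge_asof` can do the "nearest date one year earlier" lookup without a Python loop, which may matter for a file with millions of rows. The column names, the helper columns `target_date`, `prev_date` and `prev_price`, and the 60-day tolerance are assumptions made for this example, not something taken from the question.

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["Item 1", "Item 2", "Item 3", "Item 1", "Item 3", "Item 2",
             "Item 4", "Item 3", "Item 1", "Item 2", "Item 4"],
    "price": [1.08908, 2.03281, 3.02619, 1.56743, 12.31867, 0.98066,
              0.31701, 0.6251, 6.87538, 0.3113, 0.18724],
    "date": ["4/13/2016"] * 3 + ["12/21/2015"] * 4 + ["3/31/2015"] * 4,
})

# Convert the strings to real timestamps and list the distinct dates.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
unique_dates = df["date"].unique()

# For every row, the date we would like a price for: one year earlier.
cur = df.copy()
cur["target_date"] = cur["date"] - pd.DateOffset(years=1)

# The same observations, renamed so they can act as the "previous" side.
prev = df.rename(columns={"date": "prev_date", "price": "prev_price"})

# Per item, match each target date to the nearest observed date.
# Assumption: a match more than 60 days away counts as "no data" (NaN).
out = pd.merge_asof(
    cur.sort_values("target_date"),
    prev.sort_values("prev_date"),
    left_on="target_date",
    right_on="prev_date",
    by="Item",
    direction="nearest",
    tolerance=pd.Timedelta(days=60),
)
out["change"] = out["price"] / out["prev_price"] - 1
print(out.sort_values(["date", "Item"]))
```

Unlike the loop above, rows without a price near the one-year-earlier date end up with `NaN` instead of a change of 0, so they drop out of any later average.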

Answer 1 (accepted), 2017-09-08 04:26:32
