简体   繁体   中英

Iterate pandas data frame for rows consists of arrays and compute a moving average based on condition

I can't figure out a problem I am trying to solve. I have a pandas data frame coming from this:

date,       id,     measure,    result
2016-07-11, 31, "[2, 5, 3, 3]",     1
2016-07-12, 32, "[3, 5, 3, 3]",     1
2016-07-13, 33, "[2, 1, 2, 2]",     1
2016-07-14, 34, "[2, 6, 3, 3]",     1
2016-07-15, 35, "[39, 31, 73, 34]", 0
2016-07-16, 36, "[3, 2, 3, 3]",     1
2016-07-17, 37, "[3, 8, 3, 3]",     1

Measurements column consists of arrays in string format.

I want to have a new moving-average-array column from the past 3 measurement records, excluding those records where the result is 0. Past 3 records mean that for id 34, the arrays of id 31,32,33 to be used.

It is about taking the average of every 1st point, 2nd point, 3rd and 4th point to have this moving-average-array .

It is not about getting the average of 1st array, 2nd array ... and then averaging the average, no .

For the first 3 rows, because there is not enough history, I just want to use their own measurement. So the solution should look like this:

date,       id,     measure,    result .     Solution
2016-07-11, 31, "[2, 5, 3, 3]",     1,      "[2,   5, 3,   3]"
2016-07-12, 32, "[3, 5, 3, 3]",     1,      "[3,   5, 3,   3]"
2016-07-13, 33, "[2, 1, 2, 2]",     1,      "[2,   1, 2,   2]"
2016-07-14, 34, "[2, 6, 3, 3]",     1,      "[2.3, 3.6, 2.6, 2.6]"
2016-07-15, 35, "[39, 31, 73, 34]", 0,      "[2.3, 4, 2.6, 2.6]"
2016-07-16, 36, "[3, 2, 3, 3]",     1,      "[2.3, 4, 2.6, 2.6]"
2016-07-17, 37, "[3, 8, 3, 3]",     1,      "[2.3, 3, 2.6, 2.6]"

The real data is bigger. result 0 may repeat 2 or more times after each other also. I think it will be about keeping a track of previous OK result s properly getting those averages. I spent time but I could not.

I am posting the dataframe here:

 mydict = {'date': {0: '2016-07-11',
      1: '2016-07-12',
      2: '2016-07-13',
      3: '2016-07-14',
      4: '2016-07-15',
      5: '2016-07-16',
      6: '2016-07-17'},
     'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
     'measure': {0: '[2, 5, 3, 3]',
      1: '[3, 5, 3, 3]',
      2: '[2, 1, 2, 2]',
      3: '[2, 6, 3, 3]',
      4: '[39, 31, 73, 34]',
      5: '[3, 2, 3, 3]',
      6: '[3, 8, 3, 3]'},
     'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}

df = pd.DataFrame(mydict)

Thank you for giving directions or pointing out how to.

Solution using only 1 for loop:

Considering the data:

mydict = {'date': {0: '2016-07-11',
      1: '2016-07-12',
      2: '2016-07-13',
      3: '2016-07-14',
      4: '2016-07-15',
      5: '2016-07-16',
      6: '2016-07-17'},
     'id': {0: 31, 1: 32, 2: 33, 3: 34, 4: 35, 5: 36, 6: 37},
     'measure': {0: '[2, 5, 3, 3]',
      1: '[3, 5, 3, 3]',
      2: '[2, 1, 2, 2]',
      3: '[2, 6, 3, 3]',
      4: '[39, 31, 73, 34]',
      5: '[3, 2, 3, 3]',
      6: '[3, 8, 3, 3]'},
     'result': {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 1, 6: 1}}
df = pd.DataFrame(mydict)

I defined a simple function to calculate the means and return a list. Then, loop the dataframe applying the rules:

def calc_mean(in_list):
    p0 = round((in_list[0][0] + in_list[1][0] + in_list[2][0])/3,1)
    p1 = round((in_list[0][1] + in_list[1][1] + in_list[2][1])/3,1)
    p2 = round((in_list[0][2] + in_list[1][2] + in_list[2][2])/3,1)
    p3 = round((in_list[0][3] + in_list[1][3] + in_list[2][3])/3,1)
    return [p0, p1, p2, p3]

Solution = []
aux_list = []
for index, row in df.iterrows():
    if index in [0,1,2]:
        Solution.append(row.measure)
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])
    else:
        Solution.append('[' +', '.join(map(str, calc_mean(aux_list))) + ']')
        if row.result > 0:
            aux_list.pop(0)
            aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])        
df['Solution'] = Solution

The output is:

在此处输入图片说明

Please note that the result is rounded to 1 decimal place, a bit different from your desired output. Made more sense to me.

EDIT:

As a suggestion in the comments by @Frenchy, to deal with result == 0 in the first 3 rows, we need to change a bit the first if clause:

if index in [0,1,2] or len(aux_list) <3:
    Solution.append(row.measure)
    if row.result > 0:
        aux_list.append([int(x) for x in row.measure[1:-1].split(', ')])

You can use pd.eval to change from a str of list to a proper list only the part of the data in measure where result is not 0. Use rolling with mean and then shift to get the rolling average over the last 3 rows at the next row. Then map to str once your dataframe is changed to a list of list with values and tolist . Finally you just need to replace the first three rows and ffill the missing data:

df.loc[df.result.shift() != 0,'solution'] = list(map(str,
                              pd.DataFrame(pd.eval(df[df.result != 0].measure))
                                .rolling(3).mean().shift().values.tolist()))
df.loc[:2,'solution'] = df.loc[:2,'measure']
df.solution = df.solution.ffill()

Here's another solution:

# get data to reproduce example
from io import StringIO
data = StringIO(""" 
    date;id;measure;result 
    2016-07-11;31;"[2,5,3,3]";1 
    2016-07-12;32;"[3,5,3,3]";1 
    2016-07-13;33;"[2,1,2,2]";1 
    2016-07-14;34;"[2,6,3,3]";1 
    2016-07-15;35;"[39,31,73,34]";0 
    2016-07-16;36;"[3,2,3,3]";1 
    2016-07-17;37;"[3,8,3,3]";1 
    """)  

df = pd.read_csv(data, sep=";")
df
# Out:
#          date  id        measure  result
# 0  2016-07-11  31      [2,5,3,3]       1
# 1  2016-07-12  32      [3,5,3,3]       1
# 2  2016-07-13  33      [2,1,2,2]       1
# 3  2016-07-14  34      [2,6,3,3]       1
# 4  2016-07-15  35  [39,31,73,34]       0
# 5  2016-07-16  36      [3,2,3,3]       1
# 6  2016-07-17  37      [3,8,3,3]       1  

# convert values in measure column to lists
from ast import literal_eval
dm = df['measure'].apply(literal_eval)

# apply rolling mean with period 2 and recollect values into list in column means
df["means"] = dm.apply(pd.Series).rolling(2, min_periods=0).mean().values.tolist()                            

df                                                                                                           
# Out: 
#          date  id        measure  result                     means
# 0  2016-07-11  31      [2,5,3,3]       1      [2.0, 5.0, 3.0, 3.0]
# 1  2016-07-12  32      [3,5,3,3]       1      [2.5, 5.0, 3.0, 3.0]
# 2  2016-07-13  33      [2,1,2,2]       1      [2.5, 3.0, 2.5, 2.5]
# 3  2016-07-14  34      [2,6,3,3]       1      [2.0, 3.5, 2.5, 2.5]
# 4  2016-07-15  35  [39,31,73,34]       0  [20.5, 18.5, 38.0, 18.5]
# 5  2016-07-16  36      [3,2,3,3]       1  [21.0, 16.5, 38.0, 18.5]
# 6  2016-07-17  37      [3,8,3,3]       1      [3.0, 5.0, 3.0, 3.0]

# moving window of size 3
df["means"] = dm.apply(pd.Series).rolling(3, min_periods=0).mean().round(2).values.tolist()
df
# Out: 
#             date  id        measure  result                        means
# 0  2016-07-11  31      [2,5,3,3]       1         [2.0, 5.0, 3.0, 3.0]
# 1  2016-07-12  32      [3,5,3,3]       1         [2.5, 5.0, 3.0, 3.0]
# 2  2016-07-13  33      [2,1,2,2]       1     [2.33, 3.67, 2.67, 2.67]
# 3  2016-07-14  34      [2,6,3,3]       1      [2.33, 4.0, 2.67, 2.67]
# 4  2016-07-15  35  [39,31,73,34]       0   [14.33, 12.67, 26.0, 13.0]
# 5  2016-07-16  36      [3,2,3,3]       1  [14.67, 13.0, 26.33, 13.33]
# 6  2016-07-17  37      [3,8,3,3]       1  [15.0, 13.67, 26.33, 13.33]    

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM