
Averaging specific list elements iteratively?

Say I have a dataset with a variable, lines, that looks like this:

lines = ['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']

How do I average specific values across these lists and combine them into one averaged list, but only for lists whose first element (lines[0]) is exactly the same? Of course, I will have to convert all the numbers to floats.

In this specific example, I want a single list in which all the numeric values besides lines[1] and lines[-1] are averaged. Is there an easy way?

Expected output

['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, avg_of_var, avg_of_var, avg_of_var, avg_of_var, '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']

I see now that my example data is unfortunate, as all the values are the same, but basically I want a single list containing the average of the numeric values across the four lines in the example.

Would this simple Python snippet work?

# I am assuming lines is a list of lines
lines = [['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6'],
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6'],
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6'],
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']]


# Use a dict keyed on line[0] to tell the distinct first elements apart.
# The first time a key is seen, add its line to the dict;
# otherwise add each value to the running sum at the corresponding index.
# Also keep track of the number of lines per key to compute the average at the end.
average = {}
for line in lines:
    # first time: store the data, converted to float, in the dict
    # (a new list, so the original line is not modified)
    # and initialise qty to 1
    if line[0] not in average:
        average[line[0]] = {
            'data': [line[0]] + [float(x) for x in line[1:]],
            'qty' : 1
        }

        continue

    # add column data after type conversion to float
    i = 1
    while i < len(line):
        average[line[0]]['data'][i] = float(average[line[0]]['data'][i]) + float(line[i])
        i += 1

    average[line[0]]['qty'] += 1

# now create another list of required lines
merged_lines = []
for key in average:
    line = []
    line.append(key)
    # this is to calculate average
    for element in average[key]['data'][1:]:
        line.append(element/average[key]['qty'])

    merged_lines.append(line)

print(merged_lines)
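
The same grouping idea can be sketched more compactly with collections.defaultdict and zip. This is only a sketch: average_by_key is a made-up helper name, the sample rows are invented, and it assumes every element after the first converts cleanly with float().

```python
from collections import defaultdict

def average_by_key(lines):
    """Group rows by their first element and average the remaining columns.

    average_by_key is a hypothetical helper for illustration; it assumes
    every value after the key can be converted with float().
    """
    groups = defaultdict(list)
    for line in lines:
        groups[line[0]].append([float(x) for x in line[1:]])

    merged = []
    for key, rows in groups.items():
        # zip(*rows) transposes the rows so each tuple holds one column
        merged.append([key] + [sum(col) / len(col) for col in zip(*rows)])
    return merged

# Invented sample data, shorter than the rows in the question
sample = [['a', '1', '10'], ['a', '3', '20'], ['b', '5', '7']]
print(average_by_key(sample))
# [['a', 2.0, 15.0], ['b', 5.0, 7.0]]
```

Unlike the loop above, this keeps all rows for a key in memory before averaging, which is fine for small inputs but may not suit very large ones.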

You can use pandas to create a DataFrame, group by lines[0], and then aggregate with the mean (for the desired columns only). However, you also need to specify an aggregation method for the other columns. I will assume you need the mean for these columns as well.

import pandas as pd
from numpy import mean

lines = [['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, 10, 38, 0.0, 9,
          20050407, 20170319, 0, 0, 0, 0, 1, 1, 281.6],
         ['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, 10, 38, 0.0, 9,
          20050407, 20170319, 0, 0, 0, 0, 1, 1, 281.6],
         ['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, 10, 38, 0.0, 9,
          20050407, 20170319, 0, 0, 0, 0, 1, 1, 281.6],
         ['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, 10, 38, 0.0, 9,
          20050407, 20170319, 0, 0, 0, 0, 1, 1, 281.6]]
# I have removed the quotes around numbers for simplification but this can also be handled by pandas.

# create a data frame and give names to your fields.
# Here 'KEY' is the name of the first field we will use for grouping 
df = pd.DataFrame(lines,columns=['KEY','a','b','c','d','e','f','g','h','i','j','k','l','m','n'])

This yields something like this:

    KEY                                             a   b   c   d   e   f   g   h   i   j   k   l   m   n
0   QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=    1   10  38  0.0 9   20050407    20170319    0   0   0   0   1   1   281.6
1   QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=    1   10  38  0.0 9   20050407    20170319    0   0   0   0   1   1   281.6
2   QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=    1   10  38  0.0 9   20050407    20170319    0   0   0   0   1   1   281.6
3   QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=    1   10  38  0.0 9   20050407    20170319    0   0   0   0   1   1   281.6
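
The comment in the code above notes that pandas can also handle the quoted numbers. One possible way, sketched here under assumptions, is to build the frame from the strings and convert every column except the key with astype(float). The rows and the names raw, df_raw and value_cols are invented for this illustration, with fewer columns than the real data.

```python
import pandas as pd

# Small made-up rows with the numbers still quoted, as in the question
raw = [['k1', '1', '10', '281.6'],
       ['k1', '1', '20', '281.6']]
df_raw = pd.DataFrame(raw, columns=['KEY', 'a', 'b', 'n'])

# Convert every column except KEY from str to float
value_cols = df_raw.columns.drop('KEY')
df_raw[value_cols] = df_raw[value_cols].astype(float)

print(df_raw.dtypes)
```

After the conversion, the groupby and aggregation steps should work the same way as with the unquoted numbers.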

This is the operation you are looking for:

data = df.groupby('KEY',as_index=False).aggregate(mean)

This yields:

    KEY                                             a     b     c     d    e    f           g           h    i    j    k    l    m    n
0   QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=    1.0   10.0  38.0  0.0  9.0  20050407.0  20170319.0  0.0  0.0  0.0  0.0  1.0  1.0  281.6

You can specify the aggregation method for each field with a dictionary (assuming 'mean' for every field):

data = df.groupby('KEY',as_index=False).aggregate({'a':mean,'b':mean,'c':mean,'d':mean,'e':mean,'f':mean,'g':mean,'h':mean,'i':mean,'j':mean,'k':mean,'l':mean,'m':mean,'n':mean})
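
If some columns should be kept rather than averaged, as the question asks for lines[1] and lines[-1], the dictionary can mix aggregation methods. The following is a minimal sketch with invented data and fewer columns. It uses the string names 'mean' and 'first', and treating "keep as-is" as "take the first value in each group" is an assumption about what the asker wants.

```python
import pandas as pd

# Made-up rows: KEY, then a column to keep, two to average, one to keep
rows = [['k1', 1, 10, 38, 281.6],
        ['k1', 1, 20, 40, 281.6],
        ['k2', 2, 5, 6, 100.0]]
df_small = pd.DataFrame(rows, columns=['KEY', 'a', 'b', 'c', 'n'])

# Average 'b' and 'c'; keep the first value seen for 'a' and 'n'
result = df_small.groupby('KEY', as_index=False).agg(
    {'a': 'first', 'b': 'mean', 'c': 'mean', 'n': 'first'}
)
print(result)
```

Here 'first' simply returns the first value per group, so it only makes sense for columns whose value is the same on every line of a group.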

More information about groupby can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html
