Averaging specific list elements iteratively?
Say I have a dataset with a variable, lines, that looks like this:
lines = ['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']
How do I, if and only if lines[0] == lines[0], meaning only if the first element of the list is exactly the same, average specific values in the rest of the list and combine the result into one averaged list? Of course, I will have to convert all the numbers to floats.
In this specific example, I want a single list where all the numeric values besides lines[1] and lines[-1] are averaged. Is there an easy way?
Expected output:
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, avg_of_var, avg_of_var, avg, , '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']
Basically (and I see now that my example data is unfortunate, as all values are the same), I want a single list containing the average of the numeric values of the four lines in the example.
Will this simple Python snippet work?
# I am assuming lines is a list of lines
lines = [['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6'],
         ['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6'],
         ['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6'],
         ['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', '1', '10', '38', '0.0', '9', '20050407', '20170319', '0', '0', '0', '0', '1', '1', '281.6']]
# Use a dict keyed on line[0] to tell the groups apart.
# The first time a key is seen, store its values converted to float;
# otherwise add each value to the corresponding index.
# Also keep track of the number of lines per key to compute the average at the end.
average = {}
for line in lines:
    # first time: store the converted data (a new list, so the input
    # is not modified) and initialise qty to 1
    if line[0] not in average:
        average[line[0]] = {
            'data': [float(value) for value in line[1:]],
            'qty': 1
        }
        continue
    # add column data after type conversion to float
    for i, value in enumerate(line[1:]):
        average[line[0]]['data'][i] += float(value)
    average[line[0]]['qty'] += 1
# now create another list of the required lines
merged_lines = []
for key in average:
    merged_line = [key]
    # divide each column sum by the number of lines to get the average
    for element in average[key]['data']:
        merged_line.append(element / average[key]['qty'])
    merged_lines.append(merged_line)
print(merged_lines)
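The snippet above averages every column after the key. If, as in the question, only some columns should be averaged, a variant along these lines might work. This is a minimal sketch rather than the answerer's code: the function name `average_by_key` and its `avg_columns` parameter are hypothetical, the sample rows are shortened and made up, and it assumes the non-averaged columns can simply be copied from the first line of each group.

```python
from collections import defaultdict
from statistics import mean


def average_by_key(lines, avg_columns):
    """Merge lines sharing the same first element, averaging only avg_columns.

    Hypothetical helper: columns not listed in avg_columns are copied
    from the first line seen for each key (an assumption).
    """
    # Collect all lines that share the same key (first element).
    groups = defaultdict(list)
    for line in lines:
        groups[line[0]].append(line)

    merged = []
    for key, rows in groups.items():
        result = list(rows[0])  # copy so the input is not modified
        for col in avg_columns:
            result[col] = mean(float(row[col]) for row in rows)
        merged.append(result)
    return merged


# Shortened, made-up rows: key, flag, two values to average, total.
sample = [
    ['a', '1', '10', '38', '281.6'],
    ['a', '1', '20', '40', '281.6'],
    ['b', '1', '5', '7', '100.0'],
]
# Average columns 2 and 3 only; columns 1 and -1 are left untouched.
print(average_by_key(sample, avg_columns=[2, 3]))
```

Listing the column indices explicitly keeps fields such as the dates out of the average unless they are asked for.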
You can use pandas to create a DataFrame. You can then group by lines[0] and aggregate by mean (for the desired columns only). However, you also need to specify an aggregation method for the other columns. I will assume you need the mean for these columns as well.
import pandas as pd
from numpy import mean
lines = [['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, 10, 38, 0.0, 9,
20050407, 20170319, 0, 0, 0, 0, 1, 1, 281.6],
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, 10, 38, 0.0, 9,
20050407, 20170319, 0, 0, 0, 0, 1, 1, 281.6],
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, 10, 38, 0.0, 9,
20050407, 20170319, 0, 0, 0, 0, 1, 1, 281.6],
['QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ=', 1, 10, 38, 0.0, 9,
20050407, 20170319, 0, 0, 0, 0, 1, 1, 281.6]]
# I have removed the quotes around numbers for simplification but this can also be handled by pandas.
# create a data frame and give names to your fields.
# Here 'KEY' is the name of the first field we will use for grouping
df = pd.DataFrame(lines,columns=['KEY','a','b','c','d','e','f','g','h','i','j','k','l','m','n'])
This yields something like this:
KEY a b c d e f g h i j k l m n
0 QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ= 1 10 38 0.0 9 20050407 20170319 0 0 0 0 1 1 281.6
1 QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ= 1 10 38 0.0 9 20050407 20170319 0 0 0 0 1 1 281.6
2 QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ= 1 10 38 0.0 9 20050407 20170319 0 0 0 0 1 1 281.6
3 QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ= 1 10 38 0.0 9 20050407 20170319 0 0 0 0 1 1 281.6
This is the operation you are looking for:
data = df.groupby('KEY',as_index=False).aggregate(mean)
This yields:
KEY a b c d e f g h i j k l m n
0 QA7uiXy8vIbUSPOkCf9RwQ3FsT8jVq2OxDr8zqa7bRQ= 1 10 38 0.0 9 20050407 20170319 0 0 0 0 1 1 281.6
You can specify the aggregation type per field by using a dictionary (assuming 'mean' for every field):
data = df.groupby('KEY',as_index=False).aggregate({'a':mean,'b':mean,'c':mean,'d':mean,'e':mean,'f':mean,'g':mean,'h':mean,'i':mean,'j':mean,'k':mean,'l':mean,'m':mean,'n':mean})
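If the values arrive as strings, as in the question, and only some fields should be averaged, something like the following might do. It is a sketch under assumptions: the shortened rows and column names are made up for illustration, and using 'first' for the non-averaged fields is one possible choice, not part of the answer above.

```python
import pandas as pd

# Shortened, made-up rows: a key, a flag, two values to average, a total.
lines = [
    ['k1', '1', '10', '38', '281.6'],
    ['k1', '1', '20', '40', '281.6'],
    ['k2', '1', '5', '7', '100.0'],
]
df = pd.DataFrame(lines, columns=['KEY', 'a', 'b', 'c', 'n'])

# Convert the quoted numbers to numeric dtypes, column by column.
value_columns = ['a', 'b', 'c', 'n']
df[value_columns] = df[value_columns].apply(pd.to_numeric)

# Average 'b' and 'c'; keep the first value seen for 'a' and 'n'.
data = df.groupby('KEY', as_index=False).agg(
    {'a': 'first', 'b': 'mean', 'c': 'mean', 'n': 'first'}
)
print(data)
```

Passing the aggregation names as strings ('mean', 'first') avoids the need to import `mean` from NumPy.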
More information about groupby can be found here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.groupby.DataFrameGroupBy.agg.html