Performing multiple aggregate functions in MongoDB
I have a Python script that scrapes data from the web and stores it in a MongoDB database. The data looks like this:
{
    "id": "abcd",
    "value": 100.0,
    "timestamp": "2011-07-14 19:43:37"
}
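I have a lot of data like this. What I want to do is group the data by hour and get the average of the values as well as the minimum and maximum. From what I can see in the MongoDB documentation, the aggregation pipeline and map-reduce can either group by hour or compute the average, after which I could store the result back in the database and rerun the aggregation pipeline or a map-reduce on the intermediate data.
Is there a way to do this in one step, without storing the data in a temporary table and running a second pass?
This may help: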
from datetime import datetime
from itertools import groupby
from pprint import pprint
# assuming that collection of data objects is a list
datas = [
    {
        "id": "abcd",
        "value": 100.0,
        "timestamp": "2011-07-14 19:43:37"
    },
    {
        "id": "abcd",
        "value": 500.0,
        "timestamp": "2011-07-15 20:30:37"
    },
    {
        "id": "abcd",
        "value": 400.0,
        "timestamp": "2011-07-15 20:30:38"
    }
]
decorated_datas = []
# first we need to add a key with each data, that would be needed during sorting
# and that key would be date and hour
for data in datas:
    # assuming your timestamp is in this format only
    timestamp = datetime.strptime(data["timestamp"], "%Y-%m-%d %H:%M:%S")
    decorated_datas.append((timestamp.date(), timestamp.time().hour, data))
# then we sort the data created in the last step using the date and hour
sorted_decorated_datas = sorted(decorated_datas, key=lambda x: (x[0], x[1]))
# function for calculating statistics of a given collection of numbers
def calculate_stats(collection_of_numbers):
    maxVal = max(collection_of_numbers)
    minVal = min(collection_of_numbers)
    avgVal = sum(collection_of_numbers) / len(collection_of_numbers)
    return (maxVal, minVal, avgVal)
results = []
# then we group our sorted data by date and hour, and then we calculate
# statistics for the group and append result to our final results
for key, group_iter in groupby(sorted_decorated_datas, lambda x: (x[0], x[1])):
    group_values = [data[2]["value"] for data in group_iter]
    maxValue, minValue, avgValue = calculate_stats(group_values)
    result = {"date": key[0], "hour": key[1], "minVal": minValue,
              "maxVal": maxValue, "avgVal": avgValue}
    results.append(result)
pprint(results)
The output is:
[{'avgVal': 100.0,
  'date': datetime.date(2011, 7, 14),
  'hour': 19,
  'maxVal': 100.0,
  'minVal': 100.0},
 {'avgVal': 450.0,
  'date': datetime.date(2011, 7, 15),
  'hour': 20,
  'maxVal': 500.0,
  'minVal': 400.0}]
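A note on why the code sorts before grouping: itertools.groupby only merges items that are adjacent and share a key, so unsorted input can yield the same hour more than once. The following is a minimal sketch of that behavior; the list of hour values is made up for illustration and is not taken from the data above.

```python
from itertools import groupby

# Hypothetical hour keys, deliberately out of order.
hours = [19, 20, 19]

# Without sorting, the two 19s are not adjacent and land in separate groups.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(hours)]

# After sorting, equal keys are adjacent and merge into one group.
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(hours))]

print(unsorted_groups)  # [(19, 1), (20, 1), (19, 1)]
print(sorted_groups)    # [(19, 2), (20, 1)]
```

Sorting and grouping with the same key function, as the code above does, avoids the split groups.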
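Edit: after some thought, I realized that the timestamps do not need to be converted to datetime objects at all. In this string format they sort correctly on their own, so here is the updated code: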
from itertools import groupby
from pprint import pprint
# assuming that collection of data objects is a list
datas = [
    {
        "id": "abcd",
        "value": 100.0,
        "timestamp": "2011-07-14 19:43:37"
    },
    {
        "id": "abcd",
        "value": 500.0,
        "timestamp": "2011-07-15 20:30:37"
    },
    {
        "id": "abcd",
        "value": 400.0,
        "timestamp": "2011-07-15 20:30:38"
    }
]
def get_date_and_hour(timestamp_str):
    # "2011-07-14 19:43:37" -> (2011, 7, 14, 19)
    date, time = timestamp_str.split()
    date = tuple(map(int, date.split('-')))
    hour = int(time.split(':')[0])
    return (*date, hour)
def calculate_stats(collection_of_numbers):
    maxVal = max(collection_of_numbers)
    minVal = min(collection_of_numbers)
    avgVal = sum(collection_of_numbers) / len(collection_of_numbers)
    return (maxVal, minVal, avgVal)
results = []
sorted_datas = sorted(datas, key=lambda x: x["timestamp"])
for key, group_iter in groupby(sorted_datas, lambda x: get_date_and_hour(x["timestamp"])):
    group_values = [data["value"] for data in group_iter]
    maxValue, minValue, avgValue = calculate_stats(group_values)
    result = {"date": key[0:3], "hour": key[3], "minVal": minValue,
              "maxVal": maxValue, "avgVal": avgValue}
    results.append(result)
pprint(results)
The output is:
[{'avgVal': 100.0,
  'date': (2011, 7, 14),
  'hour': 19,
  'maxVal': 100.0,
  'minVal': 100.0},
 {'avgVal': 450.0,
  'date': (2011, 7, 15),
  'hour': 20,
  'maxVal': 500.0,
  'minVal': 400.0}]
This version is shorter and more maintainable than the previous one.
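Both versions above do the grouping on the client in Python. As for the one-step question itself, a single $group stage in the aggregation pipeline can carry several accumulators at once, so the average, minimum and maximum might be computed server-side in one pass. The sketch below is not part of the original answer. It assumes the timestamps are stored as "YYYY-MM-DD HH:MM:SS" strings as in the question, so that the first 13 characters identify the date and hour. The helper name hourly_stats and the collection passed to it are hypothetical.

```python
# Sketch only: assumes "timestamp" is a "YYYY-MM-DD HH:MM:SS" string
# and "value" is numeric, as in the question's sample document.
pipeline = [
    {
        "$group": {
            # First 13 characters, e.g. "2011-07-14 19", identify date and hour.
            # ($substr is the older spelling; newer servers also offer $substrBytes.)
            "_id": {"$substr": ["$timestamp", 0, 13]},
            "avgVal": {"$avg": "$value"},
            "minVal": {"$min": "$value"},
            "maxVal": {"$max": "$value"},
        }
    },
    # Return the hourly buckets in chronological order.
    {"$sort": {"_id": 1}},
]


def hourly_stats(collection):
    """Run the pipeline on a pymongo-style collection and return the results as a list."""
    return list(collection.aggregate(pipeline))
```

With pymongo this would be called along the lines of hourly_stats(db.readings), where db.readings stands in for whatever collection holds the scraped documents. No intermediate collection is written, because all three accumulators live in the same $group stage.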