Group Python lists together for common element

I'm querying Google Analytics data for sessions and users for each country. I want to save this data in my db for each single day so I can access it later on.

My query gives me back a really big JSON and I'm trying to find the optimal solution to maximise speed.

First of all, I managed to get the data back ordered by sessions, which means that I can now save only the top 10 countries in my db, instead of saving a new row for every country each day.

I think this is the minimum amount of data I need in order to have valuable info. So I have now structured my db to accept data like this:

20170101 | US | 112 (sessions) | 111 (users)
20170101 | CA | 111 (sessions) | 221 (users)
... (for 8 more rows)
20170102 | US | 11 (sessions) | 22 (users)
... (and so on, so 10 rows per day)

Now the big JSON that I get back looks something like this (I've removed a lot of metrics in the middle):

m = {
'reports': [{
    'data': {
        'rowCount': 2003,
        'maximums': [{
            'values': ['1219', '1109']
        }],
        'minimums': [{
            'values': ['1', '1']
        }],
        'totals': [{
            'values': ['33505', '30382']
        }],
        'rows': [{
            'dimensions': ['20170404', 'US'],
            'metrics': [{
                'values': ['1219', '1091']
            }]
        }, {
            'dimensions': ['20170406', 'US'],
            'metrics': [{
                'values': ['1203', '1109']
            }]
        }, {
            'dimensions': ['20170405', 'US'],
            'metrics': [{
                'values': ['1185', '1073']
            }]
        }, {
            'dimensions': ['20170408', 'PL'],
            'metrics': [{
                'values': ['2', '1']
            }]
        }, {
            'dimensions': ['20170408', 'SG'],
            'metrics': [{
                'values': ['2', '2']
            }]
        }, {
            'dimensions': ['20170408', 'TT'],
            'metrics': [{
                'values': ['2', '2']
            }]
        }]
    },
    'nextPageToken': '1000',
    'columnHeader': {
        'dimensions': ['ga:date', 'ga:countryIsoCode'],
        'metricHeader': {
            'metricHeaderEntries': [{
                'name': 'ga:sessions',
                'type': 'INTEGER'
            }, {
                'name': 'ga:users',
                'type': 'INTEGER'
            }]
        }
    }
}]
}

I'm trying to figure out how I can extract the top 10 countries with the most sessions for each day and save this info in my db. So far I came up with:

x = m['reports'][0]['data']['rows']

l =[]
for data in x:
    date = data['dimensions'][0]
    country = data['dimensions'][1]
    sessions = data['metrics'][0]['values'][0]
    users = data['metrics'][0]['values'][1]
    n = [date, [country,sessions, users]]
    l.append(n)

This gives me a list whose items have the format [date, [country, sessions, users]], so something like this:

[['20170404', ['US', '1219', '1091']],
 ['20170406', ['US', '1203', '1109']],
 ['20170405', ['US', '1185', '1073']],
 ['20170408', ['PL', '2', '1']],
 ['20170408', ['SG', '2', '2']],
 ['20170408', ['TT', '2', '2']]]

Now I was thinking of nesting another for loop which checks the date and, if it's the same, adds the values z[1] to the same list, so for every date I would have a list with the values for each single country. However, I'm not sure how to group these lists together according to the first value z[0]; plus, this would include all the countries and not only the top 10.

Is there an easier way to accomplish this given the big JSON above? If not, how do I group the lists together according to the first value, and how do I then sort by sessions?

Thanks!

When there are no duplicate countries per day, you could use defaultdicts to manage the different levels of grouping (magically):

import pprint
from collections import defaultdict

def recursive_defaultdict():
    return defaultdict(recursive_defaultdict)

l = recursive_defaultdict()

x = m['reports'][0]['data']['rows']

for data in x:
    date = data['dimensions'][0]
    country = data['dimensions'][1]
    sessions = data['metrics'][0]['values'][0]
    users = data['metrics'][0]['values'][1]

    l[date][country] = {'sessions': sessions, 'users': users}

pprint.pprint(l)

This returns a dict that you can easily iterate over:

defaultdict(<function recursive_defaultdict at 0x7f3ecfb45e18>,
            {'20170404': defaultdict(<function recursive_defaultdict at 0x7f3ecfb45e18>,
                                     {'US': {'sessions': '1219',
                                             'users': '1091'}}),
             '20170405': defaultdict(<function recursive_defaultdict at 0x7f3ecfb45e18>,
                                     {'US': {'sessions': '1185',
                                             'users': '1073'}}),
             '20170406': defaultdict(<function recursive_defaultdict at 0x7f3ecfb45e18>,
                                     {'US': {'sessions': '1203',
                                             'users': '1109'}}),
             '20170408': defaultdict(<function recursive_defaultdict at 0x7f3ecfb45e18>,
                                     {'PL': {'sessions': '2', 'users': '1'},
                                      'SG': {'sessions': '2', 'users': '2'},
                                      'TT': {'sessions': '2', 'users': '2'}})})
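
The assignment l[date][country] = {...} overwrites any earlier entry for the same date and country, which is why the approach relies on there being no duplicates; it also keeps the metric values as strings. If duplicates can occur, one possible variation is to accumulate integer totals instead. The sketch below is only an illustration: the sample rows (including the repeated US row) and the name totals are made up, and whether summing users across duplicate rows is meaningful for your report is an assumption you would need to check.

```python
from collections import defaultdict

# Hypothetical sample rows in the same shape as m['reports'][0]['data']['rows'].
# The second US row for 20170404 is invented to demonstrate a duplicate.
rows = [
    {'dimensions': ['20170404', 'US'], 'metrics': [{'values': ['1219', '1091']}]},
    {'dimensions': ['20170404', 'US'], 'metrics': [{'values': ['1', '1']}]},
    {'dimensions': ['20170408', 'PL'], 'metrics': [{'values': ['2', '1']}]},
]

# date -> country -> running totals, created on first access and starting at zero
totals = defaultdict(lambda: defaultdict(lambda: {'sessions': 0, 'users': 0}))

for row in rows:
    date, country = row['dimensions']
    sessions, users = row['metrics'][0]['values']
    # The API returns the metrics as strings, so convert before adding
    totals[date][country]['sessions'] += int(sessions)
    totals[date][country]['users'] += int(users)

print(totals['20170404']['US'])
```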

To retrieve a specific date/country combination:

print (l['20170404']['US'])
>>> {'sessions': '1219', 'users': '1091'}

Iterate through the result:

for date, values in l.items():
    for country, value in values.items():
        print (date, country, value)
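
The grouping above does not yet pick the top 10 countries per day that the question asks for. A minimal sketch of one way to do that follows: group the rows by date, convert the metrics to integers (as strings, '2' would sort above '1219'), sort each day by sessions, and keep the first n. The function name top_countries_per_day and its parameter n are hypothetical names chosen for this illustration, and the sample rows are the six shown in the question.

```python
from collections import defaultdict

# The six sample rows from the question's JSON (m['reports'][0]['data']['rows'])
rows = [
    {'dimensions': ['20170404', 'US'], 'metrics': [{'values': ['1219', '1091']}]},
    {'dimensions': ['20170406', 'US'], 'metrics': [{'values': ['1203', '1109']}]},
    {'dimensions': ['20170405', 'US'], 'metrics': [{'values': ['1185', '1073']}]},
    {'dimensions': ['20170408', 'PL'], 'metrics': [{'values': ['2', '1']}]},
    {'dimensions': ['20170408', 'SG'], 'metrics': [{'values': ['2', '2']}]},
    {'dimensions': ['20170408', 'TT'], 'metrics': [{'values': ['2', '2']}]},
]


def top_countries_per_day(rows, n=10):
    """Return flat (date, country, sessions, users) tuples, at most n per date."""
    by_date = defaultdict(list)
    for row in rows:
        date, country = row['dimensions']
        # Convert to int so the sort is numeric rather than alphabetical
        sessions, users = (int(v) for v in row['metrics'][0]['values'])
        by_date[date].append((date, country, sessions, users))

    result = []
    for date in sorted(by_date):
        # Highest sessions first; the sort is stable, so ties keep input order
        ranked = sorted(by_date[date], key=lambda r: r[2], reverse=True)
        result.extend(ranked[:n])
    return result


# n=2 only so the cut-off is visible with this small sample; use n=10 for real data
top_rows = top_countries_per_day(rows, n=2)
for record in top_rows:
    print(record)
```

Each tuple has the same column order as the table layout in the question (date, country, sessions, users), so the list could probably be handed to whatever bulk insert your database driver offers. Note that the sample response contains a nextPageToken, so a single response may not include every row for a given day.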
