Which is the fastest way to count elements in an array

Question

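I have a large list that contains country names, each repeated many times. I want to create a new, smaller list that holds the number of times each country appears in the larger list.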

[
      {
          "country_name": "US"
      },
      {
          "country_name": "US"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      }
]

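What is the fastest way to generate a new list from this data? I can do it with a for loop, but it is too slow. Are there other options? I want something like this: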

[
      {
          "country_name": "US",
          "count": 2
      },
      {
          "country_name": "GERMANY",
          "count": 3
      },
      {
          "country_name": "ITALY",
          "count": 4
      }
]

Answer 1

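I don't have a large file to check the speed and compare which approach is faster.

You can try collections.Counter, but it still needs a for loop to get the values out of the dictionaries.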

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import collections

c = collections.Counter([x["country_name"] for x in data])

print(c)

Result:

Counter({'ITALY': 5, 'GERMANY': 3, 'US': 2})
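
The Counter still has to be turned into the list-of-dicts shape the question asks for. A minimal sketch of that step, using a shortened sample list; `to_records` is a hypothetical helper name, and `map` with `operator.itemgetter` is swapped in for the list comprehension so that key extraction happens inside built-ins. Whether that is measurably faster on real data is something to time rather than assume.

```python
from collections import Counter
from operator import itemgetter

# Shortened sample data (illustrative, not the full list from the question)
data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]


def to_records(rows, key="country_name"):
    """Count the values stored under `key` and return a list of dicts."""
    counts = Counter(map(itemgetter(key), rows))
    # Counter keeps first-seen order, so the output follows the input order
    return [{key: name, "count": n} for name, n in counts.items()]


records = to_records(data)
print(records)
```

If the result should be sorted by frequency, `counts.most_common()` can replace `counts.items()` in the comprehension.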

You could also convert the data to a pandas.DataFrame (or read it from a file or database directly into a DataFrame) and do all the calculations with pandas, which runs much of its work in compiled code.

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import pandas as pd

df = pd.DataFrame(data)
#print(df)

print( df['country_name'].value_counts() )

Result:

ITALY      5
GERMANY    3
US         2
Name: country_name, dtype: int64
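
To get from pandas back to the list of dicts shown in the question, one possible route is `groupby(...).size()` followed by `to_dict("records")`. This is a sketch on a shortened sample list, not the only way to do it; `sort=False` is my choice here so the groups keep the order in which the countries first appear.

```python
import pandas as pd

# Shortened sample data (illustrative)
data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

df = pd.DataFrame(data)

records = (
    df.groupby("country_name", sort=False)  # keep first-seen order
      .size()                               # rows per country
      .reset_index(name="count")            # back to a two-column DataFrame
      .to_dict("records")                   # list of {"country_name", "count"} dicts
)
print(records)
```

Building the DataFrame from a Python list has its own cost, so this tends to pay off mainly when the data is already in pandas or is read straight from a file.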

Another method is to keep the data in a database and use SQL for this. Database engines are usually faster than Python loops.

SELECT country_name, COUNT(*) FROM data GROUP BY country_name

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import sqlite3


def init(db, data):
    cur = db.cursor()

    cur.execute('DROP TABLE IF EXISTS data')

    cur.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY AUTOINCREMENT, country_name TEXT)')

    for item in data:
        cur.execute('INSERT INTO data (country_name) VALUES (?)', (item['country_name'], ))

    db.commit()

    cur.close()

# ---

db = sqlite3.connect('data.db')

init(db, data)  # use it only once

cur = db.cursor()

#result = cur.execute('SELECT * FROM data')
result = cur.execute('SELECT country_name, COUNT(*) FROM data GROUP BY country_name')
for row in result:
    print('row:', row)

cur.close()

Result:

row: ('GERMANY', 3)
row: ('ITALY', 5)
row: ('US', 2)
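
A more compact variant of the SQLite approach, as a sketch: it assumes an in-memory database (`:memory:`) instead of `data.db`, inserts all rows with one `executemany` call, and uses `sqlite3.Row` so each result row converts directly to a dict in the shape the question asks for. The sample list is shortened for illustration.

```python
import sqlite3

# Shortened sample data (illustrative)
data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

db = sqlite3.connect(":memory:")   # assumption: throwaway in-memory database
db.row_factory = sqlite3.Row       # rows become addressable by column name

db.execute("CREATE TABLE data (country_name TEXT)")
# One executemany call instead of one execute call per item
db.executemany(
    "INSERT INTO data (country_name) VALUES (?)",
    ((item["country_name"],) for item in data),
)
db.commit()

query = """
    SELECT country_name, COUNT(*) AS "count"
    FROM data
    GROUP BY country_name
    ORDER BY "count" DESC, country_name
"""
records = [dict(row) for row in db.execute(query)]
db.close()

print(records)
```

Loading the rows into the database is itself a pass over the data, so the SQL route mostly helps when the data already lives in a database.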

EDIT: Another option is to use a server that is faster than your local computer, for example the free Google Colab.

Answer 2

Counting with collections.defaultdict is usually very fast:

from collections import defaultdict
from pprint import pprint

countries = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

counts = defaultdict(int)
for country in countries:
    counts[country["country_name"]] += 1

pprint([{"country_name": k, "count": v} for k, v in counts.items()])

Output:

[{'count': 2, 'country_name': 'US'},
 {'count': 3, 'country_name': 'GERMANY'},
 {'count': 5, 'country_name': 'ITALY'}]
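
Since neither answer includes measurements, here is a rough way to time the two pure-Python approaches yourself. It is only a sketch: the 200,000-row list is synthetic, the function names `with_counter` and `with_defaultdict` are made up for this example, and the timings will differ between machines and Python versions.

```python
import random
import timeit
from collections import Counter, defaultdict
from operator import itemgetter

# Synthetic data (assumption): 200,000 rows drawn from three country names
random.seed(0)
NAMES = ["US", "GERMANY", "ITALY"]
rows = [{"country_name": random.choice(NAMES)} for _ in range(200_000)]


def with_counter(rows):
    # Counter fed by map/itemgetter, so there is no explicit Python-level loop
    return dict(Counter(map(itemgetter("country_name"), rows)))


def with_defaultdict(rows):
    # Plain loop over the rows with a defaultdict(int) accumulator
    counts = defaultdict(int)
    for row in rows:
        counts[row["country_name"]] += 1
    return dict(counts)


results = {}
for func in (with_counter, with_defaultdict):
    results[func.__name__] = func(rows)
    # Best of 3 repeats, each repeat calling the function 5 times
    seconds = min(timeit.repeat(lambda: func(rows), number=5, repeat=3))
    print(f"{func.__name__}: {seconds:.3f} s for 5 runs")
```

Both functions should return the same counts, so only the timings differ. Run it on data shaped like your own before picking one.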

Answer 1: score 3, accepted, 2020-05-22 14:03:39

Answer 2: score 0, 2020-05-22 14:25:51