Which is the fastest way to count elements in an array?

Question

I have a massive list that contains the same country names many times. I would like to create a new, smaller list that holds the number of times each country appears in the bigger list.

[
      {
          "country_name": "US"
      },
      {
          "country_name": "US"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      }
]

What is the fastest way to generate a new list from this data? I can do it with a for loop, but it is way too slow. Any alternatives? I would like something like this:

[
      {
          "country_name": "US",
          "count": 2
      },
      {
          "country_name": "GERMANY",
          "count": 3
      },
      {
          "country_name": "ITALY",
          "count": 4
      }
]

Answer 1

I don't have a big file to check the speed and compare which approach is faster.

You can try collections.Counter, but it still needs a for loop to get the values out of the dictionaries.

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import collections

c = collections.Counter([x["country_name"] for x in data])

print(c)

Result:

Counter({'ITALY': 5, 'GERMANY': 3, 'US': 2})
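
The question asks for a list of dictionaries rather than a Counter. A minimal sketch of that conversion, assuming the same data shape as above (the shortened sample and the name `result` are only for illustration):

```python
from collections import Counter

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

# A generator expression avoids building an intermediate list.
c = Counter(item["country_name"] for item in data)

# most_common() yields (name, count) pairs, highest count first.
result = [{"country_name": name, "count": n} for name, n in c.most_common()]

print(result)
```

If the original order of first appearance matters more than frequency, iterating over `c.items()` instead of `c.most_common()` should keep it on Python 3.7+.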

You could also convert the data to a pandas.DataFrame (or read it from a file or database directly into a DataFrame) and do all the calculations with pandas, which relies on fast compiled code (C/Cython).

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import pandas as pd

df = pd.DataFrame(data)
#print(df)

print( df['country_name'].value_counts() )

Result:

ITALY      5
GERMANY    3
US         2
Name: country_name, dtype: int64

Another option is to keep the data in a database and use SQL for this. A database engine is usually faster than a Python loop.

SELECT country_name, COUNT(*) FROM data GROUP BY country_name

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import sqlite3


def init(db, data):
    cur = db.cursor()

    cur.execute('DROP TABLE IF EXISTS data')

    # TEXT is SQLite's string type; a column declared STRING gets NUMERIC affinity
    cur.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY AUTOINCREMENT, country_name TEXT)')

    for item in data:
        cur.execute('INSERT INTO data (country_name) VALUES (?)', (item['country_name'], ))

    db.commit()

    cur.close()

# ---

db = sqlite3.connect('data.db')

init(db, data)  # use it only once

cur = db.cursor()

#result = cur.execute('SELECT * FROM data')
result = cur.execute('SELECT country_name, COUNT(*) FROM data GROUP BY country_name')
for row in result:
    print('row:', row)

cur.close()

Result:

row: ('GERMANY', 3)
row: ('ITALY', 5)
row: ('US', 2)
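
If the data does not already live in a database, loading it is part of the cost. A rough sketch of a lighter variant, assuming an in-memory SQLite database and `executemany()` in place of one `execute()` call per row (the shortened sample and the name `result` are illustrative, and this has not been timed against the version above):

```python
import sqlite3

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
]

# ":memory:" keeps the database in RAM; nothing is written to disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, country_name TEXT)")

# executemany() takes an iterable of parameter tuples and inserts them all.
db.executemany(
    "INSERT INTO data (country_name) VALUES (?)",
    ((item["country_name"],) for item in data),
)
db.commit()

rows = db.execute(
    "SELECT country_name, COUNT(*) FROM data "
    "GROUP BY country_name ORDER BY country_name"
)

# Build the list of dictionaries the question asks for, then close the connection.
result = [{"country_name": name, "count": n} for name, n in rows]
db.close()

print(result)
```

The `ORDER BY` clause is only there to make the output order predictable; drop it if order does not matter.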

EDIT: Another option is to run the code on a server that is faster than your local computer, for example the free Google Colab.

Answer 2

collections.defaultdict is usually quite fast for counting:

from collections import defaultdict
from pprint import pprint

countries = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

counts = defaultdict(int)
for country in countries:
    counts[country["country_name"]] += 1

pprint([{"country_name": k, "count": v} for k, v in counts.items()])

Output:

[{'count': 2, 'country_name': 'US'},
 {'count': 3, 'country_name': 'GERMANY'},
 {'count': 5, 'country_name': 'ITALY'}]
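
Since neither answer includes measurements, one way to compare the pure-Python approaches is to time them on synthetic data. This is only a sketch: the list size, the random seed, and the function names are assumptions made for illustration, and the timings will vary by machine and Python version.

```python
import random
import timeit
from collections import Counter, defaultdict

random.seed(0)  # reproducible synthetic data
names = ["US", "GERMANY", "ITALY"]
big = [{"country_name": random.choice(names)} for _ in range(200_000)]


def with_plain_dict(items):
    # Plain for loop with dict.get(), as a baseline.
    counts = {}
    for item in items:
        name = item["country_name"]
        counts[name] = counts.get(name, 0) + 1
    return counts


def with_defaultdict(items):
    counts = defaultdict(int)
    for item in items:
        counts[item["country_name"]] += 1
    return dict(counts)


def with_counter(items):
    return dict(Counter(item["country_name"] for item in items))


# All three must agree before the timings mean anything.
assert with_plain_dict(big) == with_defaultdict(big) == with_counter(big)

for fn in (with_plain_dict, with_defaultdict, with_counter):
    seconds = timeit.timeit(lambda: fn(big), number=5)
    print(f"{fn.__name__}: {seconds:.3f}s for 5 runs")
```

Swap `big` for your real list to see which variant wins on your data.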


Answer 1 (score 3, accepted): 2020-05-22 14:03:39

Answer 2 (score 0): 2020-05-22 14:25:51
