Which is the fastest way to count elements in an array?

I have a massive list in which each country_name appears many times. I would like to create a new, smaller list that holds a count of how many times each country appears in the bigger list.

[
      {
          "country_name": "US"
      },
      {
          "country_name": "US"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      }
]

What is the fastest way to generate a new list from this data? I can do it with a for loop, but it is way too slow. Are there any alternatives? I would like something like this:

[
      {
          "country_name": "US",
          "count": 2
      },
      {
          "country_name": "GERMANY",
          "count": 3
      },
      {
          "country_name": "ITALY",
          "count": 4
      }
]

I don't have a big file to check the speed and compare which approach is faster.


You can try collections.Counter, but it still needs a for loop to get the values out of the dictionaries.

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import collections

c = collections.Counter([x["country_name"] for x in data])

print(c)

Result:

Counter({'ITALY': 5, 'GERMANY': 3, 'US': 2})
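
The Counter above is not yet in the list-of-dictionaries shape the question asks for. A minimal sketch of that conversion, assuming the same records as in `data`; the helper name `count_countries` is made up for illustration:

```python
from collections import Counter

# Same records as the data list above, built compactly
data = [
    {"country_name": name}
    for name in ["US"] * 2 + ["GERMANY"] * 3 + ["ITALY"] * 5
]


def count_countries(records):
    """Return one {"country_name", "count"} dict per distinct country."""
    # A generator expression avoids building a temporary list of names
    counts = Counter(item["country_name"] for item in records)
    # Counter is a dict subclass, so items() keeps first-seen order (Python 3.7+)
    return [{"country_name": name, "count": n} for name, n in counts.items()]


result = count_countries(data)
print(result)
```

If the countries should be ordered by frequency instead, `counts.most_common()` can replace `counts.items()` in the comprehension.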

You could also convert the data to a pandas.DataFrame (or read it from a file or database directly into a DataFrame) and do all the "calculations" with pandas, which runs them in compiled code instead of a Python loop.

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import pandas as pd

df = pd.DataFrame(data)
#print(df)

print( df['country_name'].value_counts() )

Result:

ITALY      5
GERMANY    3
US         2
Name: country_name, dtype: int64

Another method is to keep the data in a database and use SQL for this. A database engine is usually faster than a Python loop.

SELECT country_name, COUNT(*) FROM data GROUP BY country_name

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import sqlite3


def init(db, data):
    cur = db.cursor()

    cur.execute('DROP TABLE IF EXISTS data')

    # TEXT is SQLite's string type; a column declared STRING gets NUMERIC affinity
    cur.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY AUTOINCREMENT, country_name TEXT)')

    for item in data:
        cur.execute('INSERT INTO data (country_name) VALUES (?)', (item['country_name'], ))

    db.commit()

    cur.close()

# ---

db = sqlite3.connect('data.db')

init(db, data)  # use it only once

cur = db.cursor()

#result = cur.execute('SELECT * FROM data')
result = cur.execute('SELECT country_name, COUNT(*) FROM data GROUP BY country_name')
for row in result:
    print('row:', row)

cur.close()

Result:

row: ('GERMANY', 3)
row: ('ITALY', 5)
row: ('US', 2)
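
The same query can also be run against an in-memory database, with the rows loaded in a single `executemany` call. This is only a sketch: the function name `count_with_sqlite` is made up, and whether SQL beats counting in Python probably depends on the data already living in the database, since loading it from a Python list has its own cost.

```python
import sqlite3

# Same records as the data list above, built compactly
data = [
    {"country_name": name}
    for name in ["US"] * 2 + ["GERMANY"] * 3 + ["ITALY"] * 5
]


def count_with_sqlite(records):
    """Count countries with GROUP BY and return a list of dicts."""
    # ":memory:" keeps the database in RAM, so no data.db file is created
    db = sqlite3.connect(":memory:")
    try:
        db.execute("CREATE TABLE data (country_name TEXT)")
        # executemany inserts every row in one call instead of one execute() per row
        db.executemany(
            "INSERT INTO data (country_name) VALUES (?)",
            ((item["country_name"],) for item in records),
        )
        rows = db.execute(
            "SELECT country_name, COUNT(*) FROM data "
            "GROUP BY country_name ORDER BY country_name"
        ).fetchall()
    finally:
        db.close()
    return [{"country_name": name, "count": n} for name, n in rows]


result = count_with_sqlite(data)
print(result)
```

The `ORDER BY` clause is added here so the row order is deterministic; without it, SQLite is free to return the groups in any order.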

EDIT: Another option is to run the code on a server that is faster than your local computer, for example the free Google Colab.

collections.defaultdict is usually quite fast for counting:

from collections import defaultdict
from pprint import pprint

countries = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

counts = defaultdict(int)
for country in countries:
    counts[country["country_name"]] += 1

pprint([{"country_name": k, "count": v} for k, v in counts.items()])

Output:

[{'count': 2, 'country_name': 'US'},
 {'count': 3, 'country_name': 'GERMANY'},
 {'count': 5, 'country_name': 'ITALY'}]
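
None of the approaches in this thread were timed on large input. One rough way to compare them without a real data file is to generate synthetic records and use `timeit`. The sketch below assumes 200,000 records drawn from five made-up country names; the function names are hypothetical, and the timings will vary by machine and Python version, so no results are quoted here.

```python
import random
import timeit
from collections import Counter, defaultdict

# Synthetic input: the names and the record count are arbitrary assumptions
NAMES = ["US", "GERMANY", "ITALY", "FRANCE", "SPAIN"]
random.seed(0)
records = [{"country_name": random.choice(NAMES)} for _ in range(200_000)]


def with_plain_dict(items):
    counts = {}
    for item in items:
        name = item["country_name"]
        counts[name] = counts.get(name, 0) + 1
    return counts


def with_defaultdict(items):
    counts = defaultdict(int)
    for item in items:
        counts[item["country_name"]] += 1
    return dict(counts)


def with_counter(items):
    return dict(Counter(item["country_name"] for item in items))


# The three functions must agree before their timings are worth comparing
assert with_plain_dict(records) == with_defaultdict(records) == with_counter(records)

timings = {}
for func in (with_plain_dict, with_defaultdict, with_counter):
    # Best of three single runs, in seconds
    timings[func.__name__] = min(
        timeit.repeat(lambda: func(records), number=1, repeat=3)
    )
    print(f"{func.__name__}: {timings[func.__name__]:.4f} s")
```

Taking the minimum of several runs reduces the influence of other processes on the measurement.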
