
What is the fastest way to count elements in an array?

I have a massive list in which each country_name appears many times. I would like to create a new, smaller list that holds, for each country, the number of times it occurs in the bigger list.

[
      {
          "country_name": "US"
      },
      {
          "country_name": "US"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "GERMANY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      },
      {
          "country_name": "ITALY"
      }
]

What is the fastest way to generate a new list from this data? I can do it with a for loop, but it is far too slow. Are there any alternatives? I would like something like this:

[
      {
          "country_name": "US",
          "count": 2
      },
      {
          "country_name": "GERMANY",
          "count": 3
      },
      {
          "country_name": "ITALY",
          "count": 4
      }
]

I don't have a big file to check the speed and compare which approach is faster.


You can try collections.Counter, but it still needs a for loop (here a list comprehension) to get the values out of the dictionaries.

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import collections

c = collections.Counter([x["country_name"] for x in data])

print(c)

Result:

Counter({'ITALY': 5, 'GERMANY': 3, 'US': 2})
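
The question asks for a list of dictionaries rather than a Counter, so the counts still need one more conversion step. A minimal sketch of that step, assuming the same data as above (the names counts and result are only illustrative):

```python
from collections import Counter

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

# Counter accepts any iterable, so a generator avoids building a temporary list.
counts = Counter(item["country_name"] for item in data)

# Keys keep first-seen order (Python 3.7+), so the output follows the
# order in which each country first appears in the input.
result = [{"country_name": name, "count": n} for name, n in counts.items()]

print(result)
```

If the list should be sorted by frequency instead, counts.most_common() can replace counts.items() in the comprehension.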

You could also convert the list to a pandas.DataFrame (or read from a file/database directly into a DataFrame) and do all the "calculations" with pandas, which runs them in fast compiled code (C/C++/Fortran) instead of a Python loop.

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import pandas as pd

df = pd.DataFrame(data)
#print(df)

print( df['country_name'].value_counts() )

Result:

ITALY      5
GERMANY    3
US         2
Name: country_name, dtype: int64

Another method is to keep the data in a database and use SQL for this; a database engine is usually faster than a Python loop.

SELECT country_name, COUNT(*) FROM data GROUP BY country_name

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

import sqlite3


def init(db, data):
    cur = db.cursor()

    cur.execute('DROP TABLE IF EXISTS data')

    cur.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY AUTOINCREMENT, country_name TEXT)')

    for item in data:
        cur.execute('INSERT INTO data (country_name) VALUES (?)', (item['country_name'], ))

    db.commit()

    cur.close()

# ---

db = sqlite3.connect('data.db')

init(db, data)  # use it only once

cur = db.cursor()

#result = cur.execute('SELECT * FROM data')
result = cur.execute('SELECT country_name, COUNT(*) FROM data GROUP BY country_name')
for row in result:
    print('row:', row)

cur.close()

Result:

row: ('GERMANY', 3)
row: ('ITALY', 5)
row: ('US', 2)
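
The insert loop above calls execute() once per row, which can dominate the run time on a massive list. A rough sketch of a variant that loads everything with executemany() and returns the list of dictionaries from the question; the in-memory database, the column alias n and the ORDER BY clause are assumptions made for this example, not part of the code above:

```python
import sqlite3

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

# ":memory:" keeps the database in RAM, so nothing is written to disk.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, country_name TEXT)")

# executemany() inserts all rows in one call instead of one execute() per row.
db.executemany(
    "INSERT INTO data (country_name) VALUES (?)",
    ((item["country_name"],) for item in data),
)
db.commit()

# The database does the grouping; the most frequent country comes first.
rows = db.execute(
    "SELECT country_name, COUNT(*) AS n FROM data "
    "GROUP BY country_name ORDER BY n DESC, country_name"
).fetchall()
db.close()

result = [{"country_name": name, "count": n} for name, n in rows]
print(result)
```

Loading the rows into the database still costs time, so this mainly pays off when the data already lives in a database or is queried more than once.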

EDIT: Another option is to run the code on a server that is faster than the local computer, for example the free Google Colab.

collections.defaultdict is usually quite fast for counting:

from collections import defaultdict
from pprint import pprint

countries = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

counts = defaultdict(int)
for country in countries:
    counts[country["country_name"]] += 1

pprint([{"country_name": k, "count": v} for k, v in counts.items()])

Output:

[{'count': 2, 'country_name': 'US'},
 {'count': 3, 'country_name': 'GERMANY'},
 {'count': 5, 'country_name': 'ITALY'}]
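
Since no large file was available to compare the approaches, one option is to time them on generated data. A rough sketch using timeit; the country names, the list size of 200,000 and the helper names with_counter and with_defaultdict are arbitrary choices for this example, and the timings will differ from machine to machine:

```python
import random
import timeit
from collections import Counter, defaultdict

# Synthetic input: the names and the size are arbitrary choices for this test.
random.seed(0)
names = ["US", "GERMANY", "ITALY", "FRANCE", "SPAIN"]
data = [{"country_name": random.choice(names)} for _ in range(200_000)]


def with_counter(items):
    """Count with collections.Counter fed by a generator."""
    return Counter(item["country_name"] for item in items)


def with_defaultdict(items):
    """Count with an explicit loop over a defaultdict(int)."""
    counts = defaultdict(int)
    for item in items:
        counts[item["country_name"]] += 1
    return counts


# Both approaches must agree before their timings are worth comparing.
assert dict(with_counter(data)) == dict(with_defaultdict(data))

runs = 5
for func in (with_counter, with_defaultdict):
    seconds = timeit.timeit(lambda: func(data), number=runs)
    print(f"{func.__name__}: {seconds / runs:.4f} s per run")
```

Running the same script on a sample of the real list should give a better picture than the synthetic data.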


 