Which is the fastest way to count elements in an array?
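I have a large list of country names in which each name appears many times. I want to create a new, smaller list that holds the number of times each country appears in the larger list.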
[
{
"country_name": "US"
},
{
"country_name": "US"
},
{
"country_name": "GERMANY"
},
{
"country_name": "GERMANY"
},
{
"country_name": "GERMANY"
},
{
"country_name": "ITALY"
},
{
"country_name": "ITALY"
},
{
"country_name": "ITALY"
},
{
"country_name": "ITALY"
}
]
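What is the fastest way to generate a new list from this data? I can do it with a for loop, but it is too slow. Are there any other options? I want something like this: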
[
{
"country_name": "US",
"count": 2
},
{
"country_name": "GERMANY",
"count": 3
},
{
"country_name": "ITALY",
"count": 4
}
]
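I don't have a large file to test the speed and compare which approach is faster.
You can try collections.Counter, but it still needs a for-loop to pull the values out of the dictionaries.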
data = [
{"country_name": "US"},
{"country_name": "US"},
{"country_name": "GERMANY"},
{"country_name": "GERMANY"},
{"country_name": "GERMANY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
]
import collections
c = collections.Counter([x["country_name"] for x in data])
print(c)
Result:
Counter({'ITALY': 5, 'GERMANY': 3, 'US': 2})
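The Counter above is not yet in the list-of-dicts shape the question asks for. A minimal sketch of one way to convert it, reusing the same sample data; the name `result` is only illustrative and not part of the original answer:

```python
from collections import Counter

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

# A generator expression avoids building a temporary list of names.
c = Counter(item["country_name"] for item in data)

# Counter is a dict subclass, so items() yields names in first-seen order
# (Python 3.7+).
result = [{"country_name": name, "count": n} for name, n in c.items()]
print(result)
```

If the output should be sorted by descending count instead, iterating over `c.most_common()` in place of `c.items()` should do that.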
You could also convert the data to a pandas.DataFrame (or read it from a file or database directly into a DataFrame) and do all the "calculations" with pandas, which uses fast code written in C/C++/Fortran.
data = [
{"country_name": "US"},
{"country_name": "US"},
{"country_name": "GERMANY"},
{"country_name": "GERMANY"},
{"country_name": "GERMANY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
]
import pandas as pd
df = pd.DataFrame(data)
# Count how often each value occurs in the column
print(df['country_name'].value_counts())
Result:
ITALY 5
GERMANY 3
US 2
Name: country_name, dtype: int64
Another method is to keep the data in a database and use SQL for this. A database engine is usually faster than a loop in Python.
SELECT country_name, COUNT(*) FROM data GROUP BY country_name
data = [
{"country_name": "US"},
{"country_name": "US"},
{"country_name": "GERMANY"},
{"country_name": "GERMANY"},
{"country_name": "GERMANY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
]
import sqlite3
def init(db, data):
    # Recreate the table and load the sample rows
    cur = db.cursor()
    cur.execute('DROP TABLE IF EXISTS data')
    # TEXT is used here because SQLite gives a STRING column numeric affinity
    cur.execute('CREATE TABLE IF NOT EXISTS data (id INTEGER PRIMARY KEY AUTOINCREMENT, country_name TEXT)')
    for item in data:
        cur.execute('INSERT INTO data (country_name) VALUES (?)', (item['country_name'], ))
    db.commit()
    cur.close()
# ---
db = sqlite3.connect('data.db')
init(db, data)  # use it only once
cur = db.cursor()
#result = cur.execute('SELECT * FROM data')
result = cur.execute('SELECT country_name, COUNT(*) FROM data GROUP BY country_name')
for row in result:
    print('row:', row)
cur.close()
db.close()
Result:
row: ('GERMANY', 3)
row: ('ITALY', 5)
row: ('US', 2)
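For a quick experiment, the same idea can be tried without creating a file on disk. The sketch below is a variation on the answer's code, not the answer's own method: it assumes an in-memory SQLite database, loads the rows with `executemany`, and adds an `ORDER BY` so the row order is deterministic. The variable names `rows` and `result` are illustrative.

```python
import sqlite3

data = [
    {"country_name": "US"},
    {"country_name": "US"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "GERMANY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
    {"country_name": "ITALY"},
]

# ":memory:" keeps the whole database in RAM, so no file is created.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE data (country_name TEXT)")

# executemany runs the same INSERT once per parameter tuple.
db.executemany(
    "INSERT INTO data (country_name) VALUES (?)",
    ((item["country_name"],) for item in data),
)
db.commit()

rows = db.execute(
    "SELECT country_name, COUNT(*) FROM data "
    "GROUP BY country_name ORDER BY country_name"
).fetchall()
db.close()

# Reshape the (name, count) tuples into the format the question asks for.
result = [{"country_name": name, "count": n} for name, n in rows]
print(result)
```

Note that loading the rows into the database is itself a pass over the data, so this approach mainly pays off when the data already lives in a database.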
EDIT: Another method is to use a server that is faster than your local computer, for example the free Google Colab.
Counting with collections.defaultdict is usually very fast:
from collections import defaultdict
from pprint import pprint
countries = [
{"country_name": "US"},
{"country_name": "US"},
{"country_name": "GERMANY"},
{"country_name": "GERMANY"},
{"country_name": "GERMANY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
{"country_name": "ITALY"},
]
counts = defaultdict(int)
for country in countries:
    counts[country["country_name"]] += 1
pprint([{"country_name": k, "count": v} for k, v in counts.items()])
Output:
[{'count': 2, 'country_name': 'US'},
{'count': 3, 'country_name': 'GERMANY'},
{'count': 5, 'country_name': 'ITALY'}]
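Neither answer measures the approaches against each other, and the first one notes that no large file was available. A rough way to compare them is to generate synthetic data and time each function with `timeit`. This is only a sketch: the list of names, the data size, and the function names are all made up for illustration, and the timings will vary by machine and Python version.

```python
import random
import timeit
from collections import Counter, defaultdict

# Synthetic data: 200,000 records drawn from a few hypothetical names.
random.seed(0)
names = ["US", "GERMANY", "ITALY", "FRANCE", "SPAIN"]
big = [{"country_name": random.choice(names)} for _ in range(200_000)]

def with_counter(data):
    return Counter(item["country_name"] for item in data)

def with_defaultdict(data):
    counts = defaultdict(int)
    for item in data:
        counts[item["country_name"]] += 1
    return counts

def with_dict_get(data):
    counts = {}
    for item in data:
        name = item["country_name"]
        counts[name] = counts.get(name, 0) + 1
    return counts

# All three approaches must agree before their timings are compared.
assert dict(with_counter(big)) == dict(with_defaultdict(big)) == with_dict_get(big)

for func in (with_counter, with_defaultdict, with_dict_get):
    # Take the best of three single runs to reduce noise.
    seconds = min(timeit.repeat(lambda: func(big), number=1, repeat=3))
    print(f"{func.__name__}: {seconds:.3f} s")
```

All three variants make a single pass over the data, so any difference between them is a constant factor rather than a change in complexity.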