Most efficient way to group, count, then sort?

The data has two columns, City and People. I need to group by city and sum the people counts.

Table looks something like this (times a million):

City, People
Boston, 1000
Boston, 2000
New York, 2500
Chicago, 2000

In this case Boston would be number 1 with 3000 people. I need to return the top 5% of cities and their summed people counts.

What is the most efficient way to do this? Can pandas scale this up well? Should I keep track of the top 5% or do a sort at the end?

If you would prefer to use Python without external libraries, you could do the following. First, open the file with the csv module. Then use the built-in sorted function with a custom key (here, the numeric value of the second column). Finally, slice off the portion you want with [].

import csv, math

out = []
with open("data.csv", "r") as fi:
    inCsv = csv.reader(fi, delimiter=',')
    for row in inCsv:
        # Strip the whitespace around each field ("Boston, 1000" -> ["Boston", "1000"]).
        out.append([col.strip() for col in row])

# Skip the header row, sort numerically by the People column (descending),
# and slice off the top 5% of rows.
print(sorted(out[1:], key=lambda a: int(a[1]), reverse=True)[:int(math.ceil((len(out) - 1) * .05))])
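
Note that the snippet above ranks individual rows rather than per-city totals. A minimal sketch of adding the group-by-sum step first, still with only the standard library (the defaultdict accumulator is my addition, not part of the original answer):

import csv, math
from collections import defaultdict

# Sum the People column per city before ranking.
totals = defaultdict(int)
with open("data.csv", "r") as fi:
    rows = csv.reader(fi, delimiter=',')
    next(rows)  # skip the header row
    for city, people in rows:
        totals[city.strip()] += int(people.strip())

# Sort cities by their summed counts (descending) and keep the top 5%.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:int(math.ceil(len(ranked) * .05))])
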
Alternatively, with pandas:

  • groupby to get sums
  • rank to get percentiles

import pandas as pd

df = pd.read_csv("data.csv", skipinitialspace=True)  # strip the space after each comma
d1 = df.groupby('City').People.sum()                 # total People per city
d1.loc[d1.rank(pct=True) >= .95]                     # keep cities whose percentile rank is >= .95

City
Boston    3000
Name: People, dtype: int64
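
The rank filter keeps the top 5% but does not return them sorted. If you would rather do one sort at the end (the other option raised in the question), a minimal sketch reusing the same d1 Series:

# Sort the per-city sums descending, then take the top 5% of cities.
top_n = max(1, int(len(d1) * 0.05))  # number of cities that make up the top 5%
d1.sort_values(ascending=False).head(top_n)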
