My data has two columns, City and People, and I need to group by city and sum the people counts.
The table looks something like this (times a million rows):
City, People
Boston, 1000
Boston, 2000
New York, 2500
Chicago, 2000
In this case, Boston would be number 1 with 3000 people. I need to return the top 5% of cities along with their summed people counts.
What is the most efficient way to do this? Can pandas scale this up well? Should I keep track of the top 5% or do a sort at the end?
If you would prefer to use Python without external libraries, you could do as follows. First, open the file with the csv module and accumulate the sum for each city in a dict. Then use the built-in sorted function with a custom key (the numeric total per city) and take the top 5% with a slice.

import csv, math
from collections import defaultdict

sums = defaultdict(int)
with open("data.csv", "r", newline="") as fi:
    in_csv = csv.reader(fi, delimiter=',')
    next(in_csv)  # skip the header row
    for row in in_csv:
        city, people = (col.strip() for col in row)
        sums[city] += int(people)  # convert so we compare numbers, not strings

out = sorted(sums.items(), key=lambda a: a[1], reverse=True)
print(out[:int(math.ceil(len(out) * .05))])
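On the question of keeping track of the top 5% versus sorting at the end: once the per-city sums exist, heapq.nlargest can pick the top k in O(n log k) instead of a full O(n log n) sort. A minimal sketch, assuming the same City/People rows as above (the top_cities function name and the inline sample data are illustrative, not from the original):

```python
import heapq, math
from collections import defaultdict

def top_cities(rows, frac=0.05):
    """Sum People per City, then return the top `frac` fraction
    of cities by total, largest first, without a full sort."""
    totals = defaultdict(int)
    for city, people in rows:
        totals[city] += int(people)
    # always return at least one city, even for tiny inputs
    k = max(1, math.ceil(len(totals) * frac))
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

rows = [("Boston", "1000"), ("Boston", "2000"),
        ("New York", "2500"), ("Chicago", "2000")]
print(top_cities(rows))  # Boston wins with 3000
```

For a few million rows the dominant cost is the summing pass either way; the heap only helps on the selection step, so measure before assuming it matters.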
Use groupby to get the sums and rank to get percentiles:

import pandas as pd

df = pd.read_csv("data.csv", skipinitialspace=True)
d1 = df.groupby('City').People.sum()
d1.loc[d1.rank(pct=True) >= .95]
City
Boston 3000
Name: People, dtype: int64
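As for whether pandas scales to millions of rows: the groupby itself is fast, and if the raw file is too large to hold in memory you can read it in chunks and combine partial sums. A sketch of that pattern, using an inline CSV string and a tiny chunksize as illustrative stand-ins for the real file and a realistic chunk size:

```python
import io
import pandas as pd

# Stand-in for the real file; in practice pass the filename to read_csv.
csv_text = "City, People\nBoston, 1000\nBoston, 2000\nNew York, 2500\nChicago, 2000\n"

# Sum within each chunk, then sum the partial results by city.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, chunksize=2):
    partials.append(chunk.groupby('City').People.sum())
d1 = pd.concat(partials).groupby(level=0).sum()

print(d1.loc[d1.rank(pct=True) >= .95])
```

Each chunk's partial sums are small (one row per distinct city), so only the aggregated result has to fit in memory, not the raw data.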