My data has two columns, City and People, and I need to group by city and sum the people counts.
The table looks something like this (times a million rows):
City, People
Boston, 1000
Boston, 2000
New York, 2500
Chicago, 2000
In this case, Boston would be number 1 with 3000 people. I need to return the top 5% of cities along with their summed people counts.
What is the most efficient way to do this? Can pandas scale this up well? Should I keep track of the top 5% or do a sort at the end?
If you would prefer to use Python without external libraries, you could do as follows. First, open the file with the csv module and accumulate the sum for each city in a dict. Then use the built-in sorted function with a custom key (the numeric total per city) and take the top 5% with a slice.

import csv, math
from collections import defaultdict

sums = defaultdict(int)
with open("data.csv", "r", newline="") as fi:
    in_csv = csv.reader(fi, delimiter=',')
    next(in_csv)  # skip the header row
    for row in in_csv:
        city, people = (col.strip() for col in row)
        sums[city] += int(people)  # convert so we compare numbers, not strings

out = sorted(sums.items(), key=lambda a: a[1], reverse=True)
print(out[:int(math.ceil(len(out) * .05))])
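On the question of keeping track of the top 5% versus sorting at the end: once the per-city sums exist, heapq.nlargest can pick the top k in O(n log k) instead of a full O(n log n) sort. A minimal sketch, assuming the same City/People rows as above (the top_cities function name and the inline sample data are illustrative, not from the original):

```python
import heapq, math
from collections import defaultdict

def top_cities(rows, frac=0.05):
    """Sum People per City, then return the top `frac` fraction
    of cities by total, largest first, without a full sort."""
    totals = defaultdict(int)
    for city, people in rows:
        totals[city] += int(people)
    # always return at least one city, even for tiny inputs
    k = max(1, math.ceil(len(totals) * frac))
    return heapq.nlargest(k, totals.items(), key=lambda kv: kv[1])

rows = [("Boston", "1000"), ("Boston", "2000"),
        ("New York", "2500"), ("Chicago", "2000")]
print(top_cities(rows))  # Boston wins with 3000
```

For a few million rows the dominant cost is the summing pass either way; the heap only helps on the selection step, so measure before assuming it matters.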
Use groupby to get the sums and rank to get percentiles:

import pandas as pd

df = pd.read_csv("data.csv", skipinitialspace=True)
d1 = df.groupby('City').People.sum()
d1.loc[d1.rank(pct=True) >= .95]
City
Boston 3000
Name: People, dtype: int64
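As for whether pandas scales to millions of rows: the groupby itself is fast, and if the raw file is too large to hold in memory you can read it in chunks and combine partial sums. A sketch of that pattern, using an inline CSV string and a tiny chunksize as illustrative stand-ins for the real file and a realistic chunk size:

```python
import io
import pandas as pd

# Stand-in for the real file; in practice pass the filename to read_csv.
csv_text = "City, People\nBoston, 1000\nBoston, 2000\nNew York, 2500\nChicago, 2000\n"

# Sum within each chunk, then sum the partial results by city.
partials = []
for chunk in pd.read_csv(io.StringIO(csv_text), skipinitialspace=True, chunksize=2):
    partials.append(chunk.groupby('City').People.sum())
d1 = pd.concat(partials).groupby(level=0).sum()

print(d1.loc[d1.rank(pct=True) >= .95])
```

Each chunk's partial sums are small (one row per distinct city), so only the aggregated result has to fit in memory, not the raw data.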