简体   繁体   中英

Selecting the top 50 % percentage names from the columns of a pandas dataframe

I have a pandas dataframe that looks like this. The rows and the columns have the same name.

name a  b  c  d  e  f  g 
 a   10 5  4  8  5  6  4
 b   5  10 6  5  4  3  3
 c   -  4  9  3  6  5  7
 d   6  9  8  6  6  8  2
 e   8  5  4  4  14 9  6
 f   3  3  -  4  5  14 7
 g   4  5  8  9  6  7  10

I can get the 5 number of largest values by passing df['column_name'].nlargest(n=5) but if I have to return 50 % of the largest in descending order, is there anything that is inbuilt in pandas of it I have to write a function for it, how can I get them? I am quite new to python. Please help me out.

UPDATE : So let's take column a into consideration and it has values like 10, 5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of them. so the total in this case is 36. 50% of these values would be 18. So from column a, I want to select 10 and 8 only. Similarly I want to go through all the other columns and select 50%.

Sorting is flexible :)

df.sort_values('column_name',ascending=False).head(int(df.shape[0]*.5))

Update: frac argument is available only on .sample(), not in .head or .tail. df.sample(frac=.5) does give 50% but head and tail expects only int. df.head(frac=.5) fails with TypeError: head() got an unexpected keyword argument 'frac'

Note: on int() vs round()

int(3.X) == 3 # True Where 0 >= X >=9 
round(3.45) == 3 # True
round(3.5) == 4 # True

So when doing .head(int/round ...) do think of what behaviour fits your need.

Updated: Requirements

So let's take column a into consideration and it has values like 10, 5,-,6,8,3 and 4. I have to sum all of them up and get the top 50% of them. so the total, in this case, is 36. 50% of these values would be 18. So from column a, I want to select 10 and 8 only. Similarly, I want to go through all the other columns and select 50%. -Matt

A silly hack would be to sort, find the cumulative sum, find the middle by dividing it with the sum total and then use that to select part of your sorted column. eg

import pandas as pd

data = pd.read_csv(
pd.compat.StringIO("""name a b c d e f g 
a 10 5 4 8 5 6 4
b 5 10 6 5 4 3 3
c - 4 9 3 6 5 7
d 6 9 8 6 6 8 2
e 8 5 4 4 14 9 6
f 3 3 - 4 5 14 7
g 4 5 8 9 6 7 10"""), 
sep=' ', index_col='name'
).dropna(axis=1).apply(
pd.to_numeric, errors='coerce', downcast='signed')

x = data[['a']].sort_values(by='a',ascending=False)[(data[['a']].sort_values(by='a',ascending=False).cumsum()
                                                 /data[['a']].sort_values(by='a',ascending=False).sum())<=.5].dropna()
print(x)

Outcome: 在此处输入图片说明

您可以对数据框进行排序,仅显示90%的数据

df.sort_values('column_name',ascending=False).head(round(0.9*len(df)))

data.csv

name,a,b,c,d,e,f,g
a,10,5,4,8,5,6,4
b,5,10,6,5,4,3,3
c,-,4,9,3,6,5,7
d,6,9,8,6,6,8,2
e,8,5,4,4,14,9,6
f,3,3,-,4,5,14,7
g,4,5,8,9,6,7,10

test.py

#!/bin/python

import pandas as pd

def percentageOfList(l, p):
    return l[0:int(len(l) * p)]

df = pd.read_csv('data.csv')
print(percentageOfList(df.sort_values('b', ascending=False)['b'], 0.9))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM