
How to describe (mean, median, count, etc.) all two-factor column combinations in a matrix with python?

I have a pandas dataframe that looks something like this:

[image of the sample dataframe]

Every value in a given row is either the same number or NaN. I want to calculate the mean, median, and count for every 2-column combination in the dataframe, using only the rows where neither of the two columns is NaN.

So, for instance, the result of the above dataframe would be:

AB: count: 1, mean: 7, median: 7 
AC: count: 2, mean: 9.5, median: 9.5 
BC: count: 2, mean: 9, median: 9

In fact, my dataframe is about 50k rows long, and 40 or so columns wide.

In case you were wondering, this is for work related to the Stack Overflow Developer Survey. Ami Tavory helped me get to this point. Rows are respondents. Columns in this case are programming languages that respondents tell us they use. Values are the respondent's annual salary. I'm trying to determine which programming language combination (a proxy for coding ecosystem, perhaps) pays the best. The results will be published in the next couple of weeks. Our real devs are busy building real things, so I figured I'd take the opportunity to poke you instead. I look forward to your checking my work when we release a full data dump in the next month or so.

You can generate the sample dataframe with this code:

df = pd.DataFrame({'A' : [12,np.nan,np.nan,7],
                   'B' : [np.nan,11,8,7],
                   'C' : [12,11,np.nan,7]})

I tried to make this reasonably scalable for you, hence using lists instead of doing it all in pandas. The only good way I saw for doing this in pandas would require a lot of row-wise operations, which are really slow in pandas. It's fairly easy to add attributes here: add another entry to each row of the list called outarr, and give it a name in the columns list when you create the output dataframe.

import pandas as pd, numpy as np
import itertools
df = pd.DataFrame({'A' : [12,np.nan,np.nan,7],
                   'B' : [np.nan,11,8,7],
                   'C' : [12,11,np.nan,7]})

cols = df.columns.values #Columns from your dataframe
collist = list(itertools.combinations(cols,2)) #All combinations of columns from your df

#Create numpy array for each two-column combo and calculate count, mean, median
outarr = [0]*len(collist)
for ix, coltuple in enumerate(collist):
    a = df[list(coltuple)].dropna().values
    outarr[ix] = [a.shape[0],np.mean(a),np.median(a)]

#Create output dataframe
dfout = pd.DataFrame(outarr,index = collist,columns=['count','mean','median'])
dfout

Out[41]:
        count   mean    median
(A, B)  1       7.0     7.0
(A, C)  2       9.5     9.5
(B, C)  2       9.0     9.0
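To illustrate the remark about adding attributes, here is a minimal sketch that extends the same loop with one extra statistic. The choice of a maximum, the column name 'max', and the guard for pairs with no shared rows are assumptions made for illustration, not part of the original answer.

```python
import itertools

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [12, np.nan, np.nan, 7],
                   'B': [np.nan, 11, 8, 7],
                   'C': [12, 11, np.nan, 7]})

# All unordered two-column combinations
collist = list(itertools.combinations(df.columns, 2))

outarr = []
for coltuple in collist:
    # Rows where both columns of the pair are non-NaN
    a = df[list(coltuple)].dropna().values
    if a.size:
        stats = [a.shape[0], np.mean(a), np.median(a), np.max(a)]
    else:
        # Assumed handling: a pair with no shared rows gets NaN statistics
        stats = [0, np.nan, np.nan, np.nan]
    outarr.append(stats)

# The extra statistic is named here, in the same position as in each row
dfout = pd.DataFrame(outarr, index=collist,
                     columns=['count', 'mean', 'median', 'max'])
```

The guard matters mostly on real data: with 40 or so columns, some pairs may have no respondent in common, and np.max raises an error on an empty array.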

This should work (it works on your sample, but I haven't tested it on a larger dataset):

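ncol = df.shape[1]
# Loop over every unordered pair of column positions (i < j)
for i in range(ncol - 1):
    for j in range(i + 1, ncol):
        # Keep only the rows where both columns are non-NaN
        temp = df.iloc[:, [i, j]].dropna()
        # Both columns hold the same value in every remaining row, so the first one suffices
        print(temp.columns.tolist(), len(temp), temp.iloc[:, 0].mean(), temp.iloc[:, 0].median())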

which for your example gives

['A', 'B'] 1 7.0 7.0
['A', 'C'] 2 9.5 9.5
['B', 'C'] 2 9.0 9.0

You create a new dataframe for each pair of columns, drop every row that contains a NaN, and then compute the basic statistics on that temporary dataframe. There may be a more efficient way to do this, but your dataframe is small enough that this shouldn't be a major problem.
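One possibly more efficient route for the counts and means, sketched here as an assumption rather than as something either answer proposed, is to use matrix products on a presence mask. It relies on the question's premise that all non-NaN values in a row are equal. The names mask, pair_counts, pair_sums and pair_means are hypothetical, and the median still needs a per-pair loop.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [12, np.nan, np.nan, 7],
                   'B': [np.nan, 11, 8, 7],
                   'C': [12, 11, np.nan, 7]})

# 1 where a value is present, 0 where it is NaN
mask = df.notna().astype(int)

# Entry (X, Y) = number of rows where both X and Y are present
pair_counts = mask.T.dot(mask)

# Entry (X, Y) = sum of X over the rows where Y is also present;
# NaNs are filled with 0 so they contribute nothing to the sum
pair_sums = df.fillna(0).T.dot(mask)

# Pairs with no shared rows become NaN instead of dividing by zero
pair_means = pair_sums / pair_counts.where(pair_counts > 0)
```

Both results are square tables indexed by column name on each axis, so pair_counts.loc['A', 'C'] and pair_means.loc['A', 'C'] give the figures for the AC pair; the diagonal holds the single-column counts and means.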
