
How to get the average of all values across three replicate files using Python pandas

I have data in triplicate, and I want to pool all three replicates into one data frame while keeping each value's row and column position. For example, the average of the values at column 2, row 3 across all replicate files should appear in the new data frame at column 2, row 3. A sample of the data and the code I tried are shown below. Any help is highly appreciated. Thanks.

import glob
from subprocess import check_output
import pandas as pd

data = {}
for file in glob.glob('results/*.csv'):
    name = check_output(['basename',file,'.csv']).decode().strip()
    data[name] = pd.read_csv(file, index_col = 0, header = 0)
    data[name].columns = pd.to_numeric(data[name].columns)
    
data['file1_A']

A     B
1.8   1.7
1.3   1.3

data['file1_B']

A     B
1.7   1.4
1.9   1.7

data['file1_C']

A     B
1.2   1.6
2.1   2.9

Expected outcome:

file1

A      B
1.57   1.57
1.77   1.97

i.e.,
A                 B
(1.8+1.7+1.2)/3  (1.7+1.4+1.6)/3
(1.3+1.9+2.1)/3  (1.3+1.7+2.9)/3
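
As a quick check of the arithmetic above, here is a minimal sketch that builds the three sample replicates by hand and averages them cell by cell. The variable names rep_a, rep_b and rep_c are made up for illustration, and the frames are assumed to share the same row labels and column names.

```python
import pandas as pd

# The three sample replicates, typed in by hand (hypothetical variable names)
rep_a = pd.DataFrame({"A": [1.8, 1.3], "B": [1.7, 1.3]})
rep_b = pd.DataFrame({"A": [1.7, 1.9], "B": [1.4, 1.7]})
rep_c = pd.DataFrame({"A": [1.2, 2.1], "B": [1.6, 2.9]})

# Adding DataFrames is element-wise, so each cell keeps its row/column position
expected = (rep_a + rep_b + rep_c) / 3

# Show the per-cell means rounded to two decimals
print(expected.round(2))
```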


# I usually write the following code for a small number of samples

file1 = (data['file1_A']+data['file1_B']+data['file1_C'])/3


# I tried to write a loop for a large number of samples, but it is not quite right.

files = ['file1_', 'file2_', 'file3_']
totals = {}
for f in files:
    replicates ={}
    for sample, df in totals.items():
        if f in sample:
            replicates[sample] = df
            final_df = df/3


Working with multiple matrices is a job for numpy! Its numpy.mean() function can take the mean (the average) across multiple matrices. The trick is to convert each pandas.DataFrame into a numpy array and then convert the result back. Have a look at this example:

import numpy
import pandas
import random
import itertools


# Given that loading the files isn't the problem, I'll create some dummy data here
data = {
    f"file{filenumber}_{filename}": pandas.DataFrame(
        [
            {
                "A": random.random() + random.randint(0, 2),
                "B": random.random() + random.randint(0, 2),
            }
            for _ in range(2)
        ]
    )
    for filenumber, filename in itertools.chain.from_iterable([[(i, l) for l in ["A", "B", "C"]] for i in range(1, 6)])
}


# Loop the files
for filenumber in range(1, 6):

    print(f"Processing files that start with: file{filenumber}_")

    # Convert all files to numpy arrays
    numpy_arrays = [item.to_numpy() for name, item in data.items() if name.startswith(f"file{filenumber}_")]

    # Use numpy to take the mean of each cell, across the frames (mean is the same as summing and dividing by the number of elements)
    means = numpy.mean(numpy_arrays, axis=0)

    # Convert back to a dataframe
    df = pandas.DataFrame(means, columns=data[f"file{filenumber}_A"].columns)

    # Or in a single line
    df = pandas.DataFrame(numpy.mean([item.to_numpy() for name, item in data.items() if name.startswith(f"file{filenumber}_")], axis=0), columns=data[f"file{filenumber}_A"].columns)
    print(df)
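
Note that converting to numpy drops the row labels. If those matter, one possible alternative (a sketch, not part of the answer above) is to stay in pandas: stack the replicates with pandas.concat and average the rows that share an index label with groupby(level=0).mean(). The small data dictionary below is a hand-typed stand-in for the loaded files, and the approach assumes every replicate has unique row labels and the same columns.

```python
import pandas as pd

# Hypothetical stand-in for the loaded files: one sample, three replicates
data = {
    "file1_A": pd.DataFrame({"A": [1.8, 1.3], "B": [1.7, 1.3]}),
    "file1_B": pd.DataFrame({"A": [1.7, 1.9], "B": [1.4, 1.7]}),
    "file1_C": pd.DataFrame({"A": [1.2, 2.1], "B": [1.6, 2.9]}),
}

# Collect the replicates that belong to the same sample
replicates = [df for name, df in data.items() if name.startswith("file1_")]

# Stack them vertically, then average the rows that share an index label
pooled = pd.concat(replicates).groupby(level=0).mean()
print(pooled)
```

groupby sorts the row labels by default, and mean() skips missing values, so a cell that is NaN in one replicate is averaged over the remaining ones rather than becoming NaN.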

It seems the answer is quite easy. Here is a simple loop that worked to get the average matrix of all replicates.

import glob
from subprocess import check_output
import pandas as pd

# load all files into an empty dictionary
data = {}
for file in glob.glob('results/*.csv'):
    name = check_output(['basename', file, '.csv']).decode().strip()
    data[name] = pd.read_csv(file, index_col=0, header=0)
    data[name].columns = pd.to_numeric(data[name].columns)

# loop over the sample prefixes and average the three replicate matrices of each
files = ['file1_', 'file2_', 'file3_']
totals = {}
for f in files:
    df = (data[f + 'A'] + data[f + 'B'] + data[f + 'C']) / 3
    totals[f] = df
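
If the list of prefixes or the number of replicates is not fixed, the same idea can be generalised. The sketch below groups the frames by whatever comes before the last underscore in their name and divides by the actual replicate count. The function name average_replicates and the sample values are made up for illustration, and it assumes all replicates of a sample share the same row labels and columns (otherwise the addition produces NaN where they do not line up).

```python
from collections import defaultdict

import pandas as pd


def average_replicates(frames):
    """Average DataFrames whose names share the text before the last underscore."""
    groups = defaultdict(list)
    for name, df in frames.items():
        prefix = name.rsplit("_", 1)[0]  # "file1_A" -> "file1"
        groups[prefix].append(df)
    # Element-wise sum of each group's frames divided by its replicate count
    return {prefix: sum(dfs) / len(dfs) for prefix, dfs in groups.items()}


# Hypothetical stand-in for the loaded `data` dictionary
data = {
    "file1_A": pd.DataFrame({"A": [1.8, 1.3], "B": [1.7, 1.3]}),
    "file1_B": pd.DataFrame({"A": [1.7, 1.9], "B": [1.4, 1.7]}),
    "file1_C": pd.DataFrame({"A": [1.2, 2.1], "B": [1.6, 2.9]}),
    "file2_A": pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}),
    "file2_B": pd.DataFrame({"A": [3.0, 4.0], "B": [5.0, 6.0]}),
}

totals = average_replicates(data)
print(totals["file1"])
print(totals["file2"])
```

Unlike the loop above, the keys of totals here have no trailing underscore ("file1" rather than "file1_"). For the file names, pathlib.Path(file).stem yields the same name as the basename call without starting a subprocess.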


 