I have data in triplicate and want to pool all three replicates into one data frame, keeping each value's row and column position. For example, the average of the values at column 2, row 3 across all replicate files should appear at column 2, row 3 of the new data frame. A sample of the data and the code I tried are shown below. Any help is highly appreciated. Thanks.
import glob
from subprocess import check_output
import pandas as pd

data = {}
for file in glob.glob('results/*.csv'):
    name = check_output(['basename', file, '.csv']).decode().strip()
    data[name] = pd.read_csv(file, index_col=0, header=0)
    data[name].columns = pd.to_numeric(data[name].columns)
data['file1_A']
A    B
1.8  1.7
1.3  1.3
data['file1_B']
A    B
1.7  1.4
1.9  1.7
data['file1_C']
A    B
1.2  1.6
2.1  2.9
Expected outcome:
file1
A     B
1.57  1.57
1.77  1.97
i.e.,
A                B
(1.8+1.7+1.2)/3  (1.7+1.4+1.6)/3
(1.3+1.9+2.1)/3  (1.3+1.7+2.9)/3
# I usually write the following code for a small number of samples
file1 = (data['file1_A'] + data['file1_B'] + data['file1_C']) / 3
# I tried to write a loop for a large number of samples, but it does not seem quite right
files = ['file1_', 'file2_', 'file3_']
totals = {}
for f in files:
    replicates = {}
    for sample, df in totals.items():
        if f in sample:
            replicates[sample] = df
    final_df = df/3
Working with multiple matrices is a job for numpy! It has a function, numpy.mean(), which takes the mean (average) over multiple matrices. The trick is that you have to convert each pandas.DataFrame into a NumPy array (numpy.ndarray) and back. Have a look at this example:
import numpy
import pandas
import random
import itertools

# Given that loading the files isn't the problem, I'll create some dummy data here
data = {
    f"file{filenumber}_{filename}": pandas.DataFrame(
        [
            {
                "A": random.random() + random.randint(0, 2),
                "B": random.random() + random.randint(0, 2),
            }
            for _ in range(2)
        ]
    )
    for filenumber, filename in itertools.chain.from_iterable([[(i, l) for l in ["A", "B", "C"]] for i in range(1, 6)])
}

# Loop the files
for filenumber in range(1, 6):
    print(f"Processing files that start with: file{filenumber}_")
    # Convert all files to numpy arrays
    numpy_arrays = [item.to_numpy() for name, item in data.items() if name.startswith(f"file{filenumber}_")]
    # Use numpy to take the mean of each cell, across the frames (mean is the same as summing and dividing by the number of elements)
    means = numpy.mean(numpy_arrays, axis=0)
    # Convert back to a dataframe
    df = pandas.DataFrame(means, columns=data[f"file{filenumber}_A"].columns)
    # Or in a single line
    df = pandas.DataFrame(numpy.mean([item.to_numpy() for name, item in data.items() if name.startswith(f"file{filenumber}_")], axis=0), columns=data[f"file{filenumber}_A"].columns)
    print(df)
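To check the approach against the numbers in the question, here is a minimal sketch that applies the same numpy averaging to the three sample replicates. It assumes the replicates are named file1_A, file1_B and file1_C and that all frames have the same shape; reusing the first replicate's index as well as its columns is my own addition, so that row labels survive the round trip.

```python
import numpy as np
import pandas as pd

# Sample values copied from the question (replicates A, B and C of file1)
data = {
    "file1_A": pd.DataFrame({"A": [1.8, 1.3], "B": [1.7, 1.3]}),
    "file1_B": pd.DataFrame({"A": [1.7, 1.9], "B": [1.4, 1.7]}),
    "file1_C": pd.DataFrame({"A": [1.2, 2.1], "B": [1.6, 2.9]}),
}

# Collect the replicates of one sample; all frames must share the same shape
frames = [df for name, df in data.items() if name.startswith("file1_")]

# Stack into a 3-D array (replicate, row, column) and average over replicates
stacked = np.stack([df.to_numpy() for df in frames])
means = stacked.mean(axis=0)

# Rebuild the frame, reusing the row and column labels of the first replicate
file1 = pd.DataFrame(means, index=frames[0].index, columns=frames[0].columns)
print(file1.round(2))
```

Each cell of file1 is the average of the three values at that position, for example (1.8+1.7+1.2)/3 in the first row of column A.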
It seems the answer is quite easy. Here is a simple loop that worked to get the average matrix of all replicates.
import glob
from subprocess import check_output
import pandas as pd

# load all files into an empty dictionary
data = {}
for file in glob.glob('results/*.csv'):
    name = check_output(['basename', file, '.csv']).decode().strip()
    data[name] = pd.read_csv(file, index_col=0, header=0)
    data[name].columns = pd.to_numeric(data[name].columns)
# write a loop to get the average of the replicate matrices
files = ['file1_', 'file2_', 'file3_']
totals = {}
for f in files:
    df = (data[f + 'A'] + data[f + 'B'] + data[f + 'C']) / 3
    totals[f] = df
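The loop above hard-codes both the sample prefixes and the three replicate suffixes. If the number of samples or replicates varies, one possible generalization is to group the frames by name first. The sketch below is only an illustration: average_replicates is a hypothetical helper, it assumes every name has the form prefix_replicate (split at the last underscore), and the small frames are made-up test values, not data from the question.

```python
from collections import defaultdict

import pandas as pd


def average_replicates(data):
    """Average data frames whose names share the prefix before the last '_'.

    Hypothetical helper: assumes names look like 'file1_A', 'file1_B', ...
    """
    groups = defaultdict(list)
    for name, df in data.items():
        prefix, _, _replicate = name.rpartition("_")
        groups[prefix].append(df)
    # Add the frames cell by cell (pandas aligns on row and column labels),
    # then divide by the number of replicates in the group
    return {
        prefix: sum(frames[1:], frames[0]) / len(frames)
        for prefix, frames in groups.items()
    }


# Made-up frames, two replicates per sample
data = {
    "file1_A": pd.DataFrame({"A": [1.0, 2.0]}),
    "file1_B": pd.DataFrame({"A": [3.0, 4.0]}),
    "file2_A": pd.DataFrame({"A": [5.0, 6.0]}),
    "file2_B": pd.DataFrame({"A": [7.0, 8.0]}),
}
totals = average_replicates(data)
print(totals["file1"])
```

Note that the result here is keyed by 'file1' rather than 'file1_', and that a cell missing from any replicate becomes NaN, because the frames are added with ordinary label alignment.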