I have a large data set from which I want to extract two columns, which I managed to do with the code below:
import pandas as pd
import numpy as np
import os

pickupfile = 'pickuplist.xls'
os.chdir('some path')        # os.chdir returns None, so don't assign its result
files = os.listdir('.')
files_xls = [f for f in files if f[-3:] == 'xls']
pl = pd.ExcelFile(pickupfile)
pickuplist = pd.read_excel(pl)
# read the two columns of interest from every file, keyed by file name
df = [pd.read_excel(f, 'Sheet1')[['Exp. m/z', 'Intensity']] for f in files_xls]
plistcollect = pd.concat(df, keys=files_xls)\
    .reset_index(level=1, drop=True)\
    .rename_axis('Tag')\
    .reset_index()
Each file in the pickup list folder contains 10 columns, and the code above pulls two of them into the plistcollect DataFrame. The downside for me is that each iteration appends a file's data to the bottom of the previous data. The data look like:
Number Exp. m/z Intensity
1 1013.33 1000
2 1257.52 2000
and so on, and with append:
Number Exp. m/z Intensity
1 1013.33 1000
2 1257.52 2000
3 1013.35 3000
4 1257.61 4000
where rows 1~2 are from the first file, rows 3~4 are from the second file, and so on. Each file has a varying number of rows (e.g. file 1 has 400 rows, file 2 has 501 rows, etc.), which causes problems further down in my code. So the question is: is there a way to tag each file so that, as the files are iterated and appended to plistcollect, each row of the plistcollect DataFrame carries the name of the file it came from, so that I can perform binning per tag?
As a side note, after defining plistcollect, I perform the matching with:
ppm = 150
matches = pd.DataFrame(index=pickuplist['mass'],
                       columns=plistcollect.set_index(list(plistcollect.columns)).index,
                       dtype=bool)
for index, findex, exp_mass, intensity in plistcollect.itertuples():
    matches[(findex, exp_mass)] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6
results = {i: list(s.index[s]) for i, s in matches.iterrows()}
results2 = {key for key, value in matches.any().items() if value}
results3 = matches.any().reset_index()[matches.any().values]
which picks up the Exp. m/z values that fall within the ppm tolerance (150 ppm), still in the same format as plistcollect. Then I do the binning with np.digitize:
bins = np.arange(900, 3000, 1)
groups = results3.groupby(np.digitize(results3['Exp. m/z'], bins))
stdev = groups['Intensity'].std()
average = groups['Intensity'].mean()
CV = stdev / average * 100
resulttable = pd.concat([groups['Exp. m/z'].mean(), average, CV], axis=1)
resulttable.columns = ['Exp. m/z', 'Average', 'CV']  # safer than mutating columns.values in place
resulttable.to_excel('test.xls', index=False)
This gives me what I want in terms of raw-data analysis, e.g. (note that the numbers in this table do not correspond to the example table above):
Exp. m/z Average CV
1013.32693 582361.5354 13.49241757
1257.435414 494927.0904 12.45206038
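For context on the binning step above: np.digitize assigns each mass to the index of the 1-Da-wide bin it falls into, so Exp. m/z values that differ only slightly land in the same group. A minimal sketch of what it returns:

```python
import numpy as np

# 1-Da-wide bin edges from 900 to 2999, as in the code above
bins = np.arange(900, 3000, 1)

# np.digitize returns, for each value x, the index i with bins[i-1] <= x < bins[i]
idx = np.digitize([1013.33, 1013.35, 1257.52], bins)
# 1013.33 and 1013.35 get the same bin index, so groupby averages them together
```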
However, I want to normalize the intensity values for EACH data file, so I think the binning should be done separately on each file's data. Hence my question about tagging the rows of plistcollect with the corresponding file. Also note that the matching must be done before normalization. The normalization divides each intensity value by the sum of the intensity values from the same data file. Using the example table above, the normalized intensity for 1013.33 would be 1000/(1000+2000), and that for 1013.35 would be 3000/(3000+4000).
I can calculate the sum of all the values within each bin with no problem, but I can't seem to find a way to sum the intensity values per source file once the files have been appended together.
EDIT:
I edited the code to reflect the answer, and added 'findex' to the matches DataFrame. Now the results3 DataFrame seems to contain the file names as tags, and the groups object seems to have the Tag values as well. The question is: how do I group by the tag names?
filetags = groups['Tag']
resulttable = pd.concat([filetags, groups['Exp. m/z'].mean(), average, CV], axis=1)
produces the error message: cannot concatenate a non-NDFrame object.
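That error arises because groups['Tag'] is a SeriesGroupBy object, not a Series or DataFrame, so pd.concat rejects it. One way around it is to aggregate the tags first (e.g. take the first tag per bin); a minimal sketch with made-up sample rows standing in for results3:

```python
import numpy as np
import pandas as pd

# hypothetical sample rows in the shape of results3 after matching
results3 = pd.DataFrame({
    'Tag': ['test1.xls', 'test1.xls', 'test2.xls'],
    'Exp. m/z': [1013.33, 1013.35, 1257.52],
    'Intensity': [1000.0, 3000.0, 2000.0],
})
bins = np.arange(900, 3000, 1)
groups = results3.groupby(np.digitize(results3['Exp. m/z'], bins))

# groups['Tag'] alone is a SeriesGroupBy; aggregating it yields a Series
# that pd.concat accepts alongside the other per-bin statistics
filetags = groups['Tag'].first()
resulttable = pd.concat([filetags, groups['Intensity'].mean()], axis=1)
```

Note that first() is only a faithful label if each bin contains rows from a single file; grouping by the Tag column as well (as in the answer below) avoids that caveat.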
Edit2: The pickuplist.xls file contains a column named 'mass' that simply lists the Exp. m/z values I use to pick up the obtained Exp. m/z values from the appended files (this is where the 150 ppm comes in: a value matches when abs(mass - mass_from_file)/mass * 1e6 < 150). The pickuplist.xls looks like:
mass
1013.34
1079.3757
1095.3706
1136.3972
1241.4285
1257.4234
These form what I call the known pickup list, and each file may or may not contain these mass values. The matches definition also came from one of the kind users of Stack Overflow; it iterates over plistcollect and selects those Exp. m/z values that fall within 150 ppm of a 'mass'.
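The 150 ppm window described above can be sketched as a tiny predicate (within_ppm is a hypothetical helper name, not part of the original code):

```python
# relative mass tolerance: a file value matches a pickup-list mass when the
# difference, scaled to parts per million of the reference mass, is under ppm
def within_ppm(mass, exp_mass, ppm=150):
    return abs(mass - exp_mass) / mass * 1e6 < ppm

within_ppm(1013.34, 1013.33)   # ~10 ppm apart -> matched
within_ppm(1013.34, 1020.00)   # thousands of ppm apart -> rejected
```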
I think you can use the parameter keys in concat:
dfs = []
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')[['Exp. m/z', 'Intensity']]
    dfs.append(data)
It is the same as:
dfs = [pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']] for f in files_xls]
plistcollect = pd.concat(dfs, keys=files_xls) \
.reset_index(level=1, drop=True) \
.rename_axis('Tag') \
.reset_index()
print (plistcollect)
Tag Exp. m/z Intensity
0 test1.xls 1013.33 1000
1 test1.xls 1257.52 2000
2 test2.xls 1013.35 3000
3 test2.xls 1257.61 4000
EDIT:
I think I got it. You need to add the Tag column to matches first, and then groupby by np.digitize together with the Tag column:
print (plist)
Tag Exp. m/z Intensity
0 test1.xls 1000 2000
1 test1.xls 1000 1500
2 test1.xls 2000 3000
3 test2.xls 3000 4000
4 test2.xls 4000 5000
5 test2.xls 4000 5500
pickup = pd.DataFrame({'mass':[1000,1200,1300, 4000]})
print (pickup)
mass
0 1000
1 1200
2 1300
3 4000
matches = pd.DataFrame(index=pickup['mass'],
columns = plist.set_index(list(plist.columns)).index,
dtype=bool)
ppm = 150
for index, tags, exp_mass, intensity in plist.itertuples():
    matches[(tags, exp_mass)] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6
print (matches)
Tag test1.xls test2.xls
Exp. m/z 1000 2000 3000 4000
Intensity 2000 1500 3000 4000 5000 5500
mass
1000 True True False False False False
1200 False False False False False False
1300 False False False False False False
4000 False False False False True True
results3 = matches.any().reset_index(name='a')[matches.any().values]
print (results3)
Tag Exp. m/z Intensity a
0 test1.xls 1000 2000 True
1 test1.xls 1000 1500 True
4 test2.xls 4000 5000 True
5 test2.xls 4000 5500 True
bins = np.arange(900, 3000, 1)
groups = results3.groupby([np.digitize(results3['Exp. m/z'], bins), 'Tag'])
resulttable = groups.agg({'Intensity':['mean','std'], 'Exp. m/z': 'mean'})
resulttable.columns = resulttable.columns.map('_'.join)
resulttable['CV'] = resulttable['Intensity_std'] / resulttable['Intensity_mean'] * 100
d = {'Intensity_mean':'Average','Exp. m/z_mean':'Exp. m/z'}
resulttable = resulttable.reset_index().rename(columns=d) \
.drop(['Intensity_std', 'level_0'],axis=1)
print (resulttable)
Tag Average Exp. m/z CV
0 test1.xls 1750 1000 20.203051
1 test2.xls 5250 4000 6.734350
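Once the Tag column exists, the per-file normalization the question asks about (dividing each intensity by the total intensity of its source file) also drops out naturally. A minimal sketch using groupby(...).transform('sum'), with the sample rows from the question:

```python
import pandas as pd

plist = pd.DataFrame({
    'Tag': ['test1.xls', 'test1.xls', 'test2.xls', 'test2.xls'],
    'Exp. m/z': [1013.33, 1257.52, 1013.35, 1257.61],
    'Intensity': [1000.0, 2000.0, 3000.0, 4000.0],
})
# transform('sum') broadcasts each file's total back onto its own rows,
# so each intensity is divided by the sum from the same data file
plist['NormInt'] = plist['Intensity'] / plist.groupby('Tag')['Intensity'].transform('sum')
# test1.xls rows: 1000/(1000+2000) and 2000/(1000+2000)
# test2.xls rows: 3000/(3000+4000) and 4000/(3000+4000)
```

Applying this before the digitize step means the binned averages and CVs are computed on per-file-normalized intensities.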