
pandas DataFrame - how to group and tag rows

I have a large set of data from which I want to extract two columns, which I managed to do with the code below:

import pandas as pd
import numpy as np
import os


pickupfile = 'pickuplist.xls'

os.chdir('some path')
files = os.listdir('.')
files_xls = [f for f in files if f[-3:] == 'xls']

df = pd.DataFrame()
pl = pd.ExcelFile(pickupfile)
pickuplist = pd.read_excel(pl)

df = [pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']] for f in files_xls]

plistcollect = pd.concat(df, keys=files_xls)\
                 .reset_index(level=1, drop=True)\
                 .rename_axis('Tag')\
                 .reset_index()

Each file in the pickup list folder contains 10 columns, and the code above pulls two of them into the plistcollect DataFrame. The downside for me is that each iteration appends a file's data below the previous file's data. The data look like:

Number    Exp. m/z    Intensity
1         1013.33     1000
2         1257.52     2000

and so on, and with append:

Number    Exp. m/z    Intensity
1         1013.33     1000
2         1257.52     2000
3         1013.35     3000
4         1257.61     4000

where rows 1~2 are from the first file, rows 3~4 are from the second file, and so on. Each file has a varying number of rows (i.e. file 1 has 400 rows, file 2 has 501 rows, etc.), which causes problems further down in my code. So the question is: is there a way to tag each file so that, as the files are iterated and appended to plistcollect, the rows of the plistcollect DataFrame are tagged with the names of the files, so that I can perform binning for each tag?


As a side note, after defining plistcollect, I perform matching by:

ppm = 150

matches = pd.DataFrame(index=pickuplist['mass'], columns=plistcollect.set_index(list(plistcollect.columns)).index, dtype=bool)

for index, findex, exp_mass, intensity in plistcollect.itertuples():
    matches[(findex, exp_mass)] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6


results = {i: list(s.index[s]) for i, s in matches.iterrows()}
results2 = {key for key, value in matches.any().iteritems() if value}
results3 = matches.any().reset_index()[matches.any().values]

which picks up those Exp. m/z values that fall within the ppm tolerance (150 ppm), still in the same format as plistcollect. Then I do the binning with np.digitize:

bins = np.arange(900, 3000, 1)

groups = results3.groupby(np.digitize(results3['Exp. m/z'], bins))


stdev = groups['Intensity'].std()
average = groups['Intensity'].mean()
CV = stdev/average*100



resulttable = pd.concat([groups['Exp. m/z'].mean(),average,CV], axis=1)


resulttable.columns.values[1] = 'Average'
resulttable.columns.values[2] = 'CV'


resulttable.to_excel('test.xls', index=False)

This gives me what I want in terms of raw data analysis, e.g. (please note that the numbers in this table do not correspond to the example table above):

Exp. m/z    Average     CV
1013.32693  582361.5354 13.49241757
1257.435414 494927.0904 12.45206038

However, I want to normalize the intensity values for EACH data file, so I think the binning should be done separately on each file's data. Hence my question about whether there is a way to tag the rows of plistcollect with the corresponding file. Also please note that the matching must be done before the normalization. The normalization divides each intensity value by the sum of the intensity values from the same data file. Using the example table above, the normalized intensity for 1013.33 would be 1000/(1000+2000), and that for 1013.35 would be 3000/(3000+4000).
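Assuming the rows carry a Tag column identifying their source file, that per-file normalization can be sketched with groupby/transform (column names taken from the example table; 'Norm. Intensity' is a name I made up):

```python
import pandas as pd

# Toy frame mirroring the example table above (Tag = source file).
plistcollect = pd.DataFrame({
    'Tag':       ['file1.xls', 'file1.xls', 'file2.xls', 'file2.xls'],
    'Exp. m/z':  [1013.33, 1257.52, 1013.35, 1257.61],
    'Intensity': [1000, 2000, 3000, 4000],
})

# Divide each intensity by the total intensity of its own file.
per_file_sum = plistcollect.groupby('Tag')['Intensity'].transform('sum')
plistcollect['Norm. Intensity'] = plistcollect['Intensity'] / per_file_sum
print(plistcollect['Norm. Intensity'].tolist())
# [0.333..., 0.666..., 0.428..., 0.571...]
```

transform('sum') broadcasts each file's total back onto its own rows, so no merge is needed.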

I can calculate the sum of all the values within each bin with no problem, but I can't seem to find a way to compute the sums of intensity values per source file once the files have been appended.

EDIT:

I edited the code to reflect the answer, adding 'findex' to the matches DataFrame. Now the results3 DataFrame contains the file names as tags, and the groups object has the Tag values as well. The question is: how do I group by the tag names?

filetags = groups['Tag']
resulttable = pd.concat([filetags, groups['Exp. m/z'].mean(), average, CV], axis=1)

produces the error message: cannot concatenate a non-NDFrame object.

Edit 2: The pickuplist.xls file contains a column named 'mass' that simply lists the Exp. m/z values I use to pick up the obtained Exp. m/z values from the appended files (this is where the 150 ppm comes in: I keep the Exp. m/z values satisfying abs(mass - mass_from_file)/mass*1e6 < 150). pickuplist.xls looks like:

mass
1013.34
1079.3757
1095.3706
1136.3972
1241.4285
1257.4234

These are what I call the known pickup list, and each file may or may not contain these mass values. The matches definition actually came from another kind Stack Overflow user; it iterates over plistcollect and selects those Exp. m/z values that fall within 150 ppm of a 'mass' value.
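As an illustration only (not the asker's code), the 150 ppm criterion for a single pair of masses can be written as:

```python
def within_ppm(reference_mass, measured_mass, ppm=150):
    """True when measured_mass lies within `ppm` parts-per-million of reference_mass."""
    return abs(reference_mass - measured_mass) / reference_mass * 1e6 < ppm

print(within_ppm(1013.34, 1013.33))  # True: ~9.9 ppm apart
print(within_ppm(1013.34, 1014.50))  # False: ~1145 ppm apart
```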

I think you can use the keys parameter in concat:

dfs = []
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']]
    dfs.append(data)

It is the same as:

dfs = [pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']] for f in files_xls]

plistcollect = pd.concat(dfs, keys=files_xls) \
                 .reset_index(level=1, drop=True) \
                 .rename_axis('Tag') \
                 .reset_index()
print (plistcollect)
         Tag  Exp. m/z  Intensity
0  test1.xls  1013.33       1000
1  test1.xls  1257.52       2000
2  test2.xls  1013.35       3000
3  test2.xls  1257.61       4000
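With the Tag column in place, the per-file intensity totals (the denominators for the normalization the asker describes) fall out of a plain groupby (a sketch on the printed toy frame above):

```python
import pandas as pd

plistcollect = pd.DataFrame({
    'Tag':       ['test1.xls', 'test1.xls', 'test2.xls', 'test2.xls'],
    'Exp. m/z':  [1013.33, 1257.52, 1013.35, 1257.61],
    'Intensity': [1000, 2000, 3000, 4000],
})

# Total intensity per source file.
print(plistcollect.groupby('Tag')['Intensity'].sum())
# test1.xls    3000
# test2.xls    7000
```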

EDIT:

I think I got it. You need to add the Tag column to matches first, and then group by np.digitize together with the Tag column:

print (plist)
         Tag  Exp. m/z  Intensity
0  test1.xls      1000       2000
1  test1.xls      1000       1500
2  test1.xls      2000       3000
3  test2.xls      3000       4000
4  test2.xls      4000       5000
5  test2.xls      4000       5500

pickup = pd.DataFrame({'mass':[1000,1200,1300, 4000]})
print (pickup)
   mass
0  1000
1  1200
2  1300
3  4000

matches = pd.DataFrame(index=pickup['mass'], 
                       columns = plist.set_index(list(plist.columns)).index, 
                       dtype=bool)

ppm = 150
for index, tags, exp_mass, intensity in plist.itertuples():
    matches[(tags, exp_mass)] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6

print (matches)
Tag       test1.xls               test2.xls              
Exp. m/z       1000          2000      3000   4000       
Intensity      2000   1500   3000      4000   5000   5500
mass                                                     
1000           True   True  False     False  False  False
1200          False  False  False     False  False  False
1300          False  False  False     False  False  False
4000          False  False  False     False   True   True

results3 = matches.any().reset_index(name='a')[matches.any().values]
print (results3)
         Tag  Exp. m/z  Intensity     a
0  test1.xls      1000       2000  True
1  test1.xls      1000       1500  True
4  test2.xls      4000       5000  True
5  test2.xls      4000       5500  True
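As a side note, the itertuples loop that fills matches can also be replaced by a single NumPy broadcasting step that arrives at the same matched rows (my sketch on the same toy data, not part of the original answer):

```python
import numpy as np
import pandas as pd

plist = pd.DataFrame({
    'Tag':       ['test1.xls'] * 3 + ['test2.xls'] * 3,
    'Exp. m/z':  [1000, 1000, 2000, 3000, 4000, 4000],
    'Intensity': [2000, 1500, 3000, 4000, 5000, 5500],
})
mass = np.array([1000, 1200, 1300, 4000], dtype=float)
exp = plist['Exp. m/z'].to_numpy(dtype=float)

ppm = 150
# Shape (len(mass), len(plist)): hit[i, j] is True when row j lies within
# ppm of pickup mass i.
hit = np.abs(mass[:, None] - exp[None, :]) / mass[:, None] < ppm / 1e6
matched = plist[hit.any(axis=0)]
print(matched['Tag'].tolist())
# ['test1.xls', 'test1.xls', 'test2.xls', 'test2.xls']
```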

bins = np.arange(900, 3000, 1)
groups = results3.groupby([np.digitize(results3['Exp. m/z'], bins), 'Tag'])

resulttable = groups.agg({'Intensity':['mean','std'], 'Exp. m/z': 'mean'})
resulttable.columns = resulttable.columns.map('_'.join)
resulttable['CV'] = resulttable['Intensity_std'] / resulttable['Intensity_mean'] * 100
d = {'Intensity_mean':'Average','Exp. m/z_mean':'Exp. m/z'}
resulttable = resulttable.reset_index().rename(columns=d) \
                          .drop(['Intensity_std', 'level_0'],axis=1)
print (resulttable)
         Tag  Average  Exp. m/z         CV
0  test1.xls     1750      1000  20.203051
1  test2.xls     5250      4000   6.734350
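To fold in the per-file normalization the asker ultimately wants, one possible extension (my sketch, not part of the accepted answer) is to divide each Intensity by its Tag's total before aggregating:

```python
import numpy as np
import pandas as pd

results3 = pd.DataFrame({
    'Tag':       ['test1.xls', 'test1.xls', 'test2.xls', 'test2.xls'],
    'Exp. m/z':  [1000, 1000, 4000, 4000],
    'Intensity': [2000, 1500, 5000, 5500],
})

# Per-file normalization: each intensity divided by its own file's total.
results3['Norm'] = (results3['Intensity']
                    / results3.groupby('Tag')['Intensity'].transform('sum'))

bins = np.arange(900, 5000, 1)  # widened here so 4000 falls inside a bin
groups = results3.groupby([np.digitize(results3['Exp. m/z'], bins), 'Tag'])
resulttable = groups.agg({'Norm': ['mean', 'std'], 'Exp. m/z': 'mean'})
resulttable.columns = resulttable.columns.map('_'.join)
print(resulttable)
```

The statistics (mean, std, CV) are then computed on normalized intensities, so files with different overall signal levels become comparable within each bin.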
