pandas DataFrame - how to group and tag rows

I have a large set of data from which I want to extract two columns, which I managed to do with the code below:

import pandas as pd
import numpy as np
import os


pickupfile = 'pickuplist.xls'

# collect the .xls files in the working directory
os.chdir('some path')
files = os.listdir('.')
files_xls = [f for f in files if f.endswith('xls')]

# the known pickup list (see Edit2 below)
pl = pd.ExcelFile(pickupfile)
pickuplist = pd.read_excel(pl)

# pull the two columns of interest from every file
df = [pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']] for f in files_xls]

plistcollect = pd.concat(df, keys=files_xls)\
                 .reset_index(level=1, drop=True)\
                 .rename_axis('Tag')\
                 .reset_index()

Each file in the pickup-list folder contains 10 columns, and the code above pulls two of them into the plistcollect DataFrame. The downside for me is that each iteration appends a file's data to the bottom of the previous data. A single file looks like:

Number    Exp. m/z    Intensity
1         1013.33     1000
2         1257.52     2000

and so on. With the appending, the collected data becomes:

Number    Exp. m/z    Intensity
1         1013.33     1000
2         1257.52     2000
3         1013.35     3000
4         1257.61     4000

where rows 1~2 are from the first file, rows 3~4 are from the second file, and so on. Each file has a varying number of rows (i.e. file 1 has 400 rows, file 2 has 501 rows, etc.), which causes problems further down in my code. So the question is: is there a way to tag each file so that, when the files are iterated and appended to plistcollect, the rows of the plistcollect DataFrame are tagged with the names of the files they came from, so that I can perform the binning per tag?


As a side note, after defining plistcollect I perform the matching with:

ppm = 150

# boolean frame: one row per known mass, one column per plistcollect row
matches = pd.DataFrame(index=pickuplist['mass'],
                       columns=plistcollect.set_index(list(plistcollect.columns)).index,
                       dtype=bool)

# True wherever an observed m/z lies within the ppm tolerance of a known mass
for index, findex, exp_mass, intensity in plistcollect.itertuples():
    matches[(findex, exp_mass)] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6


# the matched masses, in a few different shapes
results = {i: list(s.index[s]) for i, s in matches.iterrows()}
results2 = {key for key, value in matches.any().iteritems() if value}
results3 = matches.any().reset_index()[matches.any().values]

which picks up the Exp. m/z values that fall within the ppm tolerance (150 ppm), still in the same format as plistcollect. Then I do the binning with np.digitize:

bins = np.arange(900, 3000, 1)

# one group per 1-unit m/z bin
groups = results3.groupby(np.digitize(results3['Exp. m/z'], bins))

stdev = groups['Intensity'].std()
average = groups['Intensity'].mean()
CV = stdev / average * 100

resulttable = pd.concat([groups['Exp. m/z'].mean(), average, CV], axis=1)
resulttable.columns.values[1] = 'Average'
resulttable.columns.values[2] = 'CV'

resulttable.to_excel('test.xls', index=False)

This gives me what I want in terms of raw-data analysis (please note that the numbers in this table do not correspond to the example table above):

Exp. m/z    Average     CV
1013.32693  582361.5354 13.49241757
1257.435414 494927.0904 12.45206038

However, I want to normalize the intensity values for EACH data file, so I think the binning should be done on each file's data separately. Hence my question about tagging the rows of plistcollect with their source file. Also note that the matching must be done before the normalization. The normalization would divide each intensity value by the sum of the intensity values from the same data file. Using the example table above, the normalized intensity for 1013.33 would be 1000/(1000+2000), and that for 1013.35 would be 3000/(3000+4000).

I can calculate the sum of all the values within each bin with no problem, but I can't seem to find a way to compute the per-file sums, i.e. the sum of the intensity values over just the rows that came from the same appended file.
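Once the rows carry a file tag (see the answer below), the per-file sums described here are a single groupby away. A minimal sketch, assuming plistcollect already has the Tag column that the answer constructs:

# Sketch, assuming plistcollect carries the 'Tag' column from the answer below.
# groupby('Tag') collects the rows per source file; sum() gives one total per file.
file_sums = plistcollect.groupby('Tag')['Intensity'].sum()
print(file_sums)
# Tag
# test1.xls    3000
# test2.xls    7000

With the example table above, these totals (1000+2000 and 3000+4000) are exactly the denominators the normalization needs.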

EDIT:

I edited the code to reflect the answer, and added 'findex' to the matches DataFrame. Now the results3 DataFrame seems to contain the file names as tags, and the groups object has the Tag values as well. The question is: how do I designate/group by the tag names?

filetags = groups['Tag']
resulttable = pd.concat([filetags, groups['Exp. m/z'].mean(), average, CV], axis=1)

produces the error message: cannot concatenate a non-NDFrame object.
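For what it's worth, that error comes from passing groups['Tag'] to pd.concat: it is a SeriesGroupBy object, not a Series or DataFrame. One hedged workaround is to aggregate the tags first, for example:

# groups['Tag'] is a SeriesGroupBy; aggregate it before concatenating.
# .first() takes the first tag in each bin, which is only safe if every
# bin contains rows from a single file.
filetags = groups['Tag'].first()
resulttable = pd.concat([filetags, groups['Exp. m/z'].mean(), average, CV], axis=1)

though the cleaner fix is the one in the answer's edit below: include 'Tag' among the groupby keys.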

Edit2: The pickuplist.xls file contains a column named 'mass' that simply holds a list of Exp. m/z values, which I use to pick up the obtained Exp. m/z values from the appended files. This is where ppm = 150 comes in: I keep those Exp. m/z values that fall within a 150 ppm difference, i.e. abs(mass - mass_from_file)/mass * 1e6 < 150. pickuplist.xls looks like:

mass
1013.34
1079.3757
1095.3706
1136.3972
1241.4285
1257.4234

These are what I call the known pickup list; each file may or may not contain these mass values. The matches definition actually also came from one of the kind users of Stack Overflow. It iterates over plistcollect and selects the Exp. m/z values that fall within 150 ppm of a 'mass'.
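As an aside, the same 150 ppm criterion can be written without building a one-column-per-row boolean frame, using NumPy broadcasting. A sketch under the same definitions as above:

import numpy as np

ppm = 150
mass = pickuplist['mass'].to_numpy()      # known pickup masses
mz = plistcollect['Exp. m/z'].to_numpy()  # observed m/z values

# within[i, j] is True when mz[j] lies within 150 ppm of mass[i]
within = np.abs(mass[:, None] - mz[None, :]) / mass[:, None] < ppm / 1e6

# keep the plistcollect rows matched by at least one pickup mass
matched_rows = plistcollect[within.any(axis=0)]

matched_rows keeps the Tag column, so it can feed the per-file grouping directly.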

I think you can use the parameter keys in concat:

dfs = []
for f in files_xls:
    data = pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']]
    dfs.append(data)

This is the same as:

dfs = [pd.read_excel(f, 'Sheet1')[['Exp. m/z','Intensity']] for f in files_xls]

plistcollect = pd.concat(dfs, keys=files_xls) \
                 .reset_index(level=1, drop=True) \
                 .rename_axis('Tag') \
                 .reset_index()
print (plistcollect)
         Tag  Exp. m/z  Intensity
0  test1.xls  1013.33       1000
1  test1.xls  1257.52       2000
2  test2.xls  1013.35       3000
3  test2.xls  1257.61       4000
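The keys argument makes the outermost index level hold the file names; reset_index(level=1, drop=True) discards the per-file row numbers, rename_axis('Tag') names the remaining level, and the final reset_index turns it into an ordinary Tag column.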

EDIT:

I think I got it. The Tag column needs to be added to matches first, and then you group by np.digitize together with the Tag column:

print (plist)
         Tag  Exp. m/z  Intensity
0  test1.xls      1000       2000
1  test1.xls      1000       1500
2  test1.xls      2000       3000
3  test2.xls      3000       4000
4  test2.xls      4000       5000
5  test2.xls      4000       5500

pickup = pd.DataFrame({'mass':[1000,1200,1300, 4000]})
print (pickup)
   mass
0  1000
1  1200
2  1300
3  4000

matches = pd.DataFrame(index=pickup['mass'], 
                       columns = plist.set_index(list(plist.columns)).index, 
                       dtype=bool)

ppm = 150
for index, tags, exp_mass, intensity in plist.itertuples():
    matches[(tags, exp_mass)] = abs(matches.index - exp_mass) / matches.index < ppm / 1e6

print (matches)
Tag       test1.xls               test2.xls              
Exp. m/z       1000          2000      3000   4000       
Intensity      2000   1500   3000      4000   5000   5500
mass                                                     
1000           True   True  False     False  False  False
1200          False  False  False     False  False  False
1300          False  False  False     False  False  False
4000          False  False  False     False   True   True

results3 = matches.any().reset_index(name='a')[matches.any().values]
print (results3)
         Tag  Exp. m/z  Intensity     a
0  test1.xls      1000       2000  True
1  test1.xls      1000       1500  True
4  test2.xls      4000       5000  True
5  test2.xls      4000       5500  True

bins = np.arange(900, 3000, 1)
groups = results3.groupby([np.digitize(results3['Exp. m/z'], bins), 'Tag'])

resulttable = groups.agg({'Intensity':['mean','std'], 'Exp. m/z': 'mean'})
resulttable.columns = resulttable.columns.map('_'.join)
resulttable['CV'] = resulttable['Intensity_std'] / resulttable['Intensity_mean'] * 100
d = {'Intensity_mean':'Average','Exp. m/z_mean':'Exp. m/z'}
resulttable = resulttable.reset_index().rename(columns=d) \
                          .drop(['Intensity_std', 'level_0'],axis=1)
print (resulttable)
         Tag  Average  Exp. m/z         CV
0  test1.xls     1750      1000  20.203051
1  test2.xls     5250      4000   6.734350
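Since the original goal was per-file normalization, one hedged final step is to divide each intensity by its own file's total before aggregating. A sketch building on results3 above (whether the denominator should be the matched subset's total or the whole file's total is the author's call):

# normalize within each file, then redo the binned statistics
results3['NormIntensity'] = results3['Intensity'] / \
    results3.groupby('Tag')['Intensity'].transform('sum')

groups = results3.groupby([np.digitize(results3['Exp. m/z'], bins), 'Tag'])
norm_average = groups['NormIntensity'].mean()
norm_CV = groups['NormIntensity'].std() / norm_average * 100

transform('sum') broadcasts each file's total back onto its own rows, which matches the worked example in the question: 1000/(1000+2000) for the first file and 3000/(3000+4000) for the second.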
