Extracting specific rows from a multi-indexed Pandas Dataframe to form new DataFrame

I have a data set that I am loading into a pandas DataFrame from a jagged 3-D array called Waveform. The DataFrame is multi-indexed by three levels: events (entry), the photons generated by each event (subentry), and the data points per photon (subsubentry).

The number of photons and the number of data points vary randomly from event to event, which is why the array is jagged. For each event (entry), I want to extract the photons (subentries) that contain at least 2*n data points, take the average of the first n data points of each selected photon, and save those averages in a new DataFrame that keeps the corresponding event and photon indices.

I can't post my actual data here because it is too large, so I will create an example with the same jagged structure, just scaled down.

import awkward as ak
import pandas as pd

# Generate an example Awkward Array and convert it to a pandas DataFrame
wf = ak.to_pandas(ak.Array([ [[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3],[0]],[[1]],[[0.1,23,534,21,53,12],[0]],[[1],[2],[0],[12,12,12,12,125,34]],[[76],[23,23,43],],[[0],[12,12,12,12]] ]))

entry subentry subsubentry          
0     0        0               1.000
               1               2.000
               2               5.000
               3               6.000
               4               8.000
               5               3.000
               6              21.000
               7               3.000
      1        0            5986.472
               1               0.000
               2               6.000
               3               1.000
               4               2.000
               5               3.000
      2        0               0.000
1     0        0               1.000
2     0        0               0.100
               1              23.000
               2             534.000
               3              21.000
               4              53.000
               5              12.000
      1        0               0.000
3     0        0               1.000
      1        0               2.000
      2        0               0.000
      3        0              12.000
               1              12.000
               2              12.000
               3              12.000
               4             125.000
               5              34.000
4     0        0              76.000
      1        0              23.000
               1              23.000
               2              43.000
5     0        0               0.000
      1        0              12.000
               1              12.000
               2              12.000
               3              12.000

#This is what I want the filter/Extraction to produce

wf_pF = ak.to_pandas(ak.Array([[[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3]],[[0.1,23,534,21,53,12]],[[12,12,12,12,125,34]] ]))

entry subentry subsubentry          
0     0        0               1.000
               1               2.000
               2               5.000
               3               6.000
               4               8.000
               5               3.000
               6              21.000
               7               3.000
      1        0            5986.472
               1               0.000
               2               6.000
               3               1.000
               4               2.000
               5               3.000
1     0        0               0.100
               1              23.000
               2             534.000
               3              21.000
               4              53.000
               5              12.000
2     0        0              12.000
               1              12.000
               2              12.000
               3              12.000
               4             125.000
               5              34.000

# I then want to take the average of the first n data points and place them in a new DataFrame like this

averages = ak.to_pandas(ak.Array([[2.667,1997.491],[185.7],[12]]))
entry subentry          
0     0            2.667
      1         1997.491
1     0          185.700
2     0           12.000

I first used query to find the data point at position 2n - 1 in the subsubentry level (here n = 3, so position 5): nQuery = wf.query('subsubentry == 5'). I then took the index of nQuery and converted its entry and subentry levels into NumPy arrays:

nQuery = wf.query('subsubentry == 5')
indices = nQuery.index.to_frame()["entry"]
indices2 = nQuery.index.to_frame()["subentry"]
ind = indices.to_numpy()
ind2 = indices2.to_numpy()

Then I used query to extract the entries together with their respective subentries:

wf_AF = wf.query("entry in @ind and subentry in @ind2")

which results in this DataFrame, wf_AF:

entry subentry subsubentry          
0     0        0               1.000
               1               2.000
               2               5.000
               3               6.000
               4               8.000
               5               3.000
               6              21.000
               7               3.000
      1        0            5986.472
               1               0.000
               2               6.000
               3               1.000
               4               2.000
               5               3.000
2     0        0               0.100
               1              23.000
               2             534.000
               3              21.000
               4              53.000
               5              12.000
      1        0               0.000
3     0        0               1.000
      1        0               2.000
      3        0              12.000
               1              12.000
               2              12.000
               3              12.000
               4             125.000
               5              34.000

It still keeps subentries (photons) that contain fewer subsubentries (data points) than the desired threshold, 2*n. What am I doing wrong? Is there something I am not understanding? How can I achieve this kind of filtering, and can it be implemented in cuDF? Because there is so much data, it would be ideal if I could replicate this in cuDF as well.

UPDATE: I deserve a dang medal...

Read through the awkward documentation; help(ak.Array.__getitem__) is particularly informative. Also see help(ak.num), help(ak.mean), etc.

The code below uses awkward's own methods to filter, slice, and perform calculations on the array, which I'm sure is significantly faster than pandas. pandas is only used here for visualization.

array = ak.Array([ [[1,2,5,6,8,3,21,3],[5986.472,0,6,1,2,3],[0]],[[1]],[[0.1,23,534,21,53,12],[0]],[[1],[2],[0],[12,12,12,12,125,34]],[[76],[23,23,43],],[[0],[12,12,12,12]] ])
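# ak.num(array, axis=-1) counts the data points in each photon;
# "> 5" keeps photons with at least 2*n = 6 data points (n = 3)
filtered = array[ak.num(array, axis=-1) > 5]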
df = ak.to_pandas(filtered)


entry subentry subsubentry
0     0        0               1.000
               1               2.000
               2               5.000
               3               6.000
               4               8.000
               5               3.000
               6              21.000
               7               3.000
      1        0            5986.472
               1               0.000
               2               6.000
               3               1.000
               4               2.000
               5               3.000
2     0        0               0.100
               1              23.000
               2             534.000
               3              21.000
               4              53.000
               5              12.000
3     0        0              12.000
               1              12.000
               2              12.000
               3              12.000
               4             125.000
               5              34.000
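To make the mask less magical, here is a rough plain-Python analogue of the counting and filtering step. It is only a sketch of the logic on nested lists, not how awkward works internally, and the names counts and mask are made up for the illustration.

```python
n = 3

nested = [
    [[1, 2, 5, 6, 8, 3, 21, 3], [5986.472, 0, 6, 1, 2, 3], [0]],
    [[1]],
    [[0.1, 23, 534, 21, 53, 12], [0]],
    [[1], [2], [0], [12, 12, 12, 12, 125, 34]],
    [[76], [23, 23, 43]],
    [[0], [12, 12, 12, 12]],
]

# Number of data points per photon, grouped by event
counts = [[len(points) for points in photons] for photons in nested]

# True where a photon has at least 2*n data points
mask = [[c >= 2 * n for c in row] for row in counts]

# Keep only the photons whose mask entry is True
filtered = [
    [points for points, keep in zip(photons, row) if keep]
    for photons, row in zip(nested, mask)
]

print(counts)
print([len(photons) for photons in filtered])
```

Events with no surviving photon end up as empty lists, which is why entries 1, 4 and 5 are absent from the printed DataFrame above.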

Say we only want the first three values from each of those...

first_three = filtered[:,:,:3]
df2 = ak.to_pandas(first_three)


entry subentry subsubentry
0     0        0               1.000
               1               2.000
               2               5.000
      1        0            5986.472
               1               0.000
               2               6.000
2     0        0               0.100
               1              23.000
               2             534.000
3     0        0              12.000
               1              12.000
               2              12.000

Now say we want the average of those...

averages = ak.mean(first_three, axis=2)
df3 = ak.to_pandas(averages)


entry subentry
0     0            2.666667
      1         1997.490667
2     0          185.700000
3     0           12.000000

All in one step:

ak.to_pandas(ak.mean(array[ak.num(array, -1)>5,:3], axis=2))

Pandas Only:

n = 3
wf_AF = wf.reset_index(-1).groupby(level=[-2, -1]).filter(lambda x: x.subsubentry.max() >= 2*n-1).set_index('subsubentry', append=True)
# OR
wf_AF = wf.groupby(level=[-3, -2]).filter(lambda x: x.index.to_series().str[2].max() >= 2*n-1)


entry subentry subsubentry
0     0        0               1.000
               1               2.000
               2               5.000
               3               6.000
               4               8.000
               5               3.000
               6              21.000
               7               3.000
      1        0            5986.472
               1               0.000
               2               6.000
               3               1.000
               4               2.000
               5               3.000
2     0        0               0.100
               1              23.000
               2             534.000
               3              21.000
               4              53.000
               5              12.000
3     3        0              12.000
               1              12.000
               2              12.000
               3              12.000
               4             125.000
               5              34.000
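As for why the query in the question kept short photons: "entry in @ind and subentry in @ind2" tests the two levels independently, so any (entry, subentry) combination built from the two lists passes, for example (2, 1) or (3, 0). The sketch below reproduces that behaviour with isin instead of query and contrasts it with a pair-wise test on the index. The frame is rebuilt with pandas only, the column name "values" is an assumption, and the variable names are made up for the illustration.

```python
import pandas as pd

n = 3
nested = [
    [[1, 2, 5, 6, 8, 3, 21, 3], [5986.472, 0, 6, 1, 2, 3], [0]],
    [[1]],
    [[0.1, 23, 534, 21, 53, 12], [0]],
    [[1], [2], [0], [12, 12, 12, 12, 125, 34]],
    [[76], [23, 23, 43]],
    [[0], [12, 12, 12, 12]],
]
records = [
    (e, p, k, float(v))
    for e, photons in enumerate(nested)
    for p, points in enumerate(photons)
    for k, v in enumerate(points)
]
wf = pd.DataFrame(
    records, columns=["entry", "subentry", "subsubentry", "values"]
).set_index(["entry", "subentry", "subsubentry"])

idx = wf.index
# Rows holding the data point at position 2n - 1
hits = wf[idx.get_level_values("subsubentry") == 2 * n - 1]

# Independent membership tests, equivalent to the query in the question
ind = hits.index.get_level_values("entry").unique()
ind2 = hits.index.get_level_values("subentry").unique()
independent = wf[
    idx.get_level_values("entry").isin(ind)
    & idx.get_level_values("subentry").isin(ind2)
]

# Pair-wise test: keep a row only if its (entry, subentry) pair had a hit
pairs = hits.index.droplevel("subsubentry")
pairwise = wf[idx.droplevel("subsubentry").isin(pairs)]

print(len(independent), len(pairwise))
```

The independent test returns the same rows as the wf_AF shown in the question, short photons included, while the pair-wise test keeps only the four photons with at least 2*n data points.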

averages = wf_AF.groupby(level=[-3, -2]).agg(lambda x: x.head(n).mean())


entry subentry
0     0            2.666667
      1         1997.490667
2     0          185.700000
3     3           12.000000
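The lambdas above run Python code once per group, which can be slow on large data. Below is a sketch of a variant that uses only built-in groupby operations. It assumes subsubentry is the 0-based position of each data point within its photon (true for frames produced as in the question) and that the value column is named "values". Whether every call here is available in cuDF is an assumption I have not verified.

```python
import pandas as pd

n = 3
nested = [
    [[1, 2, 5, 6, 8, 3, 21, 3], [5986.472, 0, 6, 1, 2, 3], [0]],
    [[1]],
    [[0.1, 23, 534, 21, 53, 12], [0]],
    [[1], [2], [0], [12, 12, 12, 12, 125, 34]],
    [[76], [23, 23, 43]],
    [[0], [12, 12, 12, 12]],
]
records = [
    (e, p, k, float(v))
    for e, photons in enumerate(nested)
    for p, points in enumerate(photons)
    for k, v in enumerate(points)
]
wf = pd.DataFrame(
    records, columns=["entry", "subentry", "subsubentry", "values"]
).set_index(["entry", "subentry", "subsubentry"])

# Work on ordinary columns so the index levels can be compared directly
flat = wf.reset_index()

# Number of data points in each photon, broadcast back to every row
size = flat.groupby(["entry", "subentry"])["values"].transform("size")

# Photons with at least 2*n points, restricted to their first n points
first_n = flat[(size >= 2 * n) & (flat["subsubentry"] < n)]

# Mean of those first n points per (entry, subentry)
averages = first_n.groupby(["entry", "subentry"])["values"].mean()
print(averages)
```

The result is a Series indexed by (entry, subentry) that keeps the original photon numbering, as in the pandas-only output above.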
