简体   繁体   中英

Awkward array add attributes in intervals

I want to extract data out of a root file and get it into shape to end with a numpy array/tensor to fill it into a neural-network. I am already able to get the track data I want in shape with a padding to convert it into a numpy array, but I want to extend my array with the data of the jet they are originated from. So I have the information of all the tracks, of each jet and the intervall of tracks they are corresponded too. My first instinc was to construct an array in the shape of the tracks and using something like np.dstack to merge those two.

import uproot4 as uproot
import numpy as np
import awkward1 as ak

def ak_into_np(ak_array):
    data=np.dstack([ak.to_numpy(x) for x in ak_array])
    return data    

def get_data(filename,padding_size):
        f=uproot.open(filename)
        events= f["btagana/ttree;1"]
        track_data=events.arrays(filter_name=["Track_pt","Track_phi","Track_eta","Track_dxy","Track_dz","Track_charge"])
        jet_interval=events.arrays(filter_name=["Jet_nFirstTrack","Jet_nLastTrack"])
        jet_interval=jet_interval["Jet_nLastTrack"]-jet_interval["Jet_nFirstTrack"]
        
        jet_data=events.arrays(filter_name=["Jet_pt","Jet_phi","Jet_eta"])
        arrays_track=ak.unzip(ak.fill_none(ak.pad_none(track_data, padding_size), 0))
        arrays_interval=ak.unzip(ak.fill_none(ak.pad_none(jet_interval,padding_size),0))
        arrays_jet=ak.unzip(ak.fill_none(ak.pad_none(jet_data,padding_size),0))
        track=ak_into_np(arrays_track)
        jet=ak_into_np(arrays_jet)
        interval=ak_into_np(arrays_interval)
        
        return track,jet,interval

This is where I am so far. For efficiency reason I hope to be able to achieve this in awkward before going into numpy. I tried it in numpy with following:

def extend(track,jet,interval):
    events,tracks,varstrack=(np.shape(track))
    events,jets,varsjet=np.shape(jet)
    jet_into_track_data=[]
    for i in range(events):
        dataloop=[]
        for k in range(jets):
            if interval[i][k][0]!=0 :
                dataloop.append(np.broadcast_to(jet[i][k],(interval[i][k][0],varsjet)))
            else 
    jet_into_track_data.append(dataloop)
        
        
    
            
    return jet_into_track_data

but it already takes about 3 seconds without even achieving my goal for only 2000 events. The aim is basically [track_variables] ->[track_variables,jet_variables if track is in intervall] and it shall be stored [(event1)[[track_1],...,[track_padding_size]],...,(eventn)[[track_1],...,[track_padding_size]]]

I don't get to see the structure of your original data and I don't have a clear notion of what the desired final state is, but I can give an example inspired by the above that you can adapt. Also, I'm going to ignore the padding because that only complicates things. You'll probably want to put off padding until you've finished the combinatorics.

The tracks and jets below come from the same set of events (ie the arrays have the same lengths and they're jagged, with a different number of tracks in each list and a different number of jets in each list). Since jets are somehow derived from tracks , there are strictly fewer jets than tracks, and I take it that in your problem, the link between them is such that each jet corresponds to a contiguous, non-overlapping set of tracks (the "intervals").

Real tracks and jets would have a lot more properties than these—this is a bare minimum example. I've given the tracks an "id" so we can tell one from another, but these jets only have the inclusive "start" index and exclusive "stop" index.

If your tracks don't come with identifiers, you can add them with ak.local_index .

>>> import awkward1 as ak
>>> tracks = ak.Array([[{"id": 0}, {"id": 1}, {"id": 2}],
...                    [],
...                    [{"id": 0}, {"id": 1}]])
>>> jets = ak.Array([[{"start": 0, "stop": 2}, {"start": 2, "stop": 3}],
...                  [],
...                  [{"start": 1, "stop": 2}, {"start": 0, "stop": 1}]])

If you had all combinations between tracks and jets in each event, then you could use a slice to pick the ones that match. This is particularly useful when you have imprecise matches (that you have to match with ΔR or something). The ak.cartesian function produces lists of combinations, and nested=True groups the results by the first argument:

>>> all_combinations = ak.cartesian([tracks, jets], nested=True)
>>> all_combinations.tolist()
[[[({'id': 0}, {'start': 0, 'stop': 2}),
   ({'id': 0}, {'start': 2, 'stop': 3})],
  [({'id': 1}, {'start': 0, 'stop': 2}),
   ({'id': 1}, {'start': 2, 'stop': 3})],
  [({'id': 2}, {'start': 0, 'stop': 2}),
   ({'id': 2}, {'start': 2, 'stop': 3})]],
 [[]],
 [[({'id': 0}, {'start': 1, 'stop': 2}),
   ({'id': 0}, {'start': 0, 'stop': 1})],
  [({'id': 1}, {'start': 1, 'stop': 2}),
   ({'id': 1}, {'start': 0, 'stop': 1})]]]

We can go from there, selecting "id" values that are between the "start" and "stop" values. I started writing up a solution, but the slicing gets kind of complicated, generating all Cartesian combinations is more computationally expensive than is strictly needed for this problem (though no where near as expensive as writing a for loop!), and the general view of the Cartesian product is more useful for approximate matches than the exact indexes you have.

Instead, let's write a for loop in Numba , a just-in-time compiler for Python. Numba is limited in the Python that it can compile (all types must be known at compile-time), but it can recognize read-only Awkward Arrays and append-only ak.ArrayBuilder .

Here's a loop that considers only tracks and jets in the same event, loops over tracks, and puts the first jet that matches each track into an output ArrayBuilder.

>>> import numba as nb
>>> @nb.njit
... def match(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for track in tracks:
...             for jet in jets:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(jet)   # at most one
...                     break
...         output.end_list()
...     return output
... 
>>> builder = match(tracks, jets, ak.ArrayBuilder())
>>> builder.snapshot().tolist()
[[{'start': 0, 'stop': 2}, {'start': 0, 'stop': 2}, {'start': 2, 'stop': 3}],
 [],
 [{'start': 2, 'stop': 3}, {'start': 0, 'stop': 2}]]

Notice that these jet objects are duplicated to match the appropriate track. (This "duplication" is actually just a pointer, not really a copy.) To attach this to the tracks, you can assign it:

>>> tracks["jet"] = builder.snapshot()
>>> tracks.tolist()
[[{'id': 0, 'jet': {'start': 0, 'stop': 2}},
  {'id': 1, 'jet': {'start': 0, 'stop': 2}},
  {'id': 2, 'jet': {'start': 2, 'stop': 3}}],
 [],
 [{'id': 0, 'jet': {'start': 2, 'stop': 3}},
  {'id': 1, 'jet': {'start': 0, 'stop': 2}}]]

Here, I've assumed that you want to attach a jet to each track—perhaps you wanted to attach the set of all associated tracks to each jet:

>>> @nb.njit
... def match2(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for jet in jets:
...             output.begin_list()
...             for track in tracks:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(track)    # all tracks
...             output.end_list()
...         output.end_list()
...     return output
... 
>>> jets["tracks"] = match2(tracks, jets, ak.ArrayBuilder()).snapshot()
>>> jets.tolist()
[[{'start': 0, 'stop': 2, 'tracks': [
      {'id': 0, 'jet': {'start': 0, 'stop': 2}},
      {'id': 1, 'jet': {'start': 0, 'stop': 2}}]},
  {'start': 2, 'stop': 3, 'tracks': [
      {'id': 2, 'jet': {'start': 2, 'stop': 3}}]}],
 [],
 [{'start': 1, 'stop': 2, 'tracks': [
      {'id': 1, 'jet': {'start': 0, 'stop': 2}}]},
  {'start': 0, 'stop': 1, 'tracks': [
      {'id': 0, 'jet': {'start': 0, 'stop': 2}}]}]]

(Since I did both, now there are links both ways.) Attaching jets to tracks, rather than tracks to jets, has the added complication that some tracks might not be associated with any jet, in which case you'd have to account for possibly-missing data. (Hint: make them lists of zero or one jet for "no match" and "yes match," then use ak.firsts to convert empty lists to Nones.)

Or you could make the output of the Numba-compiled function be plain NumPy arrays. Numba knows a lot of NumPy functions. Hint for building up a Numba-compiled function: start with a minimal loop over data that doesn't do anything, test-running it as you go add output. Numba complains with a lot of output when it can't recognize the types of something—which is allowed in non-compiled Python—so it's good to know which change caused it to complain so much.

Hopefully, these examples can get you started!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM