Awkward array add attributes in intervals

I want to extract data from a ROOT file and shape it into a NumPy array/tensor that I can feed into a neural network. I can already get the track data I want into shape, with padding, and convert it into a NumPy array, but I want to extend that array with the data of the jet each track originates from. So I have the information for all the tracks and for each jet, plus the interval of tracks that each jet corresponds to. My first instinct was to construct an array in the shape of the tracks and use something like np.dstack to merge the two.

import uproot4 as uproot
import numpy as np
import awkward1 as ak

def ak_into_np(ak_array):
    data = np.dstack([ak.to_numpy(x) for x in ak_array])
    return data

def get_data(filename, padding_size):
    f = uproot.open(filename)
    events = f["btagana/ttree;1"]
    track_data = events.arrays(filter_name=["Track_pt", "Track_phi", "Track_eta", "Track_dxy", "Track_dz", "Track_charge"])
    jet_interval = events.arrays(filter_name=["Jet_nFirstTrack", "Jet_nLastTrack"])
    jet_interval = jet_interval["Jet_nLastTrack"] - jet_interval["Jet_nFirstTrack"]

    jet_data = events.arrays(filter_name=["Jet_pt", "Jet_phi", "Jet_eta"])
    arrays_track = ak.unzip(ak.fill_none(ak.pad_none(track_data, padding_size), 0))
    arrays_interval = ak.unzip(ak.fill_none(ak.pad_none(jet_interval, padding_size), 0))
    arrays_jet = ak.unzip(ak.fill_none(ak.pad_none(jet_data, padding_size), 0))
    track = ak_into_np(arrays_track)
    jet = ak_into_np(arrays_jet)
    interval = ak_into_np(arrays_interval)

    return track, jet, interval

This is where I am so far. For efficiency reasons, I hope to achieve this in Awkward before going into NumPy. I tried it in NumPy with the following:

def extend(track, jet, interval):
    events, tracks, varstrack = np.shape(track)
    events, jets, varsjet = np.shape(jet)
    jet_into_track_data = []
    for i in range(events):
        dataloop = []
        for k in range(jets):
            # repeat the jet's variables once per track in its interval
            if interval[i][k][0] != 0:
                dataloop.append(np.broadcast_to(jet[i][k], (interval[i][k][0], varsjet)))
        jet_into_track_data.append(dataloop)
    return jet_into_track_data

But it already takes about 3 seconds for only 2000 events, without even achieving my goal. The aim is basically [track_variables] -> [track_variables, jet_variables if track is in interval], and the result should be stored as [(event1)[[track_1],...,[track_padding_size]],...,(eventn)[[track_1],...,[track_padding_size]]]

I don't get to see the structure of your original data and I don't have a clear notion of what the desired final state is, but I can give an example inspired by the above that you can adapt. Also, I'm going to ignore the padding because that only complicates things. You'll probably want to put off padding until you've finished the combinatorics.

The tracks and jets below come from the same set of events (i.e., the arrays have the same lengths and they're jagged, with a different number of tracks in each list and a different number of jets in each list). Since jets are somehow derived from tracks, there are strictly fewer jets than tracks, and I take it that in your problem, the link between them is such that each jet corresponds to a contiguous, non-overlapping set of tracks (the "intervals").

Real tracks and jets would have a lot more properties than these; this is a bare minimum example. I've given the tracks an "id" so we can tell one from another, but these jets only have the inclusive "start" index and exclusive "stop" index.

If your tracks don't come with identifiers, you can add them with ak.local_index.

>>> import awkward1 as ak
>>> tracks = ak.Array([[{"id": 0}, {"id": 1}, {"id": 2}],
...                    [],
...                    [{"id": 0}, {"id": 1}]])
>>> jets = ak.Array([[{"start": 0, "stop": 2}, {"start": 2, "stop": 3}],
...                  [],
...                  [{"start": 1, "stop": 2}, {"start": 0, "stop": 1}]])

If you had all combinations between tracks and jets in each event, then you could use a slice to pick the ones that match. This is particularly useful when you have imprecise matches (that you have to match with ΔR or something). The ak.cartesian function produces lists of combinations, and nested=True groups the results by the first argument:

>>> all_combinations = ak.cartesian([tracks, jets], nested=True)
>>> all_combinations.tolist()
[[[({'id': 0}, {'start': 0, 'stop': 2}),
   ({'id': 0}, {'start': 2, 'stop': 3})],
  [({'id': 1}, {'start': 0, 'stop': 2}),
   ({'id': 1}, {'start': 2, 'stop': 3})],
  [({'id': 2}, {'start': 0, 'stop': 2}),
   ({'id': 2}, {'start': 2, 'stop': 3})]],
 [[]],
 [[({'id': 0}, {'start': 1, 'stop': 2}),
   ({'id': 0}, {'start': 0, 'stop': 1})],
  [({'id': 1}, {'start': 1, 'stop': 2}),
   ({'id': 1}, {'start': 0, 'stop': 1})]]]

We can go from there, selecting "id" values that are between the "start" and "stop" values. I started writing up a solution, but the slicing gets kind of complicated. Generating all Cartesian combinations is also more computationally expensive than is strictly needed for this problem (though nowhere near as expensive as writing a pure-Python for loop!), and the general view of the Cartesian product is more useful for approximate matches than for the exact indexes you have.

Instead, let's write a for loop in Numba, a just-in-time compiler for Python. Numba is limited in the Python that it can compile (all types must be known at compile-time), but it can recognize read-only Awkward Arrays and append-only ak.ArrayBuilder.

Here's a loop that considers only tracks and jets in the same event, loops over tracks, and puts the first jet that matches each track into an output ArrayBuilder.

>>> import numba as nb
>>> @nb.njit
... def match(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for track in tracks:
...             for jet in jets:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(jet)   # at most one
...                     break
...         output.end_list()
...     return output
... 
>>> builder = match(tracks, jets, ak.ArrayBuilder())
>>> builder.snapshot().tolist()
[[{'start': 0, 'stop': 2}, {'start': 0, 'stop': 2}, {'start': 2, 'stop': 3}],
 [],
 [{'start': 0, 'stop': 1}, {'start': 1, 'stop': 2}]]

Notice that these jet objects are duplicated to match the appropriate track. (This "duplication" is actually just a pointer, not really a copy.) To attach this to the tracks, you can assign it:

>>> tracks["jet"] = builder.snapshot()
>>> tracks.tolist()
[[{'id': 0, 'jet': {'start': 0, 'stop': 2}},
  {'id': 1, 'jet': {'start': 0, 'stop': 2}},
  {'id': 2, 'jet': {'start': 2, 'stop': 3}}],
 [],
 [{'id': 0, 'jet': {'start': 0, 'stop': 1}},
  {'id': 1, 'jet': {'start': 1, 'stop': 2}}]]

Here, I've assumed that you want to attach a jet to each track. Perhaps you wanted to attach the set of all associated tracks to each jet instead:

>>> @nb.njit
... def match2(tracks_in_events, jets_in_events, output):
...     for tracks, jets in zip(tracks_in_events, jets_in_events):
...         output.begin_list()
...         for jet in jets:
...             output.begin_list()
...             for track in tracks:
...                 if jet.start <= track.id < jet.stop:
...                     output.append(track)    # all tracks
...             output.end_list()
...         output.end_list()
...     return output
... 
>>> jets["tracks"] = match2(tracks, jets, ak.ArrayBuilder()).snapshot()
>>> jets.tolist()
[[{'start': 0, 'stop': 2, 'tracks': [
      {'id': 0, 'jet': {'start': 0, 'stop': 2}},
      {'id': 1, 'jet': {'start': 0, 'stop': 2}}]},
  {'start': 2, 'stop': 3, 'tracks': [
      {'id': 2, 'jet': {'start': 2, 'stop': 3}}]}],
 [],
 [{'start': 1, 'stop': 2, 'tracks': [
      {'id': 1, 'jet': {'start': 1, 'stop': 2}}]},
  {'start': 0, 'stop': 1, 'tracks': [
      {'id': 0, 'jet': {'start': 0, 'stop': 1}}]}]]

(Since I did both, now there are links both ways.) Attaching jets to tracks, rather than tracks to jets, has the added complication that some tracks might not be associated with any jet, in which case you'd have to account for possibly-missing data. (Hint: make them lists of zero or one jet for "no match" and "yes match," then use ak.firsts to convert empty lists to Nones.)

Or you could make the output of the Numba-compiled function be plain NumPy arrays. Numba knows a lot of NumPy functions. Hint for building up a Numba-compiled function: start with a minimal loop over data that doesn't do anything, test-running it as you add output. Numba complains with a lot of output when it can't recognize the type of something (which is allowed in non-compiled Python), so it's good to know which change caused it to complain so much.
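As a sketch of the NumPy-output variant, the function below takes flat NumPy arrays (per-event track and jet counts, plus flattened jet start/stop values) and returns one jet index per track, with -1 for "no jet". The function name, the argument layout, and the `jet_pt` values are assumptions made for illustration; with Awkward Arrays, such inputs could come from ak.num and ak.flatten followed by ak.to_numpy. Numba is used if it is installed, otherwise the same loop runs as plain Python:

```python
import numpy as np

try:
    import numba as nb
    jit = nb.njit
except ImportError:          # fall back to plain Python so the sketch still runs
    def jit(func):
        return func

@jit
def jet_index_per_track(track_counts, jet_counts, jet_start, jet_stop):
    # One entry per track over all events; -1 means "no matching jet".
    out = np.full(track_counts.sum(), -1, dtype=np.int64)
    t0 = 0   # index of the first track of the current event in the flat output
    j0 = 0   # index of the first jet of the current event in the flat jet arrays
    for evt in range(len(track_counts)):
        for t in range(track_counts[evt]):
            for j in range(jet_counts[evt]):
                if jet_start[j0 + j] <= t < jet_stop[j0 + j]:
                    out[t0 + t] = j0 + j      # global index into the flat jet arrays
                    break
        t0 += track_counts[evt]
        j0 += jet_counts[evt]
    return out

# Same layout as the tracks/jets example: 3, 0, and 2 tracks; 2, 0, and 2 jets.
track_counts = np.array([3, 0, 2], dtype=np.int64)
jet_counts = np.array([2, 0, 2], dtype=np.int64)
jet_start = np.array([0, 2, 1, 0], dtype=np.int64)
jet_stop = np.array([2, 3, 2, 1], dtype=np.int64)
jet_pt = np.array([10.0, 20.0, 30.0, 40.0])   # hypothetical jet variable

index = jet_index_per_track(track_counts, jet_counts, jet_start, jet_stop)

# Gather the jet variable per track; unmatched tracks (index -1) get 0.0.
# Assumes there is at least one jet overall, since -1 still indexes jet_pt.
jet_pt_per_track = np.where(index >= 0, jet_pt[index], 0.0)
print(index, jet_pt_per_track)
```

The flat per-track result can then be reshaped or padded per event as a last step, which keeps the padding out of the matching logic.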

Hopefully, these examples can get you started!
