Pandas Dataframe - Assy and Sub-Assy Numbering System without Loop

I'm working with Pandas and have a large parts list with the columns Main Assy, Sub Assy I, Sub Assy II and Sub Assy III. In each row, only one of the "Assy" columns is filled with a string. The aim is to translate the arrangement of the parts into a numbering system. The following table shows the expected outcome:

Main Assy   Sub Assy I  Sub Assy II Sub Assy III    Level I Level II    Level III   Level IV
asd                                                    1        0            0         0
               fgd                                     1        1            0         0
                           sdd                         1        1            1         0
                           dsd                         1        1            2         0
                           fhg                         1        1            3         0
                                        tdc            1        1            3         1
                                        dyx            1        1            3         2
                                        dsg            1        1            3         3
               dfg                                     1        2            0         0
                           cvf                         1        2            1         0
                           ngs                         1        2            2         0
                           vbn                         1        2            3         0
                                        dsd            1        2            3         1
                                        vcd            1        2            3         2
                                        cbn            1        2            3         3
ged                                                    2        0            0         0
               dfs                                     2        1            0         0
                           aef                         2        1            1         0
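
To reproduce the table, the four "Assy" columns can be built as in the sketch below. It assumes that empty cells are stored as NaN (the question does not say how they are represented); the `parts` list is only a convenient way of writing down the sample rows.

```python
import numpy as np
import pandas as pd

# (name, level) pairs taken from the table; level 0 = Main Assy ... 3 = Sub Assy III
parts = [("asd", 0), ("fgd", 1), ("sdd", 2), ("dsd", 2), ("fhg", 2),
         ("tdc", 3), ("dyx", 3), ("dsg", 3), ("dfg", 1), ("cvf", 2),
         ("ngs", 2), ("vbn", 2), ("dsd", 3), ("vcd", 3), ("cbn", 3),
         ("ged", 0), ("dfs", 1), ("aef", 2)]
columns = ["Main Assy", "Sub Assy I", "Sub Assy II", "Sub Assy III"]

# start with an all-NaN frame (assumption: empty cells are NaN) ...
df = pd.DataFrame(np.nan, index=range(len(parts)), columns=columns, dtype=object)
# ... and put each part name into the column of its level
for i, (name, level) in enumerate(parts):
    df.iloc[i, level] = name
```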

My plan was to count up along the rows in each "Level" column as long as nothing changes at a higher level. When a higher level changes (gets a new number), the cells at the lower levels need to go back to zero. If there is no change, the number stays the same. I tried the following:


df[lambda df: df.columns[0:4]] = df[lambda df: df.columns[0:4]].isna()

for index in range(0,4):
    mask = ((df.iloc[:,index] == False))
    print(mask)
    df.iloc[:,(index+4)] = mask.groupby((~mask).cumsum()).cumsum().astype(int)

So I check whether a cell is filled by searching for missing values. Because the data frame is large, I don't want to loop over every row with lots of conditions. I only used this one for-loop over the columns and tried to count up by creating a mask that shows the changes from False to True.

The actual outcome is:

Main Assy   Sub Assy I  Sub Assy II Sub Assy III    Level I Level II    Level III   Level IV
asd                                                    1        0            0         0
               fgd                                     0        1            0         0
                           sdd                         0        0            1         0
                           dsd                         0        0            2         0
                           fhg                         0        0            3         0
                                        tdc            0        0            0         1
                                        dyx            0        0            0         2
                                        dsg            0        0            0         3
               dfg                                     0        2            0         0
                           cvf                         0        0            1         0
                           ngs                         0        0            2         0
                           vbn                         0        0            3         0
                                        dsd            0        0            0         1
                                        vcd            0        0            0         2
                                        cbn            0        0            0         3
ged                                                    2        0            0         0
               dfs                                     0        1            0         0
                           aef                         0        0            1         0

What would be the right way to set up this conditional counting without using loops?

Key

The change to apply to each row's output is fully determined by the current "level" and the level of the previous row. Here "level" means the index of the column that holds the non-empty entry.

In other words, a state variable retaining the level of the previous row is sufficient for populating the current row correctly.
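
As a small illustration of what "level" means here, the sketch below extracts it for every row at once. The `toy` frame is made up for the example, and it assumes that empty cells are NaN and that each row has exactly one filled cell.

```python
import numpy as np
import pandas as pd

# toy part list (hypothetical): one filled cell per row, empty cells are NaN
toy = pd.DataFrame({
    "Main Assy":   ["asd", np.nan, np.nan, np.nan],
    "Sub Assy I":  [np.nan, "fgd", np.nan, "dfg"],
    "Sub Assy II": [np.nan, np.nan, "sdd", np.nan],
})

# the position of the single non-empty cell in each row is the row's level
levels = toy.notna().to_numpy().argmax(axis=1)
print(levels)  # [0 1 2 1]
```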

Code

import numpy as np

# the working dataset
df2 = df.iloc[:, :4].reset_index(drop=True)  # make a copy
df2.columns = range(4)  # rename columns to (0,1,2,3) for convenience

# output container
arr = np.zeros(df2.shape, dtype=int) 

# state variable: level of the last row
last_lv = 0

for idx, row in df2.iterrows():

    # get current indentation level
    lv = row.first_valid_index()

    if idx > 0:

        # case 1: same or decreased level
        if lv <= last_lv:
            # keep previous levels except current level
            arr[idx, :lv] = arr[idx-1, :lv]
            # current level++
            arr[idx, lv] = arr[idx-1, lv] + 1

        # case 2: increased level
        elif lv > last_lv:
            # keep previous levels
            arr[idx, :last_lv+1] = arr[idx - 1, :last_lv+1]
            # start counting the new levels
            arr[idx, last_lv+1:lv+1] = 1  

    # the first row
    else:
        arr[0, 0] = 1

    # update state variable for next use
    last_lv = lv

# append result to dataframe
df[["Level I", "Level II", "Level III", "Level IV"]] = arr

Result

print(df[["Level I", "Level II", "Level III", "Level IV"]])

    Level I  Level II  Level III  Level IV
0         1         0          0         0
1         1         1          0         0
2         1         1          1         0
3         1         1          2         0
4         1         1          3         0
5         1         1          3         1
6         1         1          3         2
7         1         1          3         3
8         1         2          0         0
9         1         2          1         0
10        1         2          2         0
11        1         2          3         0
12        1         2          3         1
13        1         2          3         2
14        1         2          3         3
15        2         0          0         0
16        2         1          0         0
17        2         1          1         0

Notes

  1. The code just demonstrates what the logic looks like when progressing through each row. It is not optimized, so consider a more efficient representation of the data (e.g. a NumPy array or just a list of level numbers) if efficiency becomes a problem.
  2. I surveyed libraries for tree data structures, such as anytree and treelib, hoping to find an automated way of outputting the tree hierarchy. Unfortunately, I/O functions suitable for reading indented text files or comparable formats seemed to be lacking. This is the main reason why I decided to reinvent the wheel anyway.
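
Since the question asks for a solution without a row loop, here is a rough sketch of how the same previous-level logic might be expressed with NumPy cumulative operations. It is not the loop code above rewritten line by line but a swapped-in technique: count the items opened per column with a cumulative sum, and subtract the count seen at the last row that was shallower than that column. `number_levels` is a hypothetical helper name, and the sketch assumes that empty cells are NaN, that each row has exactly one filled cell, and that the first row is a Main Assy.

```python
import numpy as np
import pandas as pd


def number_levels(parts: pd.DataFrame) -> np.ndarray:
    """Hypothetical helper: hierarchical numbering without a Python row loop.

    Assumes empty cells are NaN, each row has exactly one filled cell,
    and the first row is a top-level (column 0) entry.
    """
    filled = parts.notna().to_numpy()
    lv = filled.argmax(axis=1)                  # level of each row
    prev_lv = np.concatenate(([-1], lv[:-1]))   # level of the previous row
    cols = np.arange(filled.shape[1])

    # column j is "active" in a row if the row sits at level j or deeper
    active = lv[:, None] >= cols
    # a new item opens in column j when the row itself is at level j,
    # or when the row jumps past j from a shallower previous row
    starts = active & ((lv[:, None] == cols) | (prev_lv[:, None] < cols))

    running = starts.cumsum(axis=0)             # items opened so far, per column
    # a shallower row resets the counter: remember the running total there
    baseline = np.maximum.accumulate(np.where(active, 0, running), axis=0)
    return np.where(active, running - baseline, 0)


# small made-up example: levels 0, 1, 1, 2, 0, 1
names = ["a", "b", "c", "d", "e", "f"]
levels = [0, 1, 1, 2, 0, 1]
toy = pd.DataFrame(np.nan, index=range(len(names)),
                   columns=["Main Assy", "Sub Assy I", "Sub Assy II"],
                   dtype=object)
for i, (name, level) in enumerate(zip(names, levels)):
    toy.iloc[i, level] = name

result = number_levels(toy)
print(result)  # rows: [1 0 0], [1 1 0], [1 2 0], [1 2 1], [2 0 0], [2 1 0]
```

The returned array has one column per "Assy" column, so it can be assigned to the "Level" columns in the same way as `arr` in the loop version.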
