
Slicing a hierarchical dataframe in Pandas

I have a hierarchical Excel sheet that looks something like this:

Df
lev1    lev2   lev3    lev4   lev5   description
RD21    Nan    Nan     Nan    Nan    Oil
Nan     RD32   Nan     Nan    Nan    Oil/Canola
Nan     Nan    RD33    Nan    Nan    Oil/Canola/Wheat
Nan     Nan    RD34    Nan    Nan    Oil/Canola/Flour
Nan     Nan    Nan     RD55   Nan    Oil/Canola/Flour/Thick
ED54    Nan    Nan     Nan    Nan    Rice
Nan     ED66   Nan     Nan    Nan    Rice/White
Nan     Nan    ED88    Nan    Nan    Rice/White/Jasmine
Nan     Nan    ED89    Nan    Nan    Rice/White/Basmati
Nan     ED68   Nan     Nan    Nan    Rice/Brown

I would like to get all the level codes along the hierarchy, based on a search in the "description" column. Example 1: if I search for "Brown" in the description, it should give me something like this:

ED54: Rice
ED68: Rice/Brown

Example 2: if I search for "Thick" in the description column, it should give me something like this:

RD21: Oil
RD32: Oil/Canola
RD34: Oil/Canola/Flour
RD55: Oil/Canola/Flour/Thick

Searching for a word is easily handled with Df["description"].str.contains(word), and I can use a regular expression to find a specific pattern if required. But how do I get the codes for every level of the hierarchy that leads to the matched word?
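For reference, the search step described above might look like the following minimal sketch. The frame is a cut-down, hypothetical copy of the sample that holds only the description column; the name `matches` is introduced here for illustration.

```python
import pandas as pd

# Cut-down copy of the sample data: only the column being searched.
df = pd.DataFrame({
    "description": [
        "Oil",
        "Oil/Canola",
        "Oil/Canola/Flour",
        "Oil/Canola/Flour/Thick",
        "Rice",
        "Rice/Brown",
    ]
})

word = "Thick"
# regex=False treats the word as a literal substring;
# drop it to have the pattern interpreted as a regular expression.
mask = df["description"].str.contains(word, regex=False)
matches = df.loc[mask, "description"].tolist()
print(matches)  # ['Oil/Canola/Flour/Thick']
```

Note that a substring search for a top-level word such as "Oil" would match every row under it, so the search alone does not say which codes belong to the path.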

Build a hierarchical dict from the lev1 to lev5 columns. First, pair each row's code with its description split into path segments:

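# The depth of the description path tells us which level column holds
# the code: depth 1 -> lev1 (position 0), depth 2 -> lev2, and so on.
vv = df.apply(
    lambda x: (
        x.iloc[len(x.description.split('/')) - 1],
        x.description.split('/')
    ), axis=1
).values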

vv looks like:

array([('RD21', ['Oil']), ('RD32', ['Oil', 'Canola']),
       ('RD33', ['Oil', 'Canola', 'Wheat']),
       ('RD34', ['Oil', 'Canola', 'Flour']),
       ('RD55', ['Oil', 'Canola', 'Flour', 'Thick']), ('ED54', ['Rice']),
       ('ED66', ['Rice', 'White']),
       ('ED88', ['Rice', 'White', 'Jasmine']),
       ('ED89', ['Rice', 'White', 'Basmati']),
       ('ED68', ['Rice', 'Brown'])], dtype=object)
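As an alternative way to pick each row's code, here is a sketch that does not rely on the depth of the description. It assumes the empty cells are real NaN values (the table prints them as Nan, so this is an assumption) and that each row has exactly one code. The names `level_cols`, `codes` and `pairs` are hypothetical, and the frame is a shortened copy of the sample.

```python
import numpy as np
import pandas as pd

# Shortened copy of the sample data, with real NaN in the empty cells.
df = pd.DataFrame({
    "lev1": ["RD21", np.nan, np.nan],
    "lev2": [np.nan, "RD32", np.nan],
    "lev3": [np.nan, np.nan, "RD33"],
    "description": ["Oil", "Oil/Canola", "Oil/Canola/Wheat"],
})

level_cols = ["lev1", "lev2", "lev3"]

# Back-filling along each row pulls the single non-null code
# into the first level column, whichever level it started in.
codes = df[level_cols].bfill(axis=1).iloc[:, 0]

# Same shape as vv: (code, list of path segments) per row.
pairs = list(zip(codes, df["description"].str.split("/")))
print(pairs)
```

If the sheet really contains the text "Nan" rather than missing values, those cells would need to be converted first, for example with `df.replace("Nan", np.nan)`.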

Then build the hierarchical dictionary from vv:

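d = {}
for i in vv:
    v = i[0]  # code, e.g. 'RD33'
    k = i[1]  # path, e.g. ['Oil', 'Canola', 'Wheat']

    # Walk down to the parent node, creating missing levels on the way,
    # then store the code under the last key of the path.
    f_d = d
    for j in k[:-1]:
        f_d = f_d.setdefault(j, {})
    f_d.setdefault(k[-1], {})['_value'] = v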

d looks like:

{'Oil': {'_value': 'RD21',
  'Canola': {'_value': 'RD32',
   'Wheat': {'_value': 'RD33'},
   'Flour': {'_value': 'RD34', 'Thick': {'_value': 'RD55'}}}},
 'Rice': {'_value': 'ED54',
  'White': {'_value': 'ED66',
   'Jasmine': {'_value': 'ED88'},
   'Basmati': {'_value': 'ED89'}},
  'Brown': {'_value': 'ED68'}}}

Now suppose you search for the word with Df["description"].str.contains(word) (or a regular expression) and the matching description is:

'Oil/Canola/Flour/Thick'

You can get the results like:

desc_split = 'Oil/Canola/Flour/Thick'.split('/')
res = []
for i in range(len(desc_split)):
    all_keys = desc_split[:i+1]
    d2 = d
    for k in all_keys:
        d2 = d2[k]
    res.append(f"{d2['_value']}: {'/'.join(all_keys)}")

res looks like:

['RD21: Oil',
 'RD32: Oil/Canola',
 'RD34: Oil/Canola/Flour',
 'RD55: Oil/Canola/Flour/Thick']
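Because every description is already a full path, the same result can also be obtained without the nested dict. The sketch below is an alternative, not the method above: it maps each full path to its code and then looks up every prefix of a matched path. The function name `codes_for` and the names `rows` and `code_by_path` are hypothetical, and it assumes every ancestor path has its own row, as in the sample.

```python
# (code, description) pairs taken from the sample data.
rows = [
    ("RD21", "Oil"),
    ("RD32", "Oil/Canola"),
    ("RD33", "Oil/Canola/Wheat"),
    ("RD34", "Oil/Canola/Flour"),
    ("RD55", "Oil/Canola/Flour/Thick"),
    ("ED54", "Rice"),
    ("ED66", "Rice/White"),
    ("ED88", "Rice/White/Jasmine"),
    ("ED89", "Rice/White/Basmati"),
    ("ED68", "Rice/Brown"),
]

# Flat lookup: full path -> code.
code_by_path = {desc: code for code, desc in rows}


def codes_for(word):
    """List 'code: path' for each row ending in `word`, preceded by its ancestors."""
    out = []
    for _, desc in rows:
        parts = desc.split("/")
        if parts[-1] != word:
            continue
        # Every prefix of the matched path is an ancestor (or the row itself).
        for i in range(1, len(parts) + 1):
            prefix = "/".join(parts[:i])
            out.append(f"{code_by_path[prefix]}: {prefix}")
    return out


print(codes_for("Brown"))  # ['ED54: Rice', 'ED68: Rice/Brown']
```

Matching on the last path segment rather than on a substring keeps a search for "Oil" from returning every row beneath it; a missing ancestor row would raise a KeyError here.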
