
Map Pandas dataframe based on index, column name and original value?

I would like to map the values of a dataframe to values from a different dataframe (which might also be a dict). The element a value maps to depends on three things:

  1. the original value,
  2. the index name and
  3. the column name.

For example, I have the following dataframe:

import pandas as pd

df = pd.DataFrame(
    data={"Feature_1": [-1, 1, 1, 3], "Feature_2": [0, 2, 2, 4]},
    index=["00-1", "00-1", "00-2", "00-2"],
)

which looks like this:

      Feature_1  Feature_2
00-1         -1          0
00-1          1          2
00-2          1          2
00-2          3          4

There is another dataframe named mapping which contains the mapping rules:

dict_01 = {"00-1": {"Feature_1": [0, "A", "B"], "Feature_2": [1, "C", "D"]},
           "00-2": {"Feature_1": [2, "E", "F"], "Feature_2": [3, "G", "H"]}}
mapping = pd.DataFrame.from_dict(dict_01).transpose()

Thus, mapping looks like this:

      Feature_1  Feature_2
00-1  [0, A, B]  [1, C, D]
00-2  [2, E, F]  [3, G, H]

I want to map each element to one of two values based on a threshold. The threshold is different for each index-feature combination. In the mapping dataframe the first element of each list is the threshold. If the original value is smaller than the threshold, it should be mapped to the second element of the list; if it is larger or equal, it should be mapped to the third element.
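In other words, the rule for a single cell looks like this (`map_cell` is just an illustrative helper name, not part of the data):

```python
def map_cell(value, rule):
    # rule = [threshold, label_if_below, label_if_at_or_above]
    threshold, below, at_or_above = rule
    return below if value < threshold else at_or_above

map_cell(-1, [0, "A", "B"])  # -1 < 0, so "A"
map_cell(3, [2, "E", "F"])   # 3 >= 2, so "F"
```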

I am able to get the desired result by looping over rows and columns (see below).

df_mapped = df.astype(object)  # object dtype so string labels can replace numbers
for col in df_mapped.columns:
    for row in range(len(df_mapped)):
        idx = df_mapped.index[row]
        threshold, below, at_or_above = mapping.loc[idx, col]
        if df[col].iloc[row] < threshold:
            df_mapped.iloc[row, df_mapped.columns.get_loc(col)] = below
        else:
            df_mapped.iloc[row, df_mapped.columns.get_loc(col)] = at_or_above

Result (df_mapped):

     Feature_1 Feature_2
00-1         A         C
00-1         B         D
00-2         E         G
00-2         F         H

But the actual dataset is large in both dimensions (rows and columns), and I am looking for an efficient way to compute the mapping. When using something like apply() or map() I never seem to have access to all three required pieces (value, index and column name)... Is there an efficient way to achieve the desired result? Thanks a lot!

Create a DataFrame with MultiIndex columns from the lists and then compare with DataFrame.lt. To select each level use DataFrame.xs, align the index with DataFrame.reindex_like, and set the values by the mask with DataFrame.where:

comp = [pd.DataFrame(mapping[x].values.tolist(), index=mapping.index)
        for x in mapping.columns]
mapping1 = pd.concat(comp, axis=1, keys=mapping.columns)
print(mapping1)
     Feature_1       Feature_2      
             0  1  2         0  1  2
00-1         0  A  B         1  C  D
00-2         2  E  F         3  G  H
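Here, `xs(0, level=1, axis=1)` selects one element of each list per feature; with level 0 that is the threshold part. A self-contained sketch (repeating the construction of `mapping1`):

```python
import pandas as pd

dict_01 = {"00-1": {"Feature_1": [0, "A", "B"], "Feature_2": [1, "C", "D"]},
           "00-2": {"Feature_1": [2, "E", "F"], "Feature_2": [3, "G", "H"]}}
mapping = pd.DataFrame.from_dict(dict_01).transpose()

# One sub-DataFrame per feature, then concat into MultiIndex columns
comp = [pd.DataFrame(mapping[x].values.tolist(), index=mapping.index)
        for x in mapping.columns]
mapping1 = pd.concat(comp, axis=1, keys=mapping.columns)

thresholds = mapping1.xs(0, level=1, axis=1)
#       Feature_1  Feature_2
# 00-1          0          1
# 00-2          2          3
```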

mask = df.lt(mapping1.xs(0, level=1, axis=1))
df1 = (mapping1.xs(1, level=1, axis=1)
               .reindex_like(df)
               .where(mask, mapping1.xs(2, level=1, axis=1)))
print(df1)
     Feature_1 Feature_2
00-1         A         C
00-1         B         D
00-2         E         G
00-2         F         H
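Another sketch of the same idea (my own variant, not from the answer above): expand mapping into three frames aligned with df — thresholds, below-labels, at-or-above-labels — and pick with numpy.where. The helper name `split_part` is hypothetical:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data={"Feature_1": [-1, 1, 1, 3], "Feature_2": [0, 2, 2, 4]},
    index=["00-1", "00-1", "00-2", "00-2"],
)
dict_01 = {"00-1": {"Feature_1": [0, "A", "B"], "Feature_2": [1, "C", "D"]},
           "00-2": {"Feature_1": [2, "E", "F"], "Feature_2": [3, "G", "H"]}}
mapping = pd.DataFrame.from_dict(dict_01).transpose()

def split_part(i):
    # i-th element of every list cell, broadcast to df's (duplicated) index
    return pd.DataFrame({c: [v[i] for v in mapping[c]] for c in mapping.columns},
                        index=mapping.index).reindex(df.index)

thr, below, above = split_part(0), split_part(1), split_part(2)
df_mapped = pd.DataFrame(np.where(df.values < thr.values, below.values, above.values),
                         index=df.index, columns=df.columns)
```

Because `mapping`'s index is unique, `reindex(df.index)` repeats each rule row once per matching row of df, so the three value arrays line up cell-by-cell with df.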
