add column data frame

Question

I would like to append a number to each value in the CellID column in order to classify them. The dataframe is this:

umap
                           CellID  wnnUMAP_1  wnnUMAP_2
0      KO_d0_r1:AAACAGCCACCTGCTCx  -8.127543   1.593849
1      KO_d0_r2:AAACAGCCACGTAATTx  -7.246094  -4.566527
2      HT_d0_r1:AAACAGCCATAATGAGx   7.617473   2.449949
3      HT_d0_r2:AAACATGCACCTAATGx  -7.944949   6.633856

And my desired result would be this one:

umap
                             CellID  wnnUMAP_1  wnnUMAP_2
0      KO_d0_r1:AAACAGCCACCTGCTCx-0  -8.127543   1.593849
1      KO_d0_r2:AAACAGCCACGTAATTx-1  -7.246094  -4.566527
2      HT_d0_r1:AAACAGCCATAATGAGx-2   7.617473   2.449949
3      HT_d0_r2:AAACATGCACCTAATGx-3  -7.944949   6.633856

I would like to add -0 to KO_d0_r1, -1 to KO_d0_r2, -2 to HT_d0_r1 and -3 to HT_d0_r2. This is just an example: I have a lot of strings with prefixes such as KO_d0_r1, etc., so I would like to distinguish them by the suffix. My attempt was:

umap = umap.rename(columns = {'Unnamed: 0':'CellID'})

But it doesn't work

Answer 1

You can use Series.str.cat() to concatenate strings.

df["CellID"] = df["CellID"].str.cat([df.index.map(str)], sep="-")

https://pandas.pydata.org/docs/reference/api/pandas.Series.str.cat.html

import pandas as pd

data = [["KO_d0_r1:AAACAGCCACCTGCTCx", -8.127543, 1.593849],
        ["KO_d0_r2:AAACAGCCACGTAATTx", -7.246094, -4.566527],
        ["HT_d0_r1:AAACAGCCATAATGAGx", 7.617473, 2.449949]]

df = pd.DataFrame(data, columns=["CellID", "wnnUMAP_1", "wnnUMAP_2"])
df["CellID"] = df["CellID"].str.cat([df.index.map(str)], sep="-")

df is now:

                         CellID  wnnUMAP_1  wnnUMAP_2
0  KO_d0_r1:AAACAGCCACCTGCTCx-0  -8.127543   1.593849
1  KO_d0_r2:AAACAGCCACGTAATTx-1  -7.246094  -4.566527
2  HT_d0_r1:AAACAGCCATAATGAGx-2   7.617473   2.449949

Answer 2

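Another, simpler approach that doesn't require a mapping, which helps especially if you have a large number of unique CellID values.

If there are no duplicates in df['CellID'] (note that this numbers from 1; drop the + 1 to start from 0 as in the question):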

df['CellID'] = df['CellID'] + '-' + (df.index + 1).astype(str)

If df['CellID'] contains duplicates:

df
    CellID                      wnnUMAP_1   wnnUMAP_2
0   KO_d0_r1:AAACAGCCACCTGCTCx  -8.127543   1.593849
1   KO_d0_r2:AAACAGCCACGTAATTx  -7.246094   -4.566527
2   HT_d0_r1:AAACAGCCATAATGAGx  7.617473    2.449949
3   HT_d0_r2:AAACATGCACCTAATGx  -7.944949   6.633856
4   HT_d0_r2:AAACATGCACCTAATGx  -6.944949   2.633856
5   HT_d0_r2:AAACATGCACCTAATGx  -5.944949   3.633856

# One entry per distinct CellID, keeping the index of its first occurrence
unique_ids = df['CellID'].drop_duplicates()
lookup = (
    (unique_ids + '-' + (unique_ids.index + 1).astype(str))
    .reset_index(name='CellID_classified')
    .eval('CellID = CellID_classified.str.split("-").str[0]')
    .drop('index', axis=1)
)
df = df.merge(lookup, on='CellID', how='left').drop('CellID', axis=1)

df
    wnnUMAP_1   wnnUMAP_2   CellID_classified
0   -8.127543   1.593849    KO_d0_r1:AAACAGCCACCTGCTCx-1
1   -7.246094   -4.566527   KO_d0_r2:AAACAGCCACGTAATTx-2
2   7.617473    2.449949    HT_d0_r1:AAACAGCCATAATGAGx-3
3   -7.944949   6.633856    HT_d0_r2:AAACATGCACCTAATGx-4
4   -6.944949   2.633856    HT_d0_r2:AAACATGCACCTAATGx-4
5   -5.944949   3.633856    HT_d0_r2:AAACATGCACCTAATGx-4
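
As a shorter alternative to the merge above, pd.factorize assigns one integer code per distinct value in order of first appearance, so duplicated CellID values share the same suffix. This is only a sketch and not the method used above; the sample frame is rebuilt here for illustration, and the + 1 offset is an assumption made to match the 1-based suffixes in the output above.

```python
import pandas as pd

# Sample data rebuilt from the example above (duplicates in the last three rows)
df = pd.DataFrame({
    "CellID": [
        "KO_d0_r1:AAACAGCCACCTGCTCx",
        "KO_d0_r2:AAACAGCCACGTAATTx",
        "HT_d0_r1:AAACAGCCATAATGAGx",
        "HT_d0_r2:AAACATGCACCTAATGx",
        "HT_d0_r2:AAACATGCACCTAATGx",
        "HT_d0_r2:AAACATGCACCTAATGx",
    ],
    "wnnUMAP_1": [-8.127543, -7.246094, 7.617473, -7.944949, -6.944949, -5.944949],
    "wnnUMAP_2": [1.593849, -4.566527, 2.449949, 6.633856, 2.633856, 3.633856],
})

# factorize returns integer codes (0, 1, 2, ...) in order of first appearance
codes, _ = pd.factorize(df["CellID"])

# Shift to 1-based numbering and append as a string suffix
suffix = pd.Series(codes + 1, index=df.index).astype(str)
df["CellID_classified"] = df["CellID"] + "-" + suffix
```

Unlike the merge version, this keeps the original CellID column and does not split on "-", so it should also cope with IDs that already contain a hyphen.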

Answer 3

Create a dictionary mapping each prefix to the suffix value you want, then split CellID on : with n=1 (which splits at most once), take the first part, and call Series.map with the dictionary. Finally, concatenate the result with the CellID column.

mapping = {'KO_d0_r1':'0', 'KO_d0_r2':'1', 'HT_d0_r1': '2', 'HT_d0_r2':'3'}

umap['CellID']=umap['CellID']\
               +'-'\
               +umap['CellID'].str.split(':', n=1).str[0].map(mapping)

OUTPUT

                         CellID  wnnUMAP_1  wnnUMAP_2
0  KO_d0_r1:AAACAGCCACCTGCTCx-0  -8.127543   1.593849
1  KO_d0_r2:AAACAGCCACGTAATTx-1  -7.246094  -4.566527
2  HT_d0_r1:AAACAGCCATAATGAGx-2   7.617473   2.449949
3  HT_d0_r2:AAACATGCACCTAATGx-3  -7.944949   6.633856

PS: map returns NaN for any prefix that is missing from the dictionary, so the concatenation may then produce NaN or raise a TypeError. For this data I just assumed that every prefix exists in the mapping; otherwise, you may want to handle the missing case.
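
One possible way to handle unmapped prefixes is sketched below: collect the prefixes that have no entry in the dictionary, then fill the gaps with a placeholder before concatenating. The WT_d0_r1 row, the reduced mapping, and the "NA" placeholder are all hypothetical and only there to trigger the missing case.

```python
import pandas as pd

# Hypothetical frame: the second prefix is deliberately absent from the mapping
umap = pd.DataFrame({
    "CellID": [
        "KO_d0_r1:AAACAGCCACCTGCTCx",
        "WT_d0_r1:AAACAGCCATAATGAGx",
    ]
})
mapping = {"KO_d0_r1": "0"}

# Text before the first ':' is the prefix
prefix = umap["CellID"].str.split(":", n=1).str[0]
suffix = prefix.map(mapping)

# Prefixes with no entry in the mapping, so they can be reported or fixed
missing = sorted(prefix[suffix.isna()].unique())

# Fall back to a placeholder suffix (an assumption; choose what suits your data)
umap["CellID"] = umap["CellID"] + "-" + suffix.fillna("NA")
```

Raising an error when missing is non-empty is an equally reasonable choice if an unmapped prefix should never occur.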

If you are not so concerned about the exact suffixes and just need a unique number assigned to each prefix, you can also use groupby and then call ngroup():

umap['CellID'] = umap['CellID'] \
                 + '-' \
                 + (umap
                    .groupby(umap['CellID'].str.split(':', n=1).str[0], sort=False)
                    .ngroup()
                    .astype('str')
                    )
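
To make the ngroup() behaviour concrete, here is a small self-contained run of the same idea. The fifth row (a second KO_d0_r1 barcode) is a hypothetical addition, included only to show that rows sharing a prefix receive the same number; with sort=False the groups are numbered in order of first appearance.

```python
import pandas as pd

umap = pd.DataFrame({
    "CellID": [
        "KO_d0_r1:AAACAGCCACCTGCTCx",
        "KO_d0_r2:AAACAGCCACGTAATTx",
        "HT_d0_r1:AAACAGCCATAATGAGx",
        "HT_d0_r2:AAACATGCACCTAATGx",
        "KO_d0_r1:AAACATGCAAGGTCCTx",  # hypothetical extra row, same prefix as row 0
    ]
})

# Prefix is everything before the first ':'
prefix = umap["CellID"].str.split(":", n=1).str[0]

# One integer per prefix group, numbered by first appearance (sort=False)
group_id = umap.groupby(prefix, sort=False).ngroup()

umap["CellID"] = umap["CellID"] + "-" + group_id.astype(str)
```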


solution1
1 2022-09-24 12:57:01

solution2
1 2022-09-24 13:32:05

solution3
0 2022-09-24 12:55:41
