简体   繁体   English

添加列数据框

[英]add column data frame

I would to add to the CellID column a number in the way to classify them.我会在 CellID 列中添加一个数字来对它们进行分类。 The dataframe is this: dataframe 是这样的:

umap
                           CellID  wnnUMAP_1  wnnUMAP_2
0      KO_d0_r1:AAACAGCCACCTGCTCx  -8.127543   1.593849
1      KO_d0_r2:AAACAGCCACGTAATTx  -7.246094  -4.566527
2      HT_d0_r1:AAACAGCCATAATGAGx   7.617473   2.449949
3      HT_d0_r2:AAACATGCACCTAATGx  -7.944949   6.633856

And my resoult would be this one我的结果就是这个

 umap
                               CellID    wnnUMAP_1   wnnUMAP_2
    0      KO_d0_r1:AAACAGCCACCTGCTCx-0  -8.127543   1.593849
    1      KO_d0_r2:AAACAGCCACGTAATTx-1  -7.246094  -4.566527
    2      HT_d0_r1:AAACAGCCATAATGAGx-2   7.617473   2.449949
    3      HT_d0_r2:AAACATGCACCTAATGx-3  -7.944949   6.633856

I would to add the 0 to KO_d0_r1, a -1 to KO_d0_r2, a -2 to HT_do_r1 and a -3 HT_d0_r2.我会将0 to KO_d0_r1, a -1 to KO_d0_r2, a -2 to HT_do_r1 and a -3 HT_d0_r2. This is just an example, I have a lot of strings that have the prefix KO_d0_r1 , ecc., so I would to distinguish them by the suffix.这只是一个例子,我有很多带有前缀KO_d0_r1 ,ecc. 的字符串,所以我会通过后缀来区分它们。 My attempt was:我的尝试是:

umap = umap.rename(columns = {'Unnamed: 0':'CellID'})

But it doesn't work但它不起作用

You can use.cat() to concatenate strings.您可以使用 .cat() 连接字符串。

df["CellID"] = df["CellID"].str.cat([df.index.map(str)], sep="-")

https://pandas.pydata.org/docs/reference/api/pandas.Series.str.cat.html https://pandas.pydata.org/docs/reference/api/pandas.Series.str.cat.html

import pandas as pd

data = [["KO_d0_r1:AAACAGCCACCTGCTCx", -8.127543, 1.593849],
        ["KO_d0_r2:AAACAGCCACGTAATTx", -7.246094, -4.566527],
        ["HT_d0_r1:AAACAGCCATAATGAGx", 7.617473, 2.449949]]

df = pd.DataFrame(data, columns=["CellID", "wnnUMAP_1", "wnnUMAP_2"])
df["CellID"] = df["CellID"].str.cat([df.index.map(str)], sep="-")

df is now: df 现在是:

                         CellID  wnnUMAP_1  wnnUMAP_2
0  KO_d0_r1:AAACAGCCACCTGCTCx-0  -8.127543   1.593849
1  KO_d0_r2:AAACAGCCACGTAATTx-1  -7.246094  -4.566527
2  HT_d0_r1:AAACAGCCATAATGAGx-2   7.617473   2.449949

another approach, and simpler solution that don't require mapping, especially if you have big number of uniques CellID.另一种方法和更简单的不需要映射的解决方案,特别是如果您有大量的唯一 CellID。

  1. if no duplicates in df['CellID'] :如果df['CellID']中没有重复项:
df['CellID'] = df['CellID'] + '-' + (df.index + 1).astype(str)
  1. if df['CellID'] contains duplicates:如果df['CellID']包含重复项:
df
    CellID                      wnnUMAP_1   wnnUMAP_2
0   KO_d0_r1:AAACAGCCACCTGCTCx  -8.127543   1.593849
1   KO_d0_r2:AAACAGCCACGTAATTx  -7.246094   -4.566527
2   HT_d0_r1:AAACAGCCATAATGAGx  7.617473    2.449949
3   HT_d0_r2:AAACATGCACCTAATGx  -7.944949   6.633856
4   HT_d0_r2:AAACATGCACCTAATGx  -6.944949   2.633856
5   HT_d0_r2:AAACATGCACCTAATGx  -5.944949   3.633856

df = df.merge((df['CellID'].drop_duplicates() + '-' + (df['CellID'].drop_duplicates().index + 1).astype(str)).reset_index(name='CellID_classified').eval('CellID= CellID_classified.str.split("-").str[0]').drop('index', axis=1), on='CellID', how='left').drop('CellID', axis=1)

df
    wnnUMAP_1   wnnUMAP_2   CellID_classified
0   -8.127543   1.593849    KO_d0_r1:AAACAGCCACCTGCTCx-1
1   -7.246094   -4.566527   KO_d0_r2:AAACAGCCACGTAATTx-2
2   7.617473    2.449949    HT_d0_r1:AAACAGCCATAATGAGx-3
3   -7.944949   6.633856    HT_d0_r2:AAACATGCACCTAATGx-4
4   -6.944949   2.633856    HT_d0_r2:AAACATGCACCTAATGx-4
5   -5.944949   3.633856    HT_d0_r2:AAACATGCACCTAATGx-4

Create a dictionary containing mapping of the prefixes to the corresponding suffix value of interest, then split CellID on : with n=1 which will basically split 1 times at max, then call Series.str.map passing the dictionary mapping object.创建一个字典,其中包含前缀到感兴趣的相应后缀值的映射,然后将CellID拆分为: ,其中n=1基本上最多拆分 1 次,然后调用Series.str.map传递字典映射 object。 You can finally join with the cellID column.您终于可以加入cellID列。

mapping = {'KO_d0_r1':'0', 'KO_d0_r2':'1', 'HT_d0_r1': '2', 'HT_d0_r2':'3'}

umap['CellID']=umap['CellID']\
               +'-'\
               +umap['CellID'].str.split(':', n=1).str[0].map(mapping)

OUTPUT OUTPUT

                         CellID  wnnUMAP_1  wnnUMAP_2
0  KO_d0_r1:AAACAGCCACCTGCTCx-0  -8.127543   1.593849
1  KO_d0_r2:AAACAGCCACGTAATTx-1  -7.246094  -4.566527
2  HT_d0_r1:AAACAGCCATAATGAGx-2   7.617473   2.449949
3  HT_d0_r2:AAACATGCACCTAATGx-3  -7.944949   6.633856

PS: map returns NaN for values that could not be mapped which may throw a TypeError , for the data, I just assumed that it is always going to exist, else, you may want to handle it. PS: map为无法映射的值返回NaN ,这可能会引发TypeError ,对于数据,我只是假设它总是会存在,否则,您可能想要处理它。

If you are not so concerned about the suffices and just require a unique number to be assigned, you can also use groupby then call ngroup() :如果您不太关心足够的内容并且只需要分配一个唯一的号码,您也可以使用groupby然后调用ngroup()

umap['CellID'] = umap['CellID'] \
                 + '-' \
                 + (umap
                    .groupby(umap['CellID'].str.split(':', n=1).str[0], sort=False)
                    .ngroup()
                    .astype('str')
                    )

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM