Splitting a pandas dataframe column by delimiter
I have a small sample of data:
import pandas as pd
df = {'ID': [3009, 129, 119, 120, 121, 122, 130, 3014, 266, 849, 174, 844],
'V': ['IGHV7-B*01', 'IGHV7-B*01', 'IGHV6-A*01', 'GHV6-A*01', 'IGHV6-A*01',
'IGHV6-A*01', 'IGHV4-L*03', 'IGHV4-L*03', 'IGHV5-A*01', 'IGHV5-A*04',
'IGHV6-A*02','IGHV6-A*02'],
'Prob': [1, 1, 0.8, 0.8056, 0.9, 0.805, 1, 1, 0.997, 0.401, 1, 1]}
df = pd.DataFrame(df)
It looks like this:
df
Out[25]:
ID Prob V
0 3009 1.0000 IGHV7-B*01
1 129 1.0000 IGHV7-B*01
2 119 0.8000 IGHV6-A*01
3 120 0.8056 GHV6-A*01
4 121 0.9000 IGHV6-A*01
5 122 0.8050 IGHV6-A*01
6 130 1.0000 IGHV4-L*03
7 3014 1.0000 IGHV4-L*03
8 266 0.9970 IGHV5-A*01
9 849 0.4010 IGHV5-A*04
10 174 1.0000 IGHV6-A*02
11 844 1.0000 IGHV6-A*02
I want to split the column 'V' by the '-' delimiter and move the second part to another column named 'allele':
Out[25]:
ID Prob V allele
0 3009 1.0000 IGHV7 B*01
1 129 1.0000 IGHV7 B*01
2 119 0.8000 IGHV6 A*01
3 120 0.8056 GHV6 A*01
4 121 0.9000 IGHV6 A*01
5 122 0.8050 IGHV6 A*01
6 130 1.0000 IGHV4 L*03
7 3014 1.0000 IGHV4 L*03
8 266 0.9970 IGHV5 A*01
9 849 0.4010 IGHV5 A*04
10 174 1.0000 IGHV6 A*02
11 844 1.0000 IGHV6 A*02
The code I have tried so far is incomplete and didn't work:
df1 = pd.DataFrame()
df1[['V']] = pd.DataFrame([ x.split('-') for x in df['V'].tolist() ])
or
df.add(Series, axis='columns', level = None, fill_value = None)
newdata = df.DataFrame({'V':df['V'].iloc[::2].values,
'Allele': df['V'].iloc[1::2].values})
Use the vectorized str.split with expand=True:
In [42]:
df[['V','allele']] = df['V'].str.split('-',expand=True)
df
Out[42]:
ID Prob V allele
0 3009 1.0000 IGHV7 B*01
1 129 1.0000 IGHV7 B*01
2 119 0.8000 IGHV6 A*01
3 120 0.8056 GHV6 A*01
4 121 0.9000 IGHV6 A*01
5 122 0.8050 IGHV6 A*01
6 130 1.0000 IGHV4 L*03
7 3014 1.0000 IGHV4 L*03
8 266 0.9970 IGHV5 A*01
9 849 0.4010 IGHV5 A*04
10 174 1.0000 IGHV6 A*02
11 844 1.0000 IGHV6 A*02
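If some values could contain more than one '-', expand=True would produce more than two columns and the two-column assignment would fail. A minimal sketch of one way to guard against that, assuming you want to split on the first '-' only (the third value below is hypothetical and not part of the question's data):

```python
import pandas as pd

df = pd.DataFrame({'ID': [3009, 119, 130],
                   # the last value is a made-up example with two hyphens
                   'V': ['IGHV7-B*01', 'IGHV6-A*01', 'IGHV4-L-x*03']})

# n=1 splits at the first '-' only, so the result always has two columns (0 and 1)
parts = df['V'].str.split('-', n=1, expand=True)

# assign each piece explicitly; 'V' is overwritten with the part before the '-'
df['V'] = parts[0]
df['allele'] = parts[1]
print(df)
```

Without n=1, the hypothetical third value would be split into three pieces.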
To store the data in a new dataframe, use the same approach, just with the new dataframe:
tmpDF = pd.DataFrame(columns=['A','B'])
tmpDF[['A','B']] = df['V'].str.split('-', expand=True)
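Since str.split(..., expand=True) already returns a DataFrame, another option is to keep that result and rename its columns rather than pre-creating an empty frame. This is only a sketch of that alternative, using a shortened copy of the sample data:

```python
import pandas as pd

df = pd.DataFrame({'V': ['IGHV7-B*01', 'IGHV6-A*01', 'IGHV4-L*03']})

# the split result is a DataFrame whose columns are labelled 0 and 1
tmpDF = df['V'].str.split('-', expand=True)

# rename the columns to the names used in the answer above
tmpDF.columns = ['A', 'B']
print(tmpDF)
```

The new frame keeps the index of df, so the rows can still be matched back to the original data.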
Finally (and more useful for my purposes), if you only need part of the string value (i.e. the text before '-'), you can use .str.split(...).str[idx], like:
df['V'] = df['V'].str.split('-').str[0]
df
ID V Prob
0 3009 IGHV7 1.0000
1 129 IGHV7 1.0000
2 119 IGHV6 0.8000
3 120 GHV6 0.8056
This splits each 'V' value into a list on the separator '-' and stores the first item back in the column.
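To make the two steps visible, here is a small sketch that keeps the intermediate lists in a separate variable. The missing value in the sample is an assumption added for illustration; the question's data has none:

```python
import pandas as pd

# the None entry is hypothetical, added to show how missing values behave
s = pd.Series(['IGHV7-B*01', 'GHV6-A*01', None])

# step 1: each non-missing string becomes a list of pieces
lists = s.str.split('-')

# step 2: .str[idx] indexes into each list; missing values stay missing
first = lists.str[0]    # text before the '-'
last = lists.str[-1]    # text after the '-'
print(first)
print(last)
```

Because the .str accessor skips missing values, no explicit check for them is needed here.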
Use the following:
df['allele'] = [x.split('-')[-1] for x in df['V']]
df['V'] = [x.split('-')[0] for x in df['V']]
df.head(3)
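Note that the order of the two assignments matters: 'allele' has to be computed before 'V' is overwritten. A sketch of a variant that splits each value only once, assuming every value is a string containing at least one '-' (pairs is a name introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame({'ID': [3009, 119],
                   'V': ['IGHV7-B*01', 'IGHV6-A*01']})

# split each string once at the first '-' and keep both halves
pairs = [x.split('-', 1) for x in df['V']]

# with both halves already stored, the assignment order no longer matters
df['V'] = [p[0] for p in pairs]
df['allele'] = [p[1] for p in pairs]
print(df)
```

Unlike the .str accessor, a plain list comprehension raises an error on missing values, because x.split is only defined for strings.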