Splitting the values of a column into new columns within a dataframe to filter
I have a dataframe which looks as follows:
Head1 Header2
ABC SAP (+115590), GRN (+426250)
EFG HES3 (-6350), CMT (-1902)
HIJ CORT (-19440), API (+177)
KLM AAD (-25488), DH(-1341) ,DSQ(+120001)
SOS MFA (-11174), 13A2 (+19763)
I need to split the second column on commas and create new columns within the same dataframe. In addition, I need to take out all the values within the brackets and create another column with that numeric information so that I can filter on it.
So far I am able to do it with a not-so-elegant and rather lengthy piece of code, as follows:
import pandas as pd

Trans = 'file.txt'
Trans = pd.read_csv(Trans, sep="\t", header=0)
Trans.columns = ["RNA", "PCs"]
# Here I changed the dtype to string to do the split
Trans.PCs = Trans.PCs.astype(str)
# I took the first part of the second column out into a new column, PC1
Trans["PC1"] = Trans.PCs.str.extract(r'(\w*)', expand=True)
# Here I split off the numeric information from the first part
Trans[['Strand1', 'Dis1']] = Trans.PCs.str.extract(r'([+-])(\d*)', expand=True)
Trans.head()
Head Header2 Head1 Strand1 Dis1
ABC SAP (+115590), GRN (+426250) SAP + 115590
EFG HES3 (-6350), CMT (-1902) HES3 - 6350
HIJ CORT (-19440), API (+177) CORT - 19440
KLM AAD (-25488), DH(-1341) ,DSQ(+120001) AAD - 25488
SOS MFA (-11174), 13A2 (+19763) MFA - 11174
I need to split the above dataframe again, so I am using the following piece of code for the second part of column 2:
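# This handles the second part of the second column
Trans["PC2"] = Trans.PCs.str.split(',').str.get(1)
# This extracts the numeric information
Trans[['Strand2', 'Dis2']] = Trans.PC2.str.extract(r'([+-])(\d*)', expand=True)
# regex=True is needed on recent pandas versions, where str.replace treats the pattern literally by default
Trans['PC2'] = Trans.PC2.str.replace(r"\(.*\)", "", regex=True)
# At this point the dataframe looks like this,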
Head Header2 Head1 Strand1 Dis1 Head2 Strand2 Dis2
ABC SAP (+115590), GRN (+426250) SAP + 115590 GRN + 426250
EFG HES3 (-6350), CMT (-1902) HES3 - 6350 CMT - 1902
HIJ CORT (-19440), API (+177) CORT - 19440 API + 177
KLM AAD (-25488), DH(-1341) ,DSQ(+120001) AAD - 25488 DH - 1341
SOS MFA (-11174), 13A2 (+19763) MFA - 11174 13A2 + 19763
Trans = Trans.fillna(0)
Trans.Dis1 = Trans.Dis1.astype(int)
Trans.Dis2 = Trans.Dis2.astype(int)
# Here I am filtering the rows based on the Dis1 and Dis2 columns of the dataframe
Trans_Pc1 = Trans.loc[:, "RNA":"Dis1"].query('Dis1 >= 100000')
Trans_Pc2 = Trans.loc[:, "PC2":"Dis2"].query('Dis2 >= 100000')
TransPC1 = Trans_Pc1.PC1
TransPC2 = Trans_Pc2.PC2
TransPCs = pd.concat([TransPC1, TransPC2])
The result looks like this:
Header
SAP
GRN
DSQ
Even though the script is lengthy, it works. But I have a problem when the second column has rows with more than 2 comma-separated values, as in this row:
KLM AAD (-25488), DH(-1341) ,DSQ(+120001)
It has three comma-separated values. I know I would have to repeat the split again, but my dataframe is really big and has many rows with unequal numbers of comma-separated values. For example, some rows have 2 comma-separated values in column 2, some have 5, and so on.
Any better way to filter my dataframe would be great. In the end, I am aiming for a dataframe as follows:
header
SAP
GRN
DSQ
Any help or suggestions would be really great.
Try:
import pandas as pd

df = pd.DataFrame(
    [
        ['ABC', 'SAP (+115590), GRN (+426250)'],
        ['EFG', 'HES3 (-6350), CMT (-1902)'],
        ['HIJ', 'CORT (-19440), API (+177)'],
        ['KLM', 'AAD (-25488), DH(-1341) ,DSQ(+120001)'],
        ['SOS', 'MFA (-11174), 13A2 (+19763)'],
    ], columns=['Head1', 'Header2'])

# Split Header2 on commas; each piece lands in its own column
df1 = df.Header2.str.split(',', expand=True)
# Named groups: the name, the sign, and the text after the sign inside the brackets
regex = r'(?P<Head>\w+).*\((?P<Strand>[+-])(?P<Dis>.*)\)'
# Apply the regex to the pieces of each row
extract = lambda df: df.iloc[0].str.extract(regex, expand=True)
extracted = df1.groupby(level=0).apply(extract)
# Move the pieces into columns, then flatten the (field, piece) labels into names such as Head0
df2 = extracted.stack().unstack([2, 1])
colseries = df2.columns.to_series()
df2.columns = colseries.str.get(0).astype(str) + colseries.str.get(1).astype(str)
pd.concat([df, df2], axis=1)
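If the end goal is only the list of names whose bracketed number is at least 100000, a long-format approach may be simpler than widening the frame. The following is a minimal sketch, not the method used above: it assumes every entry has the form NAME(+/-DIGITS), uses `Series.str.extractall` to get one row per entry regardless of how many entries a cell holds, and filters on the magnitude only, as the query in the question does. The names `pattern`, `long_df`, `threshold`, and `result` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["ABC", "SAP (+115590), GRN (+426250)"],
        ["EFG", "HES3 (-6350), CMT (-1902)"],
        ["HIJ", "CORT (-19440), API (+177)"],
        ["KLM", "AAD (-25488), DH(-1341) ,DSQ(+120001)"],
        ["SOS", "MFA (-11174), 13A2 (+19763)"],
    ],
    columns=["Head1", "Header2"],
)

# Assumed entry format: a name, optional whitespace, then (+digits) or (-digits).
pattern = r"(?P<Head>\w+)\s*\((?P<Strand>[+-])(?P<Dis>\d+)\)"

# extractall returns one row per match, indexed by (original row, match number),
# so rows with 2, 3, or 5 entries are all handled the same way.
long_df = df["Header2"].str.extractall(pattern)
long_df["Dis"] = long_df["Dis"].astype(int)

# Illustrative cutoff, taken from the query in the question.
threshold = 100000
result = long_df.loc[long_df["Dis"] >= threshold, "Head"].reset_index(drop=True)
print(result.tolist())
```

Because the sign is captured separately in `Strand`, a direction-specific filter (for example, only `+` entries) could be added as a second condition on `long_df` if needed.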