[英]How update one dataframe's column by matching columns in two different dataframes in Pandas
[英]Matching subset of two columns of two different dataframes
比较两个不同数据帧的特定列。 计算两个数据帧的子集是匹配还是不匹配。
条件:如果文件small['genes of cluster']
任何元素与big['genes of cluster']
match: 1
,则输出应为: match: 1
。
对于以下示例,仅OR4F16
与两个数据帧匹配。 所以输出: match: 1; unmatch: 3.
match: 1; unmatch: 3.
file1: big <tab separated>
cl nP genes of cluster
1 11 DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138C, FAM138F, FAM138A, OR4F5, LOC729737, LOC102725121, FAM138D
2 4 OR4F16, OR4F3, OR4F29, LOC100132287
3 64 LOC100133331, LOC100288069, FAM87B, LINC00115, LINC01128, FAM41C, LINC02593, SAMD11
4 7 GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC105378591, PRKCZ
file2: small <tab separated>
cl nP genes of cluster
1 11 A, B, C, D
2 4 OR4F16, X, Y, Z
我的代码:Python3
def genes_coordinates(big, small):
b = pd.read_csv(big, header=0, sep="\t")
s = pd.read_csv(small, header=0, sep="\t")
match = 0
unmatch = 0
for index, row in b.iterrows():
if row[row['genes of cluster'].isin(s['genes of cluster'])]:
match+1
else:
unmatch+1
print("match: ", match, "\nunmatch: ", unmatch)
genes_coordinates('big','small')
我会用pandas.merge()跟随列表理解计数。
import pandas as pd
df1 = pd.DataFrame({'cl':[1,2], 'nP':[11,4], 'gene of cluster':[['A', 'B', 'C', 'D'], ['OR4F16', 'X', 'Y', 'Z']]})
df2 = pd.DataFrame({'cl':[1,2,3,4], 'nP':[11,4,64,7], 'gene of cluster':[['DDX11L1', 'MIR6859-3', 'WASH7P', 'MIR1302-2', 'FAM138C', 'FAM138F', 'FAM138A', 'OR4F5', 'LOC729737', 'LOC102725121', 'FAM138D'], ['OR4F16', 'OR4F3', 'OR4F29', 'LOC100132287'], ['LOC100133331', 'LOC100288069', 'FAM87B', 'LINC00115', 'LINC01128', 'FAM41C', 'LINC02593', 'SAMD11'], ['GNB1', 'CALML6', 'TMEM52', 'CFAP74', 'GABRD', 'LOC105378591', 'PRKCZ']]})
df_m = df1.merge(df2, on=['cl', 'nP'], how='outer')
>>>df_m
cl nP gene of cluster_x gene of cluster_y
0 1 11 [A, B, C, D] [DDX11L1, MIR6859-3, WASH7P, MIR1302-2, FAM138...
1 2 4 [OR4F16, X, Y, Z] [OR4F16, OR4F3, OR4F29, LOC100132287]
2 3 64 NaN [LOC100133331, LOC100288069, FAM87B, LINC00115...
3 4 7 NaN [GNB1, CALML6, TMEM52, CFAP74, GABRD, LOC10537...
# An np.nan value is an outright 'unmatch'
found = []
for x in df_m.index:
if isinstance(df_m.iloc[x]['gene of cluster_x'], float):
found.append(0)
else:
if isinstance(df_m.iloc[x]['gene of cluster_y'], float):
found.append(0)
elif any([y in df_m.iloc[x]['gene of cluster_y'] for y in df_m.iloc[x]['gene of cluster_x']]):
found.append(1)
else:
found.append(0)
# The counts
match = sum(found)
unmatch = len(found) - match
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.