Pandas: How to check if any of a list in a dataframe column is present in a range in another dataframe?
I'm trying to compare two bioinformatics DataFrames (one with transcription start and end genomic locations, and one with expression data). I need to check whether any of a list of locations in one DataFrame falls within the ranges defined by the start and end locations in the other DataFrame, returning the rows/ids where they match.
I have tried a number of built-in methods (.isin, .where, .query), but I usually get stuck because the lists are unhashable. I've also tried a nested for loop with iterrows and itertuples, which is exceedingly slow (my actual datasets have thousands of entries).
import pandas as pd

tss_df = pd.DataFrame(data={'id': ['gene1', 'gene2'],
                            'locs': [[21, 23], [34, 39]]})
exp_df = pd.DataFrame(data={'gene': ['geneA', 'geneB'],
                            'start': [15, 31], 'end': [25, 42]})
I'm looking to find that the row with id 'gene1' in tss_df has locations (locs) that fall within the range of 'geneA' in exp_df.
The output would be something like:
output = pd.DataFrame(data={'id': ['gene1', 'gene2'],
                            'locs': [[21, 23], [34, 39]],
                            'match': ['geneA', 'geneB']})
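One way this might be approached without hashing the lists is to explode locs into one row per location and look each value up in an IntervalIndex built from start and end. The following is only a sketch under assumptions that are not stated in the question: the ranges in exp_df do not overlap, both endpoints are inclusive, and the names long_df, intervals, pos, matches and result are made up for illustration.

```python
import pandas as pd

tss_df = pd.DataFrame({'id': ['gene1', 'gene2'],
                       'locs': [[21, 23], [34, 39]]})
exp_df = pd.DataFrame({'gene': ['geneA', 'geneB'],
                       'start': [15, 31], 'end': [25, 42]})

# One row per individual location, so every value is a plain integer.
long_df = tss_df.explode('locs', ignore_index=True).astype({'locs': 'int64'})

# Assumption: closed intervals [start, end] that do not overlap each other
# (get_indexer raises an error on overlapping intervals).
intervals = pd.IntervalIndex.from_arrays(exp_df['start'], exp_df['end'],
                                         closed='both')

# Position of the interval containing each location, or -1 if there is none.
pos = intervals.get_indexer(long_df['locs'].to_numpy())
long_df['match'] = exp_df['gene'].to_numpy()[pos]
long_df = long_df[pos != -1]  # drop locations that fall outside every range

# Collapse back to one row per id, joining the distinct matching genes.
matches = long_df.groupby('id')['match'].agg(lambda s: ', '.join(sorted(set(s))))
result = tss_df.merge(matches, left_on='id', right_index=True, how='left')
print(result)
```

An id with no location inside any range ends up with a missing value in match, because the final merge is a left join.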
Edit: Based on a comment below, I tried playing with merge_asof:
pd.merge_asof(tss_df,exp_df,left_on='locs',right_on='start')
This gave me an incompatible merge keys error, I suspect because I'm comparing a list to an integer, so I split out the first value in locs:
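# Take the first location from each row's list.
# (tss_df['locs'][0] would select the whole list from the first row instead.)
tss_df['loc1'] = [locs[0] for locs in tss_df['locs']]
pd.merge_asof(tss_df, exp_df, left_on='loc1', right_on='start')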
This appears to have worked for my test data, but I'll need to try it with my actual data!
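One caveat with this approach: by default merge_asof only pairs each loc1 with the nearest start at or below it, and it never looks at end, so a location beyond the end of a range would still be reported as a match. A minimal sketch of adding that check is below. The extra 'gene3' row is hypothetical and was added only to show a non-match, merged and inside are illustrative names, and like the attempt above it only tests the first location in each list.

```python
import pandas as pd

# 'gene3' is a hypothetical row added for illustration; it is not in the question.
tss_df = pd.DataFrame({'id': ['gene1', 'gene2', 'gene3'],
                       'locs': [[21, 23], [34, 39], [28, 29]]})
exp_df = pd.DataFrame({'gene': ['geneA', 'geneB'],
                       'start': [15, 31], 'end': [25, 42]})

# First location of each list, as plain integers.
tss_df['loc1'] = [locs[0] for locs in tss_df['locs']]

# merge_asof needs both key columns sorted; it picks the last start <= loc1.
merged = pd.merge_asof(tss_df.sort_values('loc1'),
                       exp_df.sort_values('start'),
                       left_on='loc1', right_on='start')

# Keep the gene only when the location is also at or below that range's end.
inside = merged['loc1'] <= merged['end']
merged['match'] = merged['gene'].where(inside)
print(merged[['id', 'loc1', 'start', 'end', 'match']])
```

Here gene3 starts at 28, which is after geneA's start (15) but also after its end (25), so its match is left missing.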