Group rows where columns have values within range in pandas df
I have a pandas df:
number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472
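I am trying to group rows that have different numbers, but whose start values are within +/- 10 of each other and whose end values are also within +/- 10 of each other, on the same chromosomes.
In this example, I want to find these two rows: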
24 s5 3 17286051 3 17311472
26 s4 3 17286052 3 17311472
If both rows have the same chrom1 [3] and chrom2 [3], and their start and end values are within +/- 10 of each other, they should be grouped under the same number:
24 s5 3 17286051 3 17311472
24 s4 3 17286052 3 17311472 # Change the number to the first seen in this series
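The matching rule described above can be written as a small predicate. This is only a sketch of the criterion as stated in the question; the function name `same_event` and the `window` parameter are made up for illustration.

```python
def same_event(a, b, window=10):
    """Return True if rows a and b share both chromosomes and have
    start and end values within `window` of each other."""
    return (
        a["chrom1"] == b["chrom1"]
        and a["chrom2"] == b["chrom2"]
        and abs(a["start"] - b["start"]) <= window
        and abs(a["end"] - b["end"]) <= window
    )

# Three rows taken from the table above
row_24 = {"chrom1": 3, "start": 17286051, "chrom2": 3, "end": 17311472}
row_25 = {"chrom1": 3, "start": 17286052, "chrom2": 3, "end": 17294628}
row_26 = {"chrom1": 3, "start": 17286052, "chrom2": 3, "end": 17311472}

print(same_event(row_24, row_26))  # True: starts differ by 1, ends are equal
print(same_event(row_25, row_26))  # False: starts match but ends are far apart
```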
Here is what I am trying:
import pandas as pd
from collections import defaultdict

def parse_vars(inFile):
    df = pd.read_csv(inFile, delimiter="\t")
    df = df[['number', 'chrom1', 'start', 'chrom2', 'end']]
    vars = {}
    seen_l = defaultdict(lambda: defaultdict(dict))  # To track the `starts`
    seen_r = defaultdict(lambda: defaultdict(dict))  # To track the `ends`
    for index in df.index:
        event = df.loc[index, 'number']
        c1 = df.loc[index, 'chrom1']
        b1 = int(df.loc[index, 'start'])
        c2 = df.loc[index, 'chrom2']
        b2 = int(df.loc[index, 'end'])
        print([event, c1, b1, c2, b2])
        vars[event] = [c1, b1, c2, b2]
        # Iterate over windows +/- 10
        for i, j in zip(range(b1-10, b1+10), range(b2-10, b2+10)):
            # if:
            #   i in seen_l[c1] AND
            #   j in seen_r[c2] AND
            #   the 'number' for these two instances is the same:
            if i in seen_l[c1] and j in seen_r[c2] and seen_l[c1][i] == seen_r[c2][j]:
                print(seen_l[c1][i], seen_r[c2][j])
                if seen_l[c1][i] != event:
                    print("Seen: %s %s in event %s %s" % (event, [c1, b1, c2, b2], seen_l[c1][i], vars[seen_l[c1][i]]))
        seen_l[c1][b1] = event
        seen_r[c2][b2] = event
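The problem I'm running into is that seen_l[3][17286052] exists for both numbers 25 and 26, and because their respective seen_r events (seen_r[3][17294628] = 25, seen_r[3][17311472] = 26) are not equal, I can't join these rows together.
Is there a way that I can use a list of start values as a nested key for the seen_l dict?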
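One possible way around the overwriting described above is to keep a set of row indices per position instead of a single number, so that both 25 and 26 stay recorded at start 17286052. The sketch below is an assumption about how that could look, not code from the question; `find_matches`, `rows`, and `window` are hypothetical names. It collects every earlier row whose start lies in the window, does the same for the end, and intersects the two sets so that a single earlier row has to satisfy both conditions.

```python
from collections import defaultdict

def find_matches(rows, window=10):
    """rows: list of (number, chrom1, start, chrom2, end) tuples.
    Returns (number, earlier_number) pairs for rows whose start and end
    are both within `window` of an earlier row with a different number."""
    seen_l = defaultdict(lambda: defaultdict(set))  # chrom1 -> start -> row indices
    seen_r = defaultdict(lambda: defaultdict(set))  # chrom2 -> end -> row indices
    numbers = []   # number of each row already processed, by row index
    matches = []
    for idx, (number, c1, b1, c2, b2) in enumerate(rows):
        left, right = set(), set()
        # Earlier rows whose start is within +/- window (inclusive)
        for pos in range(b1 - window, b1 + window + 1):
            left |= seen_l[c1].get(pos, set())
        # Earlier rows whose end is within +/- window (inclusive)
        for pos in range(b2 - window, b2 + window + 1):
            right |= seen_r[c2].get(pos, set())
        # The same earlier row must match on both start and end
        for other in sorted(left & right):
            if numbers[other] != number:
                matches.append((number, numbers[other]))
        seen_l[c1][b1].add(idx)
        seen_r[c2][b2].add(idx)
        numbers.append(number)
    return matches

rows = [
    (24, 3, 17286051, 3, 17311472),
    (25, 3, 17286052, 3, 17294628),
    (26, 3, 17286052, 3, 17311472),
]
print(find_matches(rows))  # [(26, 24)]
```

Intersecting row indices rather than numbers avoids pairing a start from one row of an event with an end from a different row of the same event.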
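Interval overlaps are easy in pyranges. Most of the code below splits the starts and ends into two separate dfs. They are then joined on interval overlap with a slack of +-10: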
from io import StringIO
import pandas as pd
import pyranges as pr
c = """number sample chrom1 start chrom2 end
1 s1 1 0 2 1500
2 s1 2 10 2 50
19 s2 3 3098318 3 3125700
19 s3 3 3098720 3 3125870
20 s4 3 3125694 3 3126976
20 s1 3 3125694 3 3126976
20 s1 3 3125695 3 3126976
20 s5 3 3125700 3 3126976
21 s3 3 3125870 3 3134920
22 s2 3 3126976 3 3135039
24 s5 3 17286051 3 17311472
25 s2 3 17286052 3 17294628
26 s4 3 17286052 3 17311472
26 s1 3 17286052 3 17311472
27 s3 3 17286405 3 17294550
28 s4 3 17293197 3 17294628
28 s1 3 17293197 3 17294628
28 s5 3 17293199 3 17294628
29 s2 3 17294628 3 17311472"""
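df = pd.read_table(StringIO(c), sep=r"\s+")  # raw string avoids an invalid-escape warning
# .copy() makes it explicit that df1 and df2 are independent of df
df1 = df[["chrom1", "start", "number", "sample"]].copy()
df1.insert(2, "end", df.start + 1)
df2 = df[["chrom2", "end", "number", "sample"]].copy()
# "start" is inserted after "end" here, so after the renaming below Start holds
# the original end and End holds end - 1 (visible as Start_b/End_b in the output)
df2.insert(2, "start", df.end - 1)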
names = ["Chromosome", "Start", "End", "number", "sample"]
df1.columns = names
df2.columns = names
gr1, gr2 = pr.PyRanges(df1), pr.PyRanges(df2)
j = gr1.join(gr2, slack=10)
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# | Chromosome | Start | End | number | sample | Start_b | End_b | number_b | sample_b |
# | (category) | (int32) | (int32) | (int64) | (object) | (int32) | (int32) | (int64) | (object) |
# |--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------|
# | 3 | 3125694 | 3125695 | 20 | s4 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125694 | 3125695 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125695 | 3125696 | 20 | s1 | 3125700 | 3125699 | 19 | s2 |
# | 3 | 3125700 | 3125701 | 20 | s5 | 3125700 | 3125699 | 19 | s2 |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 25 | s2 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s5 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s1 |
# | 3 | 17294628 | 17294629 | 29 | s2 | 17294628 | 17294627 | 28 | s4 |
# +--------------+-----------+-----------+-----------+------------+-----------+-----------+------------+------------+
# Unstranded PyRanges object has 13 rows and 9 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome.
# to get the data as a pandas df:
jdf = j.df
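The join above pairs positions but does not renumber the rows. As a pandas-only sketch of the relabeling the question asks for (changing the number to the first one seen in a series), one option is a self-merge on the chromosome columns followed by a distance filter. This is not part of the pyranges answer; `regroup`, `window`, and the helper column `row` are names invented for this illustration, and the self-merge is quadratic per chromosome pair, so it is only reasonable for small tables.

```python
import pandas as pd

def regroup(df, window=10):
    """Give each row the number of the earliest earlier row that has the same
    chrom1/chrom2 and start and end values within `window`."""
    df = df.reset_index(drop=True)
    df["row"] = df.index
    # Pair every row with every other row on the same chromosomes
    pairs = df.merge(df, on=["chrom1", "chrom2"], suffixes=("", "_b"))
    close = pairs[
        (pairs["row_b"] < pairs["row"])
        & ((pairs["start"] - pairs["start_b"]).abs() <= window)
        & ((pairs["end"] - pairs["end_b"]).abs() <= window)
    ]
    # Earliest matching row for each row that has at least one match
    first = close.groupby("row")["row_b"].min()
    numbers = df["number"].tolist()
    # Process in row order so chained matches inherit the first number seen
    for row, earliest in first.sort_index().items():
        numbers[row] = numbers[earliest]
    out = df.drop(columns="row")
    out["number"] = numbers
    return out

example = pd.DataFrame({
    "number": [24, 25, 26, 26],
    "sample": ["s5", "s2", "s4", "s1"],
    "chrom1": [3, 3, 3, 3],
    "start": [17286051, 17286052, 17286052, 17286052],
    "chrom2": [3, 3, 3, 3],
    "end": [17311472, 17294628, 17311472, 17311472],
})
print(regroup(example)["number"].tolist())  # [24, 25, 24, 24]
```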