How to merge two files using for-loop, if-else?
Given the following data:
data01 =
pos ids sample1_value sample2_value
2969 a:b:c 12:13:15 12:13:15
3222 a:b:c 13:13:16 21:33:41
3416 a:b:c 19:13:18 21:33:41
5207 a:b:c 11:33:41 91:33:41
5238 a:b:c 21:13:45 31:27:63
5398 a:b:c 31:27:63 28:63:41
5403 a:b:c 15:7:125 71:33:41
5426 a:b:c 12:13:25 82:25:14
5434 a:b:c 12:17:15 52:33:52
Say I have computed another id (d) value for each sample, but not for every row.
data02 =
pos ids sample1_value sample2_value
2969 d 21 96
3416 d 52 85
5207 d 63 85
5398 d 27 52
5403 d 63 52
5434 d 81 63
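Problem:
I want to write this value of d back onto every row, for each sample.
Is it possible to write the values back using a for loop?
Expected final output: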
pos ids sample1_value sample2_value
2969 a:b:c:d 12:13:15:21 12:13:15:.
3222 a:b:c:d 13:13:16:. 21:33:41:.
3416 a:b:c:d 19:13:18:52 21:33:41:.
................................
.......................... in the same way as above
I tried the following code, for sample1 only:
data01 = open('data01.txt', 'r')
header01 = data01.readline()
data01 = data01.read().rstrip('\n').split('\n')
# similar code for data02
data01_new = open('data01_new.txt', 'w')
data01_new.write(header01 + '\n')
for lines in data01:
    values01 = lines.split('\t')
    pos01 = values01[0]
    ids01 = values01[1]
    sample1_val01 = values01[2]
    for lines in data02:
        values02 = lines.split('\t')
        pos02 = values02[0]
        ids02 = values02[1]
        sample1_val02 = values02[2]
        if pos01 == pos02:
            data01_update = open('data01_new.txt', 'a')
            data01_update.write('\t'.join([pos01, ids01 + ':' + ids02, sample1_val01 + ':' + sample1_val02]))
        else:
            data01_update = open('data01_new.txt', 'a')
            data01_update.write('\t'.join([pos01, ids01 + ':' + ids02, sample1_val01 + ':' + '.']))
Is it possible to solve this problem using for-loop and if-else?
If not, how can I solve it using pandas?
Here is one way: first merge the two data frames on pos, then join ids, sample1 and sample2, and finally keep only the required columns.
data = data1.merge(data2, on = 'pos',how = 'outer').fillna('.')
data['ids'] = data['ids_x'] + ':'+ data['ids_y']
data['sample1_value'] = (data['sample1_value_x'].astype(str) + ':' +
                         data['sample1_value_y'].astype(str))
data['sample2_value'] = (data['sample2_value_x'].astype(str) + ':' +
                         data['sample2_value_y'].astype(str))
data = data[['pos', 'ids', 'sample1_value', 'sample2_value']]
pos ids sample1_value sample2_value
0 2969 a:b:c:d 12:13:15:21.0 12:13:15:96.0
1 3222 a:b:c:. 13:13:16:. 21:33:41:.
2 3416 a:b:c:d 19:13:18:52.0 21:33:41:85.0
3 5207 a:b:c:d 11:33:41:63.0 91:33:41:85.0
4 5238 a:b:c:. 21:13:45:. 31:27:63:.
5 5398 a:b:c:d 31:27:63:27.0 28:63:41:52.0
6 5403 a:b:c:d 15:7:125:63.0 71:33:41:52.0
7 5426 a:b:c:. 12:13:25:. 82:25:14:.
8 5434 a:b:c:d 12:17:15:81.0 52:33:52:63.0
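The output above differs from the expected output in two ways: the new values show up as floats (21.0), and unmatched rows get a:b:c:. instead of a:b:c:d. A minimal sketch of one way to avoid both, assuming the files are whitespace-separated and that the new id is the same (d) on every row of data02. The inline sample text, the `_new` suffix and the `new_id` variable are illustrative choices, not part of the original answer:

```python
import io
import pandas as pd

# Small inline samples standing in for data01.txt / data02.txt.
DATA01 = """pos ids sample1_value sample2_value
2969 a:b:c 12:13:15 12:13:15
3222 a:b:c 13:13:16 21:33:41
3416 a:b:c 19:13:18 21:33:41
"""
DATA02 = """pos ids sample1_value sample2_value
2969 d 21 96
3416 d 52 85
"""

# dtype=str keeps 21 as "21"; otherwise the NaN introduced by the merge
# forces the numeric columns to float and 21 is printed as 21.0.
data1 = pd.read_csv(io.StringIO(DATA01), sep=r"\s+", dtype=str)
data2 = pd.read_csv(io.StringIO(DATA02), sep=r"\s+", dtype=str)

# A left merge keeps every row of data1, matched or not.
merged = data1.merge(data2, on="pos", how="left", suffixes=("", "_new"))

# Assumption: data02 carries a single new id, appended on every row.
new_id = data2["ids"].iloc[0]

result = pd.DataFrame({"pos": merged["pos"]})
result["ids"] = merged["ids"] + ":" + new_id
for col in ["sample1_value", "sample2_value"]:
    # Rows without a match get "." as the placeholder value.
    result[col] = merged[col] + ":" + merged[col + "_new"].fillna(".")

print(result.to_string(index=False))
```

Reading everything as strings is a deliberate choice here, since the values are only concatenated and never used in arithmetic.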
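Here is a correction to your current logic.
Loop through the update file, one line at a time. For each new line, advance the master file to the matching pos, writing out the unmatched lines along the way.
Once you find a match, update the information (you already know how to do that).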
data01 = open('data01.txt', 'r')
header01 = data01.readline()
# similar code for data02
data02 = open('data02.txt', 'r')
header02 = data02.readline()
data01_new = open('data01_new.txt', 'w')
# header01 already ends with a newline, so write it as-is.
data01_new.write(header01)
line01 = data01.readline()
values01 = line01.rstrip('\n').split('\t')
# Compare positions as integers; string comparison would put '10000' before '2969'.
pos01 = int(values01[0])
for line02 in data02:
    # Parse next update line.
    values02 = line02.rstrip('\n').split('\t')
    pos02 = int(values02[0])
    ids02 = values02[1]
    sample1_val02 = values02[2]
    # Find the next line of master file to update.
    # Extract the pos until it matches update pos.
    while pos01 < pos02:
        # Write the previous line (not matched or updated).
        data01_new.write(line01)
        line01 = data01.readline()
        values01 = line01.rstrip('\n').split('\t')
        pos01 = int(values01[0])
    ids01 = values01[1]
    sample1_val01 = values01[2]
    # At this point, you have pos01 == pos02
    # Update the information as needed;
    # put the result into line01,
    # so it gets written on the next "while" iteration.
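The skeleton above leaves the update step and the final flush to the reader. A self-contained sketch of the same idea, working on lists of already-split rows rather than open files, might look like this. The helper names `extend_row` and `merge_sorted` are hypothetical, and the sketch assumes both inputs are sorted by pos and that the new id is the same for every update row:

```python
def extend_row(row, new_id, new_values):
    """Return a master row with the new id and new values appended after ':'."""
    pos, ids, *values = row
    return [pos, ids + ":" + new_id] + [v + ":" + n for v, n in zip(values, new_values)]


def merge_sorted(master_rows, update_rows):
    """Walk the update rows once, advancing through the master rows in step."""
    if not update_rows:
        return [list(row) for row in master_rows]
    new_id = update_rows[0][1]  # assumption: one new id for all rows
    merged = []
    i = 0  # index of the next master row not yet written
    for pos02, _ids02, *new_values in update_rows:
        # Write out master rows that come before this update position.
        while i < len(master_rows) and int(master_rows[i][0]) < int(pos02):
            missing = ["."] * (len(master_rows[i]) - 2)
            merged.append(extend_row(master_rows[i], new_id, missing))
            i += 1
        # On a match, append the update values to the master row.
        if i < len(master_rows) and int(master_rows[i][0]) == int(pos02):
            merged.append(extend_row(master_rows[i], new_id, new_values))
            i += 1
    # Flush master rows that come after the last update position.
    for row in master_rows[i:]:
        merged.append(extend_row(row, new_id, ["."] * (len(row) - 2)))
    return merged


master = [line.split() for line in """2969 a:b:c 12:13:15 12:13:15
3222 a:b:c 13:13:16 21:33:41
3416 a:b:c 19:13:18 21:33:41
5207 a:b:c 11:33:41 91:33:41""".splitlines()]
updates = [line.split() for line in """2969 d 21 96
3416 d 52 85""".splitlines()]

merged_rows = merge_sorted(master, updates)
for row in merged_rows:
    print("\t".join(row))
```

Positions are compared as integers, so the walk stays correct when pos values have different numbers of digits.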
If the files are sorted by "pos", you can process them one line at a time.
def parse_line(line):
    return line.split()

# f1 and f2 are the two input files, opened for reading with the header
# line already consumed; out_file is the output file, opened for writing.
line1 = f1.readline()
line2 = f2.readline()
while line1:
    pos1, id1, v1, w1 = parse_line(line1)
    pos2, id2, v2, w2 = parse_line(line2)
    if pos2 == pos1:
        out_file.write('{:s}\t{:s}:{:s}\t{:s}:{:s}\t{:s}:{:s}\n'.format(
            pos1, id1, id2, v1, v2, w1, w2))
        line2 = f2.readline()
    else:
        out_file.write('{:s}\t{:s}:{:s}\t{:s}:{:s}\t{:s}:{:s}\n'.format(
            pos1, id1, id2, v1, '.', w1, '.'))
    line1 = f1.readline()
Output:
2969 a:b:c:d 12:13:15:21 12:13:15:96
3222 a:b:c:d 13:13:16:. 21:33:41:.
3416 a:b:c:d 19:13:18:52 21:33:41:85
5207 a:b:c:d 11:33:41:63 91:33:41:85
5238 a:b:c:d 21:13:45:. 31:27:63:.
5398 a:b:c:d 31:27:63:27 28:63:41:52
5403 a:b:c:d 15:7:125:63 71:33:41:52
5426 a:b:c:d 12:13:25:. 82:25:14:.
5434 a:b:c:d 12:17:15:81 52:33:52:63
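The loop above relies on both files being sorted by pos and on the second file not running out before the first. If neither can be guaranteed, a plain for-loop with if-else over a dictionary keyed by pos is one alternative. This is only a sketch: `merge_by_lookup` is a hypothetical helper, the files are simulated with `io.StringIO`, and fields are assumed to be whitespace-separated with a single new id shared by all update rows:

```python
import io


def merge_by_lookup(master_file, update_file, out_file):
    """Append update values to master rows, matching on pos through a dict."""
    update_file.readline()  # skip the header of the update file
    updates = {}
    new_id = "."  # fallback if the update file has no data rows
    for line in update_file:
        fields = line.split()
        if not fields:
            continue
        updates[fields[0]] = fields[2:]  # pos -> [sample1 value, sample2 value]
        new_id = fields[1]
    out_file.write(master_file.readline())  # copy the header unchanged
    for line in master_file:
        fields = line.split()
        if not fields:
            continue
        pos, ids, *values = fields
        if pos in updates:
            extra = updates[pos]
        else:
            extra = ["."] * len(values)  # placeholder for rows without a match
        joined = [v + ":" + e for v, e in zip(values, extra)]
        out_file.write("\t".join([pos, ids + ":" + new_id] + joined) + "\n")


master = io.StringIO("""pos ids sample1_value sample2_value
2969 a:b:c 12:13:15 12:13:15
3222 a:b:c 13:13:16 21:33:41
3416 a:b:c 19:13:18 21:33:41
""")
update = io.StringIO("""pos ids sample1_value sample2_value
3416 d 52 85
2969 d 21 96
""")
out = io.StringIO()
merge_by_lookup(master, update, out)
print(out.getvalue(), end="")
```

The update file is held in memory, which is fine for files of this size. For very large update files, the sorted line-by-line walk uses less memory.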
import numpy as np
import pandas as pd

# Merge two DFs on pos column
df3 = pd.merge(data01, data02, how='left', on='pos', suffixes=['', '_y']).fillna('.')
# Transfer data to a numpy array (np.str was removed from NumPy; use the built-in str)
data = df3.iloc[:, 1:].values.astype(str).reshape(-1, 2, 3).transpose(1, 0, 2)
# Concatenate relevant columns with ':' as delimiter.
df3.iloc[:, 1:4] = np.char.add(np.char.add(data[0], ':'), data[1])
# Take the columns required.
df_final = df3[['pos', 'ids', 'sample1_value', 'sample2_value']]
Out[1372]:
pos ids sample1_value sample2_value
0 2969 a:b:c:d 12:13:15:21.0 12:13:15:96.0
1 3222 a:b:c:. 13:13:16:. 21:33:41:.
2 3416 a:b:c:d 19:13:18:52.0 21:33:41:85.0
3 5207 a:b:c:d 11:33:41:63.0 91:33:41:85.0
4 5238 a:b:c:. 21:13:45:. 31:27:63:.
5 5398 a:b:c:d 31:27:63:27.0 28:63:41:52.0
6 5403 a:b:c:d 15:7:125:63.0 71:33:41:52.0
7 5426 a:b:c:. 12:13:25:. 82:25:14:.
8 5434 a:b:c:d 12:17:15:81.0 52:33:52:63.0
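The NumPy step in this answer is an element-wise string concatenation of two equally shaped arrays. A small stand-alone illustration, using `np.char.add` (the public alias of `np.core.defchararray.add`) on made-up arrays `left` and `right` that mimic two merged rows:

```python
import numpy as np

# Original columns (ids, sample1_value) for two rows.
left = np.array([["a:b:c", "12:13:15"],
                 ["a:b:c", "13:13:16"]])
# Matching columns from the second table; "." marks a row with no match.
right = np.array([["d", "21"],
                  [".", "."]])

# The inner add appends ":" to every element, the outer add appends the new value.
joined = np.char.add(np.char.add(left, ":"), right)
print(joined)
```

The scalar ":" is broadcast against the array, so no explicit loop over rows or columns is needed.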