How to split one column into two columns in Python?
I loaded a contig file into pandas, and it looks like this:
>NODE_1_length_4014_cov_1.97676
1 AATTAATGAAATAAAGCAAGAAGACAAGGTTAGACAAAAAAAAGAG...
2 CAAAGCCTCCAAGAAATATGGGACTATGTGAAAAGACCAAATCTAC...
3 CCTGAAAGTGACGGGGAGAATGGAACCAAGTTGGAAAACACTCTGC...
4 GAGAACTTCCCCAATCTAGCAAGGCAGGCCAACATTCAAATTCAGG...
5 CCACAAAGATACTCCTCGAGAAGAGCAACTCCAAGACACATAATTG...
6 GTTGAAATGAAGGAAAAAATGTTAAGGGCAGCCAGAGAGAAAGGTC...
7 GGGAAGCCCATCAGACTAACAGCGGATCTCTCGGCAGAAACCCTAC...
8 TGGGGGCCAATATTCAACATTCTTAAAGAAAAGAATTTTCAACCCA...
9 GCCAAACTAAGCTTCATAAGCAAAGGAGAAATAAAATCCTTTACAG...
10 AGAGATTTTGTCACCACCAGGCCTGCCTTACAAGAGCTCCTGAAGG...
11 GAAAGGAAAAACCGGTACCAGCCACTGCAAAATCATGCCAAACTGT...
12 CTAGGAAGAAACTGCATCAACTAATGAGCAAAATAACCAGCTAACA...
13 TCAAATTCACACATAACAATATTAACCTTAAATGTAAATGGGCTAA...
14 AGACACAGACTGGCAAATTGGATAAAGAGTCAAGACCCATCAGTGT...
15 ACCCATCTCAAATGCAGAGACACACATAGGCTCAAAATAAAGGGAT...
16 CAAGCAAATGGAAAACAAAAAAAGGCAGGGGTTGCAATCCTAGTCT...
17 TTTAAACCAACAAAGATCAAAAGAGACAAAGAAGGCCATTACATAA...
18 ATTCAACAAGAAGAGCTAACTATCCTAAATATATATGCACCCAATA...
19 TTCATAAAGCAAGTCCTCAGTGACCTACAAAGAGACTTAGACTCCC...
20 GGAGACTTTAACACCCCACTGTCAACATTAGACAGATCAACGAGAC...
21 GATATCCAGGAATTGAACTCAGCTCTGCACCAAGCGGACCTAATAG...
22 CTCCACCCCAAATCAACAGAATATACATTCTTTTCAGCACCACACC...
23 ATTGACCACATAGTTGGAAGTAAAGCTCTCCTCAGCAAATGTAAAA...
24 ACAAACTGTCTCTCAGACCACAGTGCAATCAAATTAGAACTCAGGA...
25 CAAAACTGCTCAACTACATGAAAACTGAACAACCTGCTCCTGAATG...
26 AACAAAATGAAGGCAGAAATAAAGATGTTCTTTGAAACCAATGAGA...
27 TACCAGAATCTCTGGGACGCATTCAAAGCAGTGTGTAGAGGGAAAT...
28 GCCCACAAGAGAAAGCAGGAAAGATCTAAAATTGACACCCTAACAT...
29 CTAGAGAAGCAAGAGCAAACACATTCAAAAGCTAGCAGAAGGCAAG...
...
8540 >NODE_2518_length_56_cov_219
8541 CCCTTGTTGGTGTTACAAAGCCCTTGAACTACATCAGCAAAGACAA...
8542 >NODE_2519_length_56_cov_174
8543 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8544 >NODE_2520_length_56_cov_131
8545 CCCAGGAGACTTGTCTTTGCTGATGTAGTTCAAGAGCTTTGTAACA...
8546 >NODE_2521_length_56_cov_118
8547 GGCTCCCTATCGGCTCGAATTCCGCTCGACTATTATCGAATTCCGC...
8548 >NODE_2522_length_56_cov_96
8549 CCCGCCCCCAGGAGACTTGTCTTTGCTGATAGTAGTCGAGCGGAAT...
8550 >NODE_2523_length_56_cov_74
8551 AGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCTTTGTAACACCGA...
8552 >NODE_2524_length_56_cov_70
8553 TGCTCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCT...
8554 >NODE_2525_length_56_cov_59
8555 GAGACCCTTGTCGGTGTTACAAAGCCCTTTAACTACATCAGCAAAG...
8556 >NODE_2526_length_56_cov_48
8557 CCGACTACTATCGAATTCCGCTCGACTACTATCGAATTCCGCTCGA...
8558 >NODE_2527_length_56_cov_44
8559 CCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATT...
8560 >NODE_2528_length_56_cov_42
8561 GAGACCCTTGTAGGTGTTACAAAGCCCTTGAACTACATCAGCAAAG...
8562 >NODE_2529_length_56_cov_38
8563 GAGACCCTTGTCGGTGTCACAAAGCCCTTGAACTACATCAGCAAAG...
8564 >NODE_2530_length_56_cov_29
8565 GAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGGCATTCT...
8566 >NODE_2531_length_56_cov_26
8567 AGGTTCAAGGGCTTTGTAACACCGACAAGGGTCTCGAAAACATCGG...
8568 >NODE_2532_length_56_cov_25
8569 GAGATGTGTATAAGAGACTTGTCTTTGCTGATGTAGTTCAAGGGCT...
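How can I split this single column into two columns, with the >NODE_... headers in one column and the corresponding sequences in the other? A second problem is that each sequence is spread over multiple lines; how can I join those lines into one string? The result should look like this: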
contig sequence
NODE_1_length_4014_cov_1.97676 AAAAAAAAAAAAAAA
NODE_........ TTTTTTTTTTTTTTT
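Thank you very much.
I can't reproduce your example, but my guess is that you are loading a file with pandas that is not in a tabular format. From your example, your file appears to be formatted like this: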
>Identifier
sequence
>Identifier
sequence
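You have to parse the file first, before you can put the information into a pandas DataFrame. The logic is to loop over each line of the file: if the line starts with '>' (as the '>NODE_...' header lines do), store it as the identifier. If it does not, concatenate it onto the current sequence value. Something like this:
import pandas as pd

testfile = '>NODE_1_length_4014_cov_1.97676\nAAAAAAAATTTTTTCCCCCCCGGGGGG\n>NODE_2518_length_56_cov_219\nAAAAAAAAGCCCTTTTT'.split('\n')
identifiers = []
sequences = []
current_sequence = ''
for line in testfile:
    if line.startswith('>'):
        # New header: store the sequence collected for the previous record, if there is one
        if identifiers:
            sequences.append(current_sequence)
        identifiers.append(line[1:])  # drop the leading '>'
        current_sequence = ''
    else:
        # Sequence line: append it to the record being built
        current_sequence += line.strip()
# Store the sequence of the last record
if identifiers:
    sequences.append(current_sequence)
df = pd.DataFrame({'identifiers': identifiers,
                   'sequences': sequences})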
Whether this code works depends on details of your input that you haven't provided, but it may get you started.
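The same loop can be wrapped in a small generator so that it works on an open file handle as well as on a list of lines. This is only a sketch: it assumes every record starts with a '>' header line, that anything before the first header can be ignored, and that the desired column names are contig and sequence as in the question. The function name parse_fasta_lines is made up for illustration, and io.StringIO stands in for a real file here.

```python
import io

import pandas as pd


def parse_fasta_lines(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA-style lines."""
    identifier = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # New header: emit the previous record before starting the next one
            if identifier is not None:
                yield identifier, ''.join(chunks)
            identifier = line[1:]  # drop the leading '>'
            chunks = []
        elif identifier is not None:
            # Sequence line belonging to the current record
            chunks.append(line)
    # Emit the last record
    if identifier is not None:
        yield identifier, ''.join(chunks)


# Stand-in for an open file; with a real file, pass the handle returned by open(...) instead.
handle = io.StringIO(
    '>NODE_1_length_4014_cov_1.97676\n'
    'AAAAAAAATTTTTT\n'
    'CCCCCCCGGGGGG\n'
    '>NODE_2518_length_56_cov_219\n'
    'AAAAAAAAGCCCTTTTT\n'
)
df = pd.DataFrame(list(parse_fasta_lines(handle)), columns=['contig', 'sequence'])
print(df)
```

Collecting the pieces in a list and joining them once per record avoids repeated string concatenation, which can matter for long contigs. For real FASTA files, a dedicated parser from a bioinformatics library may be a more robust choice than hand-written parsing.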