Python: read a separated TSV file into multiple pandas DataFrames
I have some data in a text file that I want to read into a dictionary with multiple pandas DataFrames as values. The file looks like this:
Name id chr start end coverage
> Gene : 1
source 0 1 0 0 0
1 1 1 3652523 3652641 22.1891
2 2 1 3682333 3682430 49.6224
3 3 1 3683060 3683180 43.2893
4 4 1 3698053 3698151 1.35354
5 5 1 3707549 3707791 55.2881
6 6 1 3722021 3722207 55.107
tank 16 1 0 0 0
> Gene : 2
source 0 1 0 0 0
1 1 1 9292888 9293071 15.288
2 2 1 9355743 9356585 36.3796
3 3 1 9367448 9369525 26.6434
tank 4 1 0 0 0
> Gene : 3
source 0 1 0 0 0
1 1 1 54779719 54781779 2.131
2 2 1 54782325 54782477 3.08497
3 3 1 54785983 54786144 4.00617
4 4 1 54786521 54786956 0.688073
5 5 1 54786957 54787075 4.32773
6 6 1 54787711 54787826 2.91379
7 7 1 54788042 54788097 2.16071
8 8 1 54800597 54801260 2.61596
tank 9 1 0 0 0
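I want a dictionary with the gene ids (1, 2, 3, ...) as keys and, as the corresponding values, pandas DataFrames holding the rows up to the next gene id.
Any hints on how to solve this easily?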
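Process it as plain text. With split("> Gene : ") you get each part as a separate string. The first string contains the headers, and each of the other strings has the gene number on its first line, followed by the data. If you split such a string on the first \n, you get the number and the data separately.
Then you can convert the headers plus the data to a DataFrame without saving anything to disk: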
text = headers + data
df = pd.read_csv(io.StringIO(text), sep=r'\s+')
Minimal working example with the data directly in the code:
text = '''Name id chr start end coverage
> Gene : 1
source 0 1 0 0 0
1 1 1 3652523 3652641 22.1891
2 2 1 3682333 3682430 49.6224
3 3 1 3683060 3683180 43.2893
4 4 1 3698053 3698151 1.35354
5 5 1 3707549 3707791 55.2881
6 6 1 3722021 3722207 55.107
tank 16 1 0 0 0
> Gene : 2
source 0 1 0 0 0
1 1 1 9292888 9293071 15.288
2 2 1 9355743 9356585 36.3796
3 3 1 9367448 9369525 26.6434
tank 4 1 0 0 0
> Gene : 3
source 0 1 0 0 0
1 1 1 54779719 54781779 2.131
2 2 1 54782325 54782477 3.08497
3 3 1 54785983 54786144 4.00617
4 4 1 54786521 54786956 0.688073
5 5 1 54786957 54787075 4.32773
6 6 1 54787711 54787826 2.91379
7 7 1 54788042 54788097 2.16071
8 8 1 54800597 54801260 2.61596
tank 9 1 0 0 0'''
import pandas as pd
import io

#text = open(filename).read()

parts = text.split('> Gene : ')
headers = parts[0]

all_data = {}

for part in parts[1:]:  # skip headers
    # split on first `\n` to get number and data
    number, data = part.split('\n', 1)
    number = int(number)

    # convert to DataFrame and add to dictionary
    text = headers + data
    all_data[number] = pd.read_csv(io.StringIO(text), sep=r'\s+')

    # display it
    print('---')
    print('number:', number)
    print(all_data[number])
Result:
-----
number: 1
Name id chr start end coverage
0 source 0 1 0 0 0.00000
1 1 1 1 3652523 3652641 22.18910
2 2 2 1 3682333 3682430 49.62240
3 3 3 1 3683060 3683180 43.28930
4 4 4 1 3698053 3698151 1.35354
5 5 5 1 3707549 3707791 55.28810
6 6 6 1 3722021 3722207 55.10700
7 tank 16 1 0 0 0.00000
-----
number: 2
Name id chr start end coverage
0 source 0 1 0 0 0.0000
1 1 1 1 9292888 9293071 15.2880
2 2 2 1 9355743 9356585 36.3796
3 3 3 1 9367448 9369525 26.6434
4 tank 4 1 0 0 0.0000
-----
number: 3
Name id chr start end coverage
0 source 0 1 0 0 0.000000
1 1 1 1 54779719 54781779 2.131000
2 2 2 1 54782325 54782477 3.084970
3 3 3 1 54785983 54786144 4.006170
4 4 4 1 54786521 54786956 0.688073
5 5 5 1 54786957 54787075 4.327730
6 6 6 1 54787711 54787826 2.913790
7 7 7 1 54788042 54788097 2.160710
8 8 8 1 54800597 54801260 2.615960
9 tank 9 1 0 0 0.000000
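If you need this in more than one place, the same idea can be wrapped in a function. The following is only a sketch under a few assumptions: the name split_gene_blocks is made up for illustration, the marker lines always look like "> Gene : N" (spacing may vary), and every gene block has at least one data row. The sample text is a shortened excerpt of the data from the question.

```python
import io
import re

import pandas as pd


def split_gene_blocks(text):
    """Return {gene_id: DataFrame} for text laid out like the sample data.

    Hypothetical helper: assumes a header line first, then blocks that
    each start with a "> Gene : N" marker line.
    """
    # Split on the marker lines; allow variable spaces/tabs around "Gene" and ":".
    parts = re.split(r'^>[ \t]*Gene[ \t]*:[ \t]*', text, flags=re.MULTILINE)
    headers = parts[0]
    frames = {}
    for part in parts[1:]:
        # The first line of each part is the gene number, the rest is the data.
        number, data = part.split('\n', 1)
        frames[int(number)] = pd.read_csv(io.StringIO(headers + data), sep=r'\s+')
    return frames


# Shortened excerpt of the data from the question.
sample = '''Name id chr start end coverage
> Gene : 1
source 0 1 0 0 0
1 1 1 3652523 3652641 22.1891
tank 16 1 0 0 0
> Gene : 2
source 0 1 0 0 0
1 1 1 9292888 9293071 15.288
tank 4 1 0 0 0
'''

frames = split_gene_blocks(sample)
print(sorted(frames))   # the gene ids found
print(frames[1])
```

For a real file you would pass open(filename).read() instead of sample. Because read_csv does the parsing, the numeric columns come back as numbers, while Name stays text since it mixes "source"/"tank" with digits.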
Approach: insert the rows between consecutive gene markers into a dict.
import pandas as pd
# create DF that is just raw text for each line in input
df = pd.DataFrame("""Name id chr start end coverage
> Gene : 1
source 0 1 0 0 0
1 1 1 3652523 3652641 22.1891
2 2 1 3682333 3682430 49.6224
3 3 1 3683060 3683180 43.2893
4 4 1 3698053 3698151 1.35354
5 5 1 3707549 3707791 55.2881
6 6 1 3722021 3722207 55.107
tank 16 1 0 0 0
> Gene : 2
source 0 1 0 0 0
1 1 1 9292888 9293071 15.288
2 2 1 9355743 9356585 36.3796
3 3 1 9367448 9369525 26.6434
tank 4 1 0 0 0
> Gene : 3
source 0 1 0 0 0
1 1 1 54779719 54781779 2.131
2 2 1 54782325 54782477 3.08497
3 3 1 54785983 54786144 4.00617
4 4 1 54786521 54786956 0.688073
5 5 1 54786957 54787075 4.32773
6 6 1 54787711 54787826 2.91379
7 7 1 54788042 54788097 2.16071
8 8 1 54800597 54801260 2.61596
tank 9 1 0 0 0""".split("\n"), columns=["raw"])
# find rows that are genes
bri = df[df["raw"].str.contains("> Gene :")].index
bri
dfs = {}
# insert into dict the rows between each gene marker
for i, r in enumerate(bri):
    # .loc slicing is inclusive, so stop one row before the next gene marker
    endrow = len(df) if i == (len(bri) - 1) else bri[i + 1] - 1
    # NB used space delimiter not tab as pasted from SO
    dfs[df.loc[r]["raw"]] = pd.DataFrame(
        df.loc[r + 1:endrow]["raw"].str.split(r"[ ]+").tolist(),
        # header row split into a flat list of column names
        columns=df.loc[0, "raw"].split(),
    )
dfs
{'> Gene : 1': Name id chr start end coverage
0 source 0 1 0 0 0
1 1 1 1 3652523 3652641 22.1891
2 2 2 1 3682333 3682430 49.6224
3 3 3 1 3683060 3683180 43.2893
4 4 4 1 3698053 3698151 1.35354
5 5 5 1 3707549 3707791 55.2881
6 6 6 1 3722021 3722207 55.107
7 tank 16 1 0 0 0,
'> Gene : 2': Name id chr start end coverage
0 source 0 1 0 0 0
1 1 1 1 9292888 9293071 15.288
2 2 2 1 9355743 9356585 36.3796
3 3 3 1 9367448 9369525 26.6434
4 tank 4 1 0 0 0,
'> Gene : 3': Name id chr start end coverage
0 source 0 1 0 0 0
1 1 1 1 54779719 54781779 2.131
2 2 2 1 54782325 54782477 3.08497
3 3 3 1 54785983 54786144 4.00617
4 4 4 1 54786521 54786956 0.688073
5 5 5 1 54786957 54787075 4.32773
6 6 6 1 54787711 54787826 2.91379
7 7 7 1 54788042 54788097 2.16071
8 8 8 1 54800597 54801260 2.61596
9 tank 9 1 0 0 0}
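Note that with this approach the keys are the raw marker strings and, because the cells come from str.split, every value is a string. If you want integer gene ids as keys and numeric columns, something like the following might work. It is a sketch only: raw is a small hand-built stand-in for one value of dfs, clean and numeric_cols are names chosen for illustration, and it assumes chr is always numeric as in the sample (drop it from the list if your data has values such as "X").

```python
import pandas as pd

# Hypothetical stand-in for one entry of `dfs`: every cell is a string.
raw = pd.DataFrame(
    [["source", "0", "1", "0", "0", "0"],
     ["1", "1", "1", "3652523", "3652641", "22.1891"],
     ["tank", "16", "1", "0", "0", "0"]],
    columns=["Name", "id", "chr", "start", "end", "coverage"],
)
dfs = {"> Gene : 1": raw}

# Assumption: these columns hold only numbers; Name stays as text.
numeric_cols = ["id", "chr", "start", "end", "coverage"]

clean = {}
for key, frame in dfs.items():
    # "> Gene : 1" -> 1
    gene_id = int(key.split(":")[1])
    frame = frame.copy()
    for col in numeric_cols:
        frame[col] = pd.to_numeric(frame[col])
    clean[gene_id] = frame

print(clean[1].dtypes)
```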