Python: read a separated TSV file into multiple pandas DataFrames
I have some data in a text file that I want to read into a dictionary with multiple pandas DataFrames as values. The file looks like this:
Name id chr start end coverage
> Gene : 1
source 0 1 0 0 0
1 1 1 3652523 3652641 22.1891
2 2 1 3682333 3682430 49.6224
3 3 1 3683060 3683180 43.2893
4 4 1 3698053 3698151 1.35354
5 5 1 3707549 3707791 55.2881
6 6 1 3722021 3722207 55.107
tank 16 1 0 0 0
> Gene : 2
source 0 1 0 0 0
1 1 1 9292888 9293071 15.288
2 2 1 9355743 9356585 36.3796
3 3 1 9367448 9369525 26.6434
tank 4 1 0 0 0
> Gene : 3
source 0 1 0 0 0
1 1 1 54779719 54781779 2.131
2 2 1 54782325 54782477 3.08497
3 3 1 54785983 54786144 4.00617
4 4 1 54786521 54786956 0.688073
5 5 1 54786957 54787075 4.32773
6 6 1 54787711 54787826 2.91379
7 7 1 54788042 54788097 2.16071
8 8 1 54800597 54801260 2.61596
tank 9 1 0 0 0
I would like to have a dictionary with the gene id (1, 2, 3, ...) as key and, as the corresponding value, a pandas DataFrame holding the lines up to the next gene id.
Any tips on how to solve that easily?
Process it as normal text. Use split("> Gene : ") and you will have every part as a separate string. The first string holds the headers; each of the others holds the data, with the gene number on its first line. If you then split each part on the first \n, you get the number and the data separately.
You can then convert each part to a DataFrame without saving anything to disk:
text = headers + data
df = pd.read_csv(io.StringIO(text), sep=r'\s+')
Minimal working example with the data directly in the code:
text = '''Name id chr start end coverage
> Gene : 1
source 0 1 0 0 0
1 1 1 3652523 3652641 22.1891
2 2 1 3682333 3682430 49.6224
3 3 1 3683060 3683180 43.2893
4 4 1 3698053 3698151 1.35354
5 5 1 3707549 3707791 55.2881
6 6 1 3722021 3722207 55.107
tank 16 1 0 0 0
> Gene : 2
source 0 1 0 0 0
1 1 1 9292888 9293071 15.288
2 2 1 9355743 9356585 36.3796
3 3 1 9367448 9369525 26.6434
tank 4 1 0 0 0
> Gene : 3
source 0 1 0 0 0
1 1 1 54779719 54781779 2.131
2 2 1 54782325 54782477 3.08497
3 3 1 54785983 54786144 4.00617
4 4 1 54786521 54786956 0.688073
5 5 1 54786957 54787075 4.32773
6 6 1 54787711 54787826 2.91379
7 7 1 54788042 54788097 2.16071
8 8 1 54800597 54801260 2.61596
tank 9 1 0 0 0'''
import io
import pandas as pd

#text = open(filename).read()
parts = text.split('> Gene : ')
headers = parts[0]
all_data = {}
for part in parts[1:]:  # skip headers
    # split on first `\n` to get number and data
    number, data = part.split('\n', 1)
    number = int(number)
    # convert to DataFrame and add to dictionary
    block = headers + data
    all_data[number] = pd.read_csv(io.StringIO(block), sep=r'\s+')
    # display it
    print('-----')
    print('number:', number)
    print(all_data[number])
Result:
-----
number: 1
Name id chr start end coverage
0 source 0 1 0 0 0.00000
1 1 1 1 3652523 3652641 22.18910
2 2 2 1 3682333 3682430 49.62240
3 3 3 1 3683060 3683180 43.28930
4 4 4 1 3698053 3698151 1.35354
5 5 5 1 3707549 3707791 55.28810
6 6 6 1 3722021 3722207 55.10700
7 tank 16 1 0 0 0.00000
-----
number: 2
Name id chr start end coverage
0 source 0 1 0 0 0.0000
1 1 1 1 9292888 9293071 15.2880
2 2 2 1 9355743 9356585 36.3796
3 3 3 1 9367448 9369525 26.6434
4 tank 4 1 0 0 0.0000
-----
number: 3
Name id chr start end coverage
0 source 0 1 0 0 0.000000
1 1 1 1 54779719 54781779 2.131000
2 2 2 1 54782325 54782477 3.084970
3 3 3 1 54785983 54786144 4.006170
4 4 4 1 54786521 54786956 0.688073
5 5 5 1 54786957 54787075 4.327730
6 6 6 1 54787711 54787826 2.913790
7 7 7 1 54788042 54788097 2.160710
8 8 8 1 54800597 54801260 2.615960
9 tank 9 1 0 0 0.000000
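The same idea can be wrapped in a small function. This is only a sketch: `read_gene_blocks` is a hypothetical helper name, and using `re.split` instead of `str.split` is an assumption on my part, made so that small spacing differences in the marker line (for example `>Gene: 2`) do not break the parsing. The capture group makes `re.split` keep the gene number between the blocks.

```python
import io
import re

import pandas as pd


def read_gene_blocks(text):
    """Hypothetical helper: return {gene_id: DataFrame} for '> Gene : N' blocks."""
    # The capture group keeps the gene number in the result of re.split.
    marker = r"^>[ \t]*Gene[ \t]*:[ \t]*(\d+)[ \t]*$"
    parts = re.split(marker, text, flags=re.MULTILINE)
    header = parts[0]  # everything before the first marker (the column names)
    blocks = {}
    # After the header, parts alternates: number, block, number, block, ...
    for number, body in zip(parts[1::2], parts[2::2]):
        block = header + body.lstrip("\n")
        blocks[int(number)] = pd.read_csv(io.StringIO(block), sep=r"\s+")
    return blocks


# Shortened sample in the same layout as the question's file.
sample = (
    "Name id chr start end coverage\n"
    "> Gene : 1\n"
    "source 0 1 0 0 0\n"
    "1 1 1 3652523 3652641 22.1891\n"
    "tank 16 1 0 0 0\n"
    ">Gene: 2\n"
    "source 0 1 0 0 0\n"
    "tank 4 1 0 0 0\n"
)
genes = read_gene_blocks(sample)
print(sorted(genes))   # [1, 2]
print(genes[1].shape)  # (3, 6)
```

For a real file you would pass `open(filename).read()` instead of `sample`; since `sep=r"\s+"` matches any whitespace, it should handle both tabs and spaces.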
Approach
Build a dict containing a dataframe for the rows between each of the breaks (the "> Gene :" lines).
import pandas as pd
# create DF that is just raw text for each line in input
df = pd.DataFrame("""Name id chr start end coverage
> Gene : 1
source 0 1 0 0 0
1 1 1 3652523 3652641 22.1891
2 2 1 3682333 3682430 49.6224
3 3 1 3683060 3683180 43.2893
4 4 1 3698053 3698151 1.35354
5 5 1 3707549 3707791 55.2881
6 6 1 3722021 3722207 55.107
tank 16 1 0 0 0
> Gene : 2
source 0 1 0 0 0
1 1 1 9292888 9293071 15.288
2 2 1 9355743 9356585 36.3796
3 3 1 9367448 9369525 26.6434
tank 4 1 0 0 0
> Gene : 3
source 0 1 0 0 0
1 1 1 54779719 54781779 2.131
2 2 1 54782325 54782477 3.08497
3 3 1 54785983 54786144 4.00617
4 4 1 54786521 54786956 0.688073
5 5 1 54786957 54787075 4.32773
6 6 1 54787711 54787826 2.91379
7 7 1 54788042 54788097 2.16071
8 8 1 54800597 54801260 2.61596
tank 9 1 0 0 0""".split("\n"), columns=["raw"])
# find rows that are genes
bri = df[df["raw"].str.contains("> Gene :")].index
dfs = {}
# column names come from the first raw line
columns = df.loc[0, "raw"].split()
# insert into dict rows between each gene
for i, r in enumerate(bri):
    # .loc slicing is inclusive, so stop one row before the next gene line
    endrow = len(df) - 1 if i == len(bri) - 1 else bri[i + 1] - 1
    # NB used space delimiter not tab as pasted from SO
    dfs[df.loc[r, "raw"]] = pd.DataFrame(
        df.loc[r + 1:endrow, "raw"].str.split(r"[ ]+").tolist(),
        columns=columns,
    )
dfs
{'> Gene : 1': Name id chr start end coverage
0 source 0 1 0 0 0
1 1 1 1 3652523 3652641 22.1891
2 2 2 1 3682333 3682430 49.6224
3 3 3 1 3683060 3683180 43.2893
4 4 4 1 3698053 3698151 1.35354
5 5 5 1 3707549 3707791 55.2881
6 6 6 1 3722021 3722207 55.107
7 tank 16 1 0 0 0,
'> Gene : 2': Name id chr start end coverage
0 source 0 1 0 0 0
1 1 1 1 9292888 9293071 15.288
2 2 2 1 9355743 9356585 36.3796
3 3 3 1 9367448 9369525 26.6434
4 tank 4 1 0 0 0,
'> Gene : 3': Name id chr start end coverage
0 source 0 1 0 0 0
1 1 1 1 54779719 54781779 2.131
2 2 2 1 54782325 54782477 3.08497
3 3 3 1 54785983 54786144 4.00617
4 4 4 1 54786521 54786956 0.688073
5 5 5 1 54786957 54787075 4.32773
6 6 6 1 54787711 54787826 2.91379
7 7 7 1 54788042 54788097 2.16071
8 8 8 1 54800597 54801260 2.61596
9 tank 9 1 0 0 0}
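Note that the frames built this way hold strings, because they come from `str.split`, and the keys are the full marker lines rather than the gene ids. A possible follow-up step, sketched here with a hypothetical helper `tidy_gene_frames` and a small stand-in dict instead of the full `dfs`, converts the keys to integers and the numeric columns with `pd.to_numeric`:

```python
import pandas as pd


def tidy_gene_frames(dfs):
    """Hypothetical helper: int keys and numeric columns for a dict like `dfs`."""
    numeric_cols = ["id", "chr", "start", "end", "coverage"]
    tidy = {}
    for label, frame in dfs.items():
        gene_id = int(label.split(":")[1])  # '> Gene : 2' -> 2
        frame = frame.copy()
        # Name mixes text ('source', 'tank') with numbers, so it is left as is.
        for col in numeric_cols:
            frame[col] = pd.to_numeric(frame[col])
        tidy[gene_id] = frame
    return tidy


# Small stand-in for the dict built above (all values are strings).
example = {
    "> Gene : 2": pd.DataFrame(
        [
            ["source", "0", "1", "0", "0", "0"],
            ["1", "1", "1", "9292888", "9293071", "15.288"],
            ["tank", "4", "1", "0", "0", "0"],
        ],
        columns=["Name", "id", "chr", "start", "end", "coverage"],
    )
}
tidy = tidy_gene_frames(example)
print(tidy[2].dtypes)
```

With numeric columns, expressions such as `tidy[2]["end"] - tidy[2]["start"]` work directly.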