
Python: read a separated TSV file into multiple pandas DataFrames

I have some data in a text file that I want to read into a dictionary with multiple pandas DataFrames as values. The file looks like this:

Name    id    chr    start    end    coverage
> Gene : 1
source  0       1       0       0       0
1       1       1       3652523 3652641 22.1891
2       2       1       3682333 3682430 49.6224
3       3       1       3683060 3683180 43.2893
4       4       1       3698053 3698151 1.35354
5       5       1       3707549 3707791 55.2881
6       6       1       3722021 3722207 55.107
tank    16      1       0       0       0
> Gene : 2
source  0       1       0       0       0
1       1       1       9292888 9293071 15.288
2       2       1       9355743 9356585 36.3796
3       3       1       9367448 9369525 26.6434
tank    4       1       0       0       0
> Gene : 3
source  0       1       0       0       0
1       1       1       54779719        54781779        2.131
2       2       1       54782325        54782477        3.08497
3       3       1       54785983        54786144        4.00617
4       4       1       54786521        54786956        0.688073
5       5       1       54786957        54787075        4.32773
6       6       1       54787711        54787826        2.91379
7       7       1       54788042        54788097        2.16071
8       8       1       54800597        54801260        2.61596
tank    9       1       0       0       0

I would like a dictionary with the gene id (1, 2, 3, ...) as the key and, as the corresponding value, a pandas DataFrame holding the lines up to the next gene id.

Any tips on how to solve this easily?

Process it as ordinary text. Use split("> Gene : ") and you get every part as a separate string. The first string holds the headers; each of the others holds the data for one gene, with the gene number on its first line. If you split such a part on its first \n, you get the number and the data separately.

You can then convert each part to a DataFrame without saving anything to disk:

text = headers + data
# raw string so that \s is not treated as a Python string escape
df = pd.read_csv(io.StringIO(text), sep=r'\s+')
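To make the intermediate values concrete, here is a small sketch of the same steps on a shortened two-gene sample. The variable name sample and the trimmed data are assumptions made for illustration only; the calls are the ones described above.

```python
import io
import pandas as pd

# Shortened, hypothetical sample in the same layout as the question's file.
sample = (
    "Name id chr start end coverage\n"
    "> Gene : 1\n"
    "source 0 1 0 0 0\n"
    "1 1 1 3652523 3652641 22.1891\n"
    "> Gene : 2\n"
    "source 0 1 0 0 0\n"
)

# parts[0] is the header line; every later part starts with a gene number.
parts = sample.split("> Gene : ")
headers = parts[0]

# Split only on the first newline: gene number on the left, rows on the right.
number, data = parts[1].split("\n", 1)

# Re-attach the header and parse the block entirely in memory.
df = pd.read_csv(io.StringIO(headers + data), sep=r"\s+")
print(number)
print(df)
```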

Minimal working example with the data directly in the code:

text = '''Name    id    chr    start    end    coverage
> Gene : 1
source  0       1       0       0       0
1       1       1       3652523 3652641 22.1891
2       2       1       3682333 3682430 49.6224
3       3       1       3683060 3683180 43.2893
4       4       1       3698053 3698151 1.35354
5       5       1       3707549 3707791 55.2881
6       6       1       3722021 3722207 55.107
tank    16      1       0       0       0
> Gene : 2
source  0       1       0       0       0
1       1       1       9292888 9293071 15.288
2       2       1       9355743 9356585 36.3796
3       3       1       9367448 9369525 26.6434
tank    4       1       0       0       0
> Gene : 3
source  0       1       0       0       0
1       1       1       54779719        54781779        2.131
2       2       1       54782325        54782477        3.08497
3       3       1       54785983        54786144        4.00617
4       4       1       54786521        54786956        0.688073
5       5       1       54786957        54787075        4.32773
6       6       1       54787711        54787826        2.91379
7       7       1       54788042        54788097        2.16071
8       8       1       54800597        54801260        2.61596
tank    9       1       0       0       0'''

import pandas as pd
import io

#text = open(filename).read()

parts = text.split('> Gene : ')

headers = parts[0]

all_data = {}

for part in parts[1:]:  # skip headers

    # split on first `\n` to get number and data
    number, data = part.split('\n', 1)
    number = int(number)

    # convert to DataFrame and add to dictionary
    # (separate name so the original `text` is not overwritten)
    block_text = headers + data
    all_data[number] = pd.read_csv(io.StringIO(block_text), sep=r'\s+')

    # display it
    print('-----')
    print('number:', number)
    print(all_data[number])

Result

-----
number: 1
     Name  id  chr    start      end  coverage
0  source   0    1        0        0   0.00000
1       1   1    1  3652523  3652641  22.18910
2       2   2    1  3682333  3682430  49.62240
3       3   3    1  3683060  3683180  43.28930
4       4   4    1  3698053  3698151   1.35354
5       5   5    1  3707549  3707791  55.28810
6       6   6    1  3722021  3722207  55.10700
7    tank  16    1        0        0   0.00000
-----
number: 2
     Name  id  chr    start      end  coverage
0  source   0    1        0        0    0.0000
1       1   1    1  9292888  9293071   15.2880
2       2   2    1  9355743  9356585   36.3796
3       3   3    1  9367448  9369525   26.6434
4    tank   4    1        0        0    0.0000
-----
number: 3
     Name  id  chr     start       end  coverage
0  source   0    1         0         0  0.000000
1       1   1    1  54779719  54781779  2.131000
2       2   2    1  54782325  54782477  3.084970
3       3   3    1  54785983  54786144  4.006170
4       4   4    1  54786521  54786956  0.688073
5       5   5    1  54786957  54787075  4.327730
6       6   6    1  54787711  54787826  2.913790
7       7   7    1  54788042  54788097  2.160710
8       8   8    1  54800597  54801260  2.615960
9    tank   9    1         0         0  0.000000
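If the spacing around the colon in the marker lines is not always identical, a plain str.split on the exact marker text would miss some of them. The sketch below is one possible variation, not part of the original answer: it uses re.split with a capturing group so the gene numbers come back together with the blocks. The function name read_gene_blocks and the shortened sample are assumptions for illustration.

```python
import io
import re
import pandas as pd

def read_gene_blocks(text):
    """Split text on '> Gene : N' marker lines and return {N: DataFrame}."""
    # With one capturing group, re.split returns
    # [header, id1, block1, id2, block2, ...].
    pieces = re.split(r"^>\s*Gene\s*:\s*(\d+)[ \t\r]*\n", text, flags=re.MULTILINE)
    header, rest = pieces[0], pieces[1:]
    frames = {}
    for gene_id, block in zip(rest[0::2], rest[1::2]):
        # Re-attach the header line so read_csv can name the columns.
        frames[int(gene_id)] = pd.read_csv(io.StringIO(header + block), sep=r"\s+")
    return frames

# Shortened, hypothetical sample with inconsistent marker spacing.
sample = (
    "Name id chr start end coverage\n"
    "> Gene : 1\n"
    "source 0 1 0 0 0\n"
    "1 1 1 3652523 3652641 22.1891\n"
    ">Gene: 2\n"
    "source 0 1 0 0 0\n"
    "tank 4 1 0 0 0\n"
)

frames = read_gene_blocks(sample)
for gene_id, frame in frames.items():
    print("number:", gene_id)
    print(frame)
```

For a real file, the same function can be fed the result of open(filename).read(), as in the commented-out line of the example above.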

Approach

  • read each line of the file as raw text into a DataFrame
  • find the rows that contain the breaks (the "> Gene :" lines)
  • insert into a dict one DataFrame for the rows between each pair of breaks
import pandas as pd

# create DF that is just raw text for each line in input
df = pd.DataFrame("""Name    id    chr    start    end    coverage
> Gene : 1
source  0       1       0       0       0
1       1       1       3652523 3652641 22.1891
2       2       1       3682333 3682430 49.6224
3       3       1       3683060 3683180 43.2893
4       4       1       3698053 3698151 1.35354
5       5       1       3707549 3707791 55.2881
6       6       1       3722021 3722207 55.107
tank    16      1       0       0       0
> Gene : 2
source  0       1       0       0       0
1       1       1       9292888 9293071 15.288
2       2       1       9355743 9356585 36.3796
3       3       1       9367448 9369525 26.6434
tank    4       1       0       0       0
> Gene : 3
source  0       1       0       0       0
1       1       1       54779719        54781779        2.131
2       2       1       54782325        54782477        3.08497
3       3       1       54785983        54786144        4.00617
4       4       1       54786521        54786956        0.688073
5       5       1       54786957        54787075        4.32773
6       6       1       54787711        54787826        2.91379
7       7       1       54788042        54788097        2.16071
8       8       1       54800597        54801260        2.61596
tank    9       1       0       0       0""".split("\n"),  columns=["raw"])
# find rows that are genes
bri = df[df["raw"].str.contains("> Gene :")].index
bri

dfs = {}
# insert into dict rows between each gene
for i, r in enumerate(bri):
    # .loc slices are inclusive, so stop one row before the next marker
    endrow = len(df) if i == (len(bri) - 1) else bri[i + 1] - 1
    # NB used space delimiter not tab as pasted from SO
    dfs[df.loc[r, "raw"]] = pd.DataFrame(
        df.loc[r + 1:endrow, "raw"].str.split(r"[ ]+").tolist(),
        columns=df.loc[0, "raw"].split(),  # flat list of header names
    )

dfs


Output

{'> Gene : 1':      Name  id chr    start      end coverage
 0  source   0   1        0        0        0
 1       1   1   1  3652523  3652641  22.1891
 2       2   2   1  3682333  3682430  49.6224
 3       3   3   1  3683060  3683180  43.2893
 4       4   4   1  3698053  3698151  1.35354
 5       5   5   1  3707549  3707791  55.2881
 6       6   6   1  3722021  3722207   55.107
 7    tank  16   1        0        0        0,
 '> Gene : 2':      Name id chr    start      end coverage
 0  source  0   1        0        0        0
 1       1  1   1  9292888  9293071   15.288
 2       2  2   1  9355743  9356585  36.3796
 3       3  3   1  9367448  9369525  26.6434
 4    tank  4   1        0        0        0,
 '> Gene : 3':      Name id chr     start       end  coverage
 0  source  0   1         0         0         0
 1       1  1   1  54779719  54781779     2.131
 2       2  2   1  54782325  54782477   3.08497
 3       3  3   1  54785983  54786144   4.00617
 4       4  4   1  54786521  54786956  0.688073
 5       5  5   1  54786957  54787075   4.32773
 6       6  6   1  54787711  54787826   2.91379
 7       7  7   1  54788042  54788097   2.16071
 8       8  8   1  54800597  54801260   2.61596
 9    tank  9   1         0         0         0}
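In the output above every cell is still a string, because the rows were produced with str.split. The sketch below is a possible variant of the same approach, not the answer's own code: it labels each line with a running count of marker lines (cumsum), groups on that label, uses integer gene ids as keys, and converts the numeric columns with pd.to_numeric. The shortened sample and the names lines, block_id and num_cols are assumptions for illustration.

```python
import pandas as pd

# Shortened, hypothetical sample in the same layout as the question's file.
lines = [
    "Name id chr start end coverage",
    "> Gene : 1",
    "source 0 1 0 0 0",
    "1 1 1 3652523 3652641 22.1891",
    "> Gene : 2",
    "source 0 1 0 0 0",
    "tank 4 1 0 0 0",
]
raw = pd.Series(lines)

# Running count of marker lines: 0 for the header, 1 for gene 1's lines, ...
is_marker = raw.str.startswith("> Gene :")
block_id = is_marker.cumsum()
columns = raw[0].split()

dfs = {}
for key, block in raw.groupby(block_id):
    if key == 0:
        continue  # the header line is not a gene block
    marker, rows = block.iloc[0], block.iloc[1:]
    gene_id = int(marker.split(":")[1])
    frame = pd.DataFrame(rows.str.split().tolist(), columns=columns)
    # str.split leaves every cell as a string; convert all but the Name column.
    num_cols = columns[1:]
    frame[num_cols] = frame[num_cols].apply(pd.to_numeric)
    dfs[gene_id] = frame

print(dfs[1].dtypes)
```

The Name column is left as text here because it mixes labels such as source and tank with numbers.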
