
Parse tsv with very specific format into python

I have a tsv file containing a network. Here's a snippet. Column 0 contains unique IDs, column 1 contains an alternative ID (not necessarily unique). Each pair of columns after that contains an 'interactor' and a score of interaction.

11746909_a_at A1CF             SHPRH    0.11081568      TRIM10    0.11914056   
11736238_a_at ABCA5           ANKS1A     0.1333185     CCDC90B    0.14495682   
11724734_at   ABCB8             HYKK    0.09577321        LDB3    0.09845833   
11723976_at   ABCC8          FAM161B    0.15087105         ID1    0.14801268   
11718612_a_at ABCD4            HOXC6    0.23559235       LCMT2    0.12867001   
11758217_s_at ABHD17C           FZD7    0.46334574      HIVEP3    0.24272481 

So for example, A1CF connects to SHPRH and TRIM10 with scores of 0.11081568 and 0.11914056 respectively. I'm trying to convert this data into a 'flat' format using pandas which would look like this:

11746909_a_at    A1CF    SHPRH   0.11081568
                         TRIM10  0.11914056 
11736238_a_at    ABCA5   ANKS1A  0.1333185
                         CCDC90B 0.14495682
...... and so on........ ........ ....

Note that each row can have an arbitrary number of (interactor, score) pairs.

I've tried setting columns 0 and 1 as the index, then naming the columns with df.colnames = ['Interactor', 'Weight']*int(df.shape[1]/2), then using pandas.groupby, but so far my attempts have not been successful. Can anybody suggest a way to do this?

Producing an output dataframe like the one you specified above shouldn't be too hard:

from collections import OrderedDict
import pandas as pd


def open_network_tsv(filepath):
    """
    Read the tsv file, returning every line split by tabs
    """
    with open(filepath) as network_file:
        # Iterate over the file lazily instead of loading every line with readlines()
        for line in network_file:
            line_columns = line.strip().split('\t')
            yield line_columns

def get_connections(potential_conns):
    """
    Get the connections of a particular line, grouped
    in interactor:score pairs
    """
    # Step through the fields two at a time; a trailing unpaired field is ignored
    for idx in range(0, len(potential_conns) - 1, 2):
        yield potential_conns[idx], potential_conns[idx + 1]


def create_connections_df(filepath):
    """
    Build the desired dataframe
    """
    connections = OrderedDict({
        'uniq_id': [],
        'alias': [],
        'interactor': [],
        'score': []
    })
    for line in open_network_tsv(filepath):
        uniq_id, alias, *potential_conns = line
        for connection in get_connections(potential_conns):
            connections['uniq_id'].append(uniq_id)
            connections['alias'].append(alias)
            connections['interactor'].append(connection[0])
            # Store the score as a number; the raw TSV field is a string
            connections['score'].append(float(connection[1]))
    return pd.DataFrame(connections)

You could then call dataframe.set_index(['uniq_id', 'alias']) or dataframe.groupby(['uniq_id', 'alias']) on the output afterward.
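
As a rough illustration of that last suggestion, here is a small sketch of both calls. It does not read a file: the frame named flat is a hand-built stand-in, assumed to have the same four columns as the output of create_connections_df, filled with the first two rows of the sample from the question. The variable names flat, indexed and interactors are made up for this example.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by create_connections_df,
# built by hand from the first two rows of the question's sample data.
flat = pd.DataFrame({
    'uniq_id': ['11746909_a_at', '11746909_a_at', '11736238_a_at', '11736238_a_at'],
    'alias': ['A1CF', 'A1CF', 'ABCA5', 'ABCA5'],
    'interactor': ['SHPRH', 'TRIM10', 'ANKS1A', 'CCDC90B'],
    'score': [0.11081568, 0.11914056, 0.1333185, 0.14495682],
})

# Option 1: move the two ID columns into a two-level index.
# When printed, repeated index labels are shown only once by default,
# which resembles the layout asked for in the question.
indexed = flat.set_index(['uniq_id', 'alias'])

# Option 2: group by the two ID columns, here collecting each probe's
# interactors into a list. sort=False keeps first-appearance order.
interactors = (
    flat.groupby(['uniq_id', 'alias'], sort=False)['interactor'].apply(list)
)

print(indexed)
print(interactors)
```

Which of the two fits better probably depends on what comes next: set_index mostly changes how the rows are labelled and displayed, while groupby is the starting point for per-probe aggregation.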
