Parse tsv with very specific format into python
I have a tsv file containing a network. Here's a snippet. Column 0 contains unique IDs, column 1 contains an alternative ID (not necessarily unique). Each pair of columns after that contains an 'interactor' and a score of interaction.
11746909_a_at A1CF SHPRH 0.11081568 TRIM10 0.11914056
11736238_a_at ABCA5 ANKS1A 0.1333185 CCDC90B 0.14495682
11724734_at ABCB8 HYKK 0.09577321 LDB3 0.09845833
11723976_at ABCC8 FAM161B 0.15087105 ID1 0.14801268
11718612_a_at ABCD4 HOXC6 0.23559235 LCMT2 0.12867001
11758217_s_at ABHD17C FZD7 0.46334574 HIVEP3 0.24272481
So for example, A1CF connects to SHPRH and TRIM10 with scores of 0.11081568 and 0.11914056 respectively. I'm trying to convert this data into a 'flat' format using pandas which would look like this:
11746909_a_at A1CF SHPRH 0.11081568
TRIM10 0.11914056
11736238_a_at ABCA5 ANKS1A 0.1333185
CCDC90B 0.14495682
...and so on.
Note that each row can have an arbitrary number of (interactor, score) pairs.
I've tried setting columns 0 and 1 as indexes, then giving the columns names with df.colnames = ['Interactor', 'Weight']*int(df.shape[1]/2), then using pandas.groupby, but so far my attempts have not been successful. Can anybody suggest a way to do this?
Producing an output dataframe like you specified above shouldn't be too hard:
from collections import OrderedDict
import pandas as pd

def open_network_tsv(filepath):
    """
    Read the tsv file, yielding every line split by tabs
    """
    with open(filepath) as network_file:
        for line in network_file:
            line_columns = line.strip().split('\t')
            yield line_columns

def get_connections(potential_conns):
    """
    Get the connections of a particular line, grouped
    in interactor:score pairs
    """
    for idx, val in enumerate(potential_conns):
        if not idx % 2:
            # even positions are interactors; only yield complete pairs
            if len(potential_conns) >= idx + 2:
                yield val, potential_conns[idx + 1]

def create_connections_df(filepath):
    """
    Build the desired dataframe
    """
    connections = OrderedDict({
        'uniq_id': [],
        'alias': [],
        'interactor': [],
        'score': []
    })
    for line in open_network_tsv(filepath):
        uniq_id, alias, *potential_conns = line
        for connection in get_connections(potential_conns):
            connections['uniq_id'].append(uniq_id)
            connections['alias'].append(alias)
            connections['interactor'].append(connection[0])
            connections['score'].append(connection[1])
    return pd.DataFrame(connections)
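As a quick sanity check of the pairing logic, get_connections walks the even positions and only yields a pair when a score actually follows, so a dangling trailing value is silently dropped:

```python
def get_connections(potential_conns):
    """Yield (interactor, score) pairs from a flat list of columns."""
    for idx, val in enumerate(potential_conns):
        if not idx % 2:                          # even positions are interactors
            if len(potential_conns) >= idx + 2:  # only yield complete pairs
                yield val, potential_conns[idx + 1]

# Columns from the first sample row, plus a hypothetical unpaired value.
cols = ["SHPRH", "0.11081568", "TRIM10", "0.11914056", "DANGLING"]
print(list(get_connections(cols)))
# [('SHPRH', '0.11081568'), ('TRIM10', '0.11914056')]
```

Note that the scores come out as strings here; cast the column with pd.to_numeric afterward if you need floats.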
Maybe you can do a dataframe.set_index(['uniq_id', 'alias']) or dataframe.groupby(['uniq_id', 'alias']) on the output afterward.
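If your rows all happen to have the same number of pairs, a pandas-only reshape is also possible: slice the even and odd columns apart, stack them, and recombine. This is a sketch, not part of the answer above; the uniq_id/alias names are just illustrative, and for ragged rows you would need to pass names=range(max_cols) to read_csv so short rows pad with NaN:

```python
import io
import pandas as pd

# Two sample rows from the question (tab-separated).
raw = (
    "11746909_a_at\tA1CF\tSHPRH\t0.11081568\tTRIM10\t0.11914056\n"
    "11736238_a_at\tABCA5\tANKS1A\t0.1333185\tCCDC90B\t0.14495682\n"
)

df = pd.read_csv(io.StringIO(raw), sep="\t", header=None)
df = df.rename(columns={0: "uniq_id", 1: "alias"}).set_index(["uniq_id", "alias"])

# Even positions hold interactors, odd positions hold scores.
inter = df.iloc[:, 0::2]
score = df.iloc[:, 1::2]
inter.columns = range(inter.shape[1])   # align pair positions for stacking
score.columns = range(score.shape[1])

flat = (
    pd.concat({"interactor": inter.stack(), "score": score.stack()}, axis=1)
    .reset_index(level=2, drop=True)    # drop the pair-position level
    .reset_index()
)
print(flat)
```

This yields one row per (interactor, score) pair with uniq_id and alias repeated, and the scores already parsed as floats by read_csv.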