
python: splitting a file based on a key word

I have this file:

GSENumber   Species  Platform  Sample  Age  Tissue   Sex       Count
GSE11097    Rat     GPL1355 GSM280267   4   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280268   4   Liver   Female  Count
GSE11097    Rat     GPL1355 GSM280269   6   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280409   6   Liver   Female  Count
GSE11291    Mouse   GPL1261 GSM284967   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284968   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284969   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284970   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284975   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284976   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284987   5   Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284988   5   Muscle  Female  Count
GSE11291    Mouse   GPL1261 GSM284989   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284990   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284991   30  Muscle  Male    Count

You can see two series here (GSE11097 and GSE11291), and I want a summary for each of them. For each "GSE" number, the output should be a dictionary like this:

Series      Species  Platform AgeRange Tissue   Sex   Count
GSE11097    Rat     GPL1355     4-6    Liver    Mixed    Count
GSE11291    Mouse   GPL1261     5-10   Heart    Male     Count
GSE11291    Mouse   GPL1261     5-30   Muscle   Mixed    Count

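So, I know that one way to do this would be:

  1. Read in the file and make a list of all the GSE numbers.
  2. Then read in the file again and parse it based on the GSE number.

e.g.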

import sys

# First pass: collect the distinct GSE numbers, skipping the header row.
with open(sys.argv[1]) as handle:
    list_of_series = list(set(line.split()[0] for line in handle.readlines()[1:]))

list_of_dicts = []
for each_list in list_of_series:
    temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
    # Second pass: re-read the whole file and keep the rows of this series.
    for line in open(sys.argv[1]).readlines()[1:]:
        line = line.strip().split()
        if line[0] == each_list:
            temp_dict["species"] = line[1]
            temp_dict["platform"] = line[2]
            temp_dict["age"].append(line[4])
            temp_dict["tissue"] = line[5]
            temp_dict["sex"].append(line[6])
            temp_dict["count"] = line[7]
    list_of_dicts.append(temp_dict)

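I think this is cumbersome in two ways:

  1. I have to read the whole file twice (in reality, the file is much bigger than the example here).

  2. The approach overwrites the same dictionary entries with the same words over and over.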
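Both points could be avoided by grouping the rows while reading them once. The following is only a minimal sketch, assuming whitespace-separated columns in the order shown above and grouping by (series, tissue), since the desired output lists Heart and Muscle separately. `group_rows` and `sample` are hypothetical names used for illustration.

```python
from collections import defaultdict

def group_rows(lines):
    """Group data rows by (series, tissue) in a single pass over the lines."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header row.
        if not fields or fields[0] == "GSENumber":
            continue
        # fields[0] is the GSE number, fields[5] is the tissue.
        groups[(fields[0], fields[5])].append(fields)
    return groups

# A few lines from the example file, used here instead of a real file.
sample = [
    "GSENumber   Species  Platform  Sample  Age  Tissue   Sex       Count",
    "GSE11097    Rat     GPL1355 GSM280267   4   Liver   Male    Count",
    "GSE11097    Rat     GPL1355 GSM280268   4   Liver   Female  Count",
    "GSE11291    Mouse   GPL1261 GSM284967   5   Heart   Male    Count",
]
groups = group_rows(sample)
```

With a real file, the same function could be called as `group_rows(open(sys.argv[1]))`, so the file is read only once.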

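There is also a problem with sex: I want to say "if both male and female are present, put 'Mixed' in the dictionary; otherwise, put 'Male' or 'Female'".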
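The rule for sex can be written as a small helper over the values collected for one group. This is a sketch only; `summarize_sex` is a hypothetical name, and it assumes the list of values is not empty.

```python
def summarize_sex(values):
    """Return the single sex seen, or "Mixed" if several distinct values occur.

    Assumes `values` contains at least one entry.
    """
    distinct = set(values)
    if len(distinct) > 1:
        return "Mixed"
    return next(iter(distinct))
```

For example, `summarize_sex(temp_dict["sex"])` could be applied once all rows of a series have been collected.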

I can get this code to work, but I'm wondering whether there are any quick tips for making it cleaner / more Pythonic?

I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.

import pandas as pd

## column widths of the fixed-width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]

## read fixed-width format into a pandas data frame;
## header=0 replaces the file's own header row with `names`
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])

## drop "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)

## group by GSENumber
grouped = df.groupby(df.GSENumber)

## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                         'Platform': lambda x: list(x.unique()),
                         'Age': lambda x: "%d-%d" % (min(x), max(x)),
                         'Tissue': lambda x: list(x.unique()),
                         'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                         'Count': lambda x: list(x.unique())})

print(aggregated)

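This produces almost the result you asked for (it groups by GSENumber only, so the Heart and Muscle rows of GSE11291 end up in one entry) and is much cleaner than parsing the file in pure Python.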

import sys

def main():
    with open(sys.argv[1]) as infile:
        data = read_data(infile)
    result = process_rows(data)
    format_and_print(result, sys.argv[2])

def read_data(file):
    # split each line on whitespace, reading from the file object passed in
    data = [line.strip().split() for line in file]
    data.pop(0) # remove header
    return data


def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict

def process_row(row, data_dict):
    composite_key = row[0] + row[1] + row[5] #assuming this is how you are grouping the entries
    if composite_key in data_dict:
        data_dict[composite_key]['age_range'].add(row[4])
        # a different sex within the same group makes the group 'Mixed'
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'

        #do you need to accumulate the counts? data_dict[composite_key]['count']+=row[7]

    else:
        data_dict[composite_key] = {
           'series': row[0],
           'species': row[1],
           'platform': row[2],
           'age_range': set([row[4]]),
           'tissue': row[5],
           'sex': row[6],
           'count': row[7]
        }

def format_and_print(data_dict, outfile):
    pass
    #you can implement this one :)


if __name__ == "__main__":
    main()
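The `format_and_print` step is left open above. One possible way to fill it in is sketched below, assuming the dictionary layout built by `process_row` (ages stored as a set of strings) and tab-separated output. `age_range` and `format_rows` are hypothetical helpers, and printing a single age without a dash is my own assumption, not something the question specifies.

```python
def age_range(ages):
    """Format a collection of age strings as "min-max" (one value if all are equal)."""
    numbers = sorted(int(age) for age in ages)
    if numbers[0] == numbers[-1]:
        return str(numbers[0])
    return "%d-%d" % (numbers[0], numbers[-1])

def format_rows(data_dict):
    """Turn the grouped dictionary into tab-separated output lines."""
    header = ["Series", "Species", "Platform", "AgeRange", "Tissue", "Sex", "Count"]
    lines = ["\t".join(header)]
    for key in sorted(data_dict):
        entry = data_dict[key]
        lines.append("\t".join([
            entry["series"], entry["species"], entry["platform"],
            age_range(entry["age_range"]), entry["tissue"],
            entry["sex"], entry["count"],
        ]))
    return lines

def format_and_print(data_dict, outfile):
    """Write the summary lines to the file named by `outfile`."""
    with open(outfile, "w") as out:
        out.write("\n".join(format_rows(data_dict)) + "\n")

# Example entry in the shape produced by process_row.
example = {
    "GSE11097RatLiver": {
        "series": "GSE11097", "species": "Rat", "platform": "GPL1355",
        "age_range": {"4", "6"}, "tissue": "Liver",
        "sex": "Mixed", "count": "Count",
    },
}
summary_lines = format_rows(example)
```

Keeping the formatting in `format_rows` separate from the file writing makes the output easy to check without touching the disk.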
