Python: splitting a file based on a keyword

Question

I have this file:

GSENumber   Species  Platform  Sample  Age  Tissue   Sex       Count
GSE11097    Rat     GPL1355 GSM280267   4   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280268   4   Liver   Female  Count
GSE11097    Rat     GPL1355 GSM280269   6   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280409   6   Liver   Female  Count
GSE11291    Mouse   GPL1261 GSM284967   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284968   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284969   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284970   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284975   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284976   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284987   5   Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284988   5   Muscle  Female  Count
GSE11291    Mouse   GPL1261 GSM284989   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284990   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284991   30  Muscle  Male    Count

You can see two series here (GSE11097 and GSE11291), and I want a summary for each of them. For each "GSE" number, the output should be a dictionary along these lines:

Series      Species  Platform AgeRange Tissue   Sex   Count
GSE11097    Rat     GPL1355     4-6    Liver    Mixed    Count
GSE11291    Mouse   GPL1261     5-10   Heart    Male     Count
GSE11291    Mouse   GPL1261     5-30   Muscle   Mixed    Count

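So, I know one way to do this would be:

Read in the file and build a list of all the GSE numbers.
Then read in the file again and parse it by GSE number.

For example: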

import sys

# collect the distinct series IDs, skipping the header line
list_of_series = list(set([line.strip().split()[0] for line in open(sys.argv[1]).readlines()[1:]]))

list_of_dicts = []
for each_list in list_of_series:
    # keys are lower-case here so they match the assignments below
    temp_dict={"species":"","platform":"","age":[],"tissue":"","sex":[],"count":""}
    for line in open(sys.argv[1]).readlines()[1:]:
          line = line.strip().split()
          if line[0] == each_list:
                temp_dict["species"] = line[1]
                temp_dict["platform"] = line[2]
                temp_dict["age"].append(line[4])
                temp_dict["tissue"] = line[5]
                temp_dict["sex"].append(line[6])
                temp_dict["count"] = line[7]
    list_of_dicts.append(temp_dict)

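I think this is cumbersome in two ways:

I have to read the whole file twice (in reality, the file is much larger than the example here).
The approach keeps overwriting the same dictionary entries with the same values.

There is also a problem with sex: I want to say "if both Male and Female occur, put 'Mixed' in the dictionary; otherwise put 'Male' or 'Female'".

I can get this code to work, but I would like some quick tips on making it cleaner / more Pythonic.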

Answer 1

I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.

import pandas as pd

## columns widths of the fixed width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]

## read fixed width format into pandas data frame
## (header=0: the first line is the header row and is replaced by `names`)
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])

## drop "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)

## group by GSENumber
grouped = df.groupby(df.GSENumber)

## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                         'Platform': lambda x: list(x.unique()),
                         'Age': lambda x: "%d-%d" % (min(x), max(x)),
                         'Tissue': lambda x: list(x.unique()),
                         'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                         'Count': lambda x: list(x.unique())})

print(aggregated)

This produces almost the result you asked for, and is much cleaner than parsing the file in pure Python.
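The remaining gap is that the desired output has separate rows for Heart and Muscle within GSE11291, which suggests grouping on both GSENumber and Tissue. Below is a minimal sketch of that variation, not a definitive implementation. It assumes the columns are whitespace-separated (so `read_csv` with `sep=r"\s+"` is used instead of `read_fwf`), that a pandas version with named aggregation (0.25 or later) is available, and it inlines a few rows from the question so it runs on its own. The helper names `summarize`, `sex_label` and `age_range` are illustrative only.

```python
import io

import pandas as pd

# A few rows from the question, inlined so the sketch is self-contained.
SAMPLE = """\
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
"""


def sex_label(values):
    """Return the single sex in the group, or "Mixed" if there are several."""
    unique = values.unique()
    return unique[0] if len(unique) == 1 else "Mixed"


def age_range(values):
    """Format the youngest and oldest age in the group as "min-max"."""
    return "%d-%d" % (values.min(), values.max())


def summarize(df):
    """Collapse the samples into one row per (GSENumber, Tissue) pair."""
    return (
        df.groupby(["GSENumber", "Tissue"], sort=False)
        .agg(
            Species=("Species", "first"),
            Platform=("Platform", "first"),
            AgeRange=("Age", age_range),
            Sex=("Sex", sex_label),
            Count=("Count", "first"),
        )
        .reset_index()
    )


df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")
summary = summarize(df)
print(summary.to_string(index=False))
```

Taking `"first"` for Species, Platform and Count assumes those values are constant within a group, as they are in the sample data; if that does not hold for the real file, those columns would need their own aggregation.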

Answer 2

import sys

def main():
    data = read_data(open(sys.argv[1]))
    result = process_rows(data)
    format_and_print(result, sys.argv[2])

def read_data(file):
    # use the file object that was passed in instead of reopening sys.argv[1]
    data = [line.strip().split() for line in file]
    data.pop(0) # remove header
    return data


def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict

def process_row(row, data_dict):
    composite_key = row[0] + row[1] + row[5] #assuming this is how you are grouping the entries
    if composite_key in data_dict:
        data_dict[composite_key]['age_range'].add(row[4])
        # compare the sex column (row[6]) with the sex recorded so far for this group
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'

        #do you need to accumulate the counts? data_dict[composite_key]['count']+=row[7]

    else:
        data_dict[composite_key] = {
           'series': row[0],
           'species': row[1],
           'platform': row[2],
           'age_range': set([row[4]]),
           'tissue': row[5],
           'sex': row[6],
           'count': row[7]
        }

def format_and_print(data_dict, outfile):
    pass
    #you can implement this one :)


if __name__ == "__main__":
    main()
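
The `format_and_print` stub is left open above. One possible way to fill it in is sketched here as a self-contained single-pass version; it is an illustration under stated assumptions, not the answer's own code. It assumes whitespace-separated columns with exactly eight fields per row, groups on a `(series, tissue)` tuple rather than a concatenated string, and stores ages as integers so that the range is numeric (as strings, "10" would sort before "5"). The names `summarize`, `format_rows` and `HEADER` are hypothetical.

```python
import io

HEADER = ["Series", "Species", "Platform", "AgeRange", "Tissue", "Sex", "Count"]


def summarize(lines):
    """Read the rows once and build one summary dict per (series, tissue)."""
    groups = {}
    rows = iter(lines)
    next(rows, None)  # skip the header line
    for line in rows:
        fields = line.split()
        if len(fields) != 8:
            continue  # skip blank or malformed lines
        series, species, platform, _sample, age, tissue, sex, count = fields
        entry = groups.setdefault((series, tissue), {
            "series": series, "species": species, "platform": platform,
            "ages": set(), "tissue": tissue, "sexes": set(), "count": count,
        })
        entry["ages"].add(int(age))  # integers, so min/max compare numerically
        entry["sexes"].add(sex)
    return groups


def format_rows(groups):
    """Turn the grouped data into tab-separated output lines."""
    lines = ["\t".join(HEADER)]
    for entry in groups.values():
        ages, sexes = entry["ages"], entry["sexes"]
        sex = next(iter(sexes)) if len(sexes) == 1 else "Mixed"
        lines.append("\t".join([
            entry["series"], entry["species"], entry["platform"],
            "%d-%d" % (min(ages), max(ages)),
            entry["tissue"], sex, entry["count"],
        ]))
    return lines


# Assumed sample input: a handful of rows taken from the question.
sample = io.StringIO("""\
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
""")
report = format_rows(summarize(sample))
print("\n".join(report))
```

Collecting the sexes in a set and deciding on "Mixed" only at output time keeps the per-row work trivial, and because `summarize` accepts any iterable of lines, an open file object can be passed in directly so a large file is read only once.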
