Python: splitting a file based on a keyword

Question

I have this file:

GSENumber   Species  Platform  Sample  Age  Tissue   Sex       Count
GSE11097    Rat     GPL1355 GSM280267   4   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280268   4   Liver   Female  Count
GSE11097    Rat     GPL1355 GSM280269   6   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280409   6   Liver   Female  Count
GSE11291    Mouse   GPL1261 GSM284967   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284968   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284969   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284970   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284975   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284976   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284987   5   Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284988   5   Muscle  Female  Count
GSE11291    Mouse   GPL1261 GSM284989   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284990   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284991   30  Muscle  Male    Count

You can see two series here (GSE11097 and GSE11291), and I want a summary for each series. For each "GSE" number, the output should be a dictionary like this:

Series      Species  Platform AgeRange Tissue   Sex   Count
GSE11097    Rat     GPL1355     4-6    Liver    Mixed    Count
GSE11291    Mouse   GPL1261     5-10   Heart    Male     Count
GSE11291    Mouse   GPL1261     5-30   Muscle   Mixed    Count

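So, I know one way to do this would be:

Read in the file and make a list of all GSE numbers.
Then read in the file again and parse it based on the GSE number.

For example: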

import sys

# first pass: collect the distinct GSE numbers (skipping the header line)
list_of_series = list(set([line.strip().split()[0] for line in open(sys.argv[1]).readlines()[1:]]))

list_of_dicts = []
for each_list in list_of_series:
    temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
    # second pass: re-read the whole file for every series
    for line in open(sys.argv[1]).readlines()[1:]:
        line = line.strip().split()
        if line[0] == each_list:
            temp_dict["species"] = line[1]
            temp_dict["platform"] = line[2]
            temp_dict["age"].append(line[4])
            temp_dict["tissue"] = line[5]
            temp_dict["sex"].append(line[6])
            temp_dict["count"] = line[7]
    list_of_dicts.append(temp_dict)

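I think this is cumbersome in two ways:

I have to read the whole file twice (in reality, the file is much larger than the example here).
The method keeps overwriting the same dictionary entries with the same values.

There is also a problem with the sex column: I want to say "if both male and female are present, put 'Mixed' in the dictionary; otherwise put 'Male' or 'Female'".

I can make this code work, but I'm wondering whether there are any quick tips for making it cleaner / more pythonic?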
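The Male/Female/Mixed rule described in the question can be sketched with a set: collect the values seen for a group and check how many distinct ones there are. The helper name summarize_sex is hypothetical, and the sketch assumes each group has at least one row.

```python
def summarize_sex(values):
    """Return 'Mixed' if more than one distinct value was seen, else that value."""
    distinct = set(values)
    if len(distinct) > 1:
        return "Mixed"
    # exactly one distinct value (assumes values is not empty)
    return next(iter(distinct))


print(summarize_sex(["Male", "Female", "Male"]))  # prints Mixed
print(summarize_sex(["Male", "Male"]))            # prints Male
```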

Answer 1

I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.

import pandas as pd

## column widths of the fixed-width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]

## read fixed-width format into pandas data frame
## header=0 replaces the file's own header line with the names below
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])

## drop "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)

## group by GSENumber
grouped = df.groupby(df.GSENumber)

## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                         'Platform': lambda x: list(x.unique()),
                         'Age': lambda x: "%d-%d" % (min(x), max(x)),
                         'Tissue': lambda x: list(x.unique()),
                         'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                         'Count': lambda x: list(x.unique())})

print(aggregated)

This produces almost the result you asked for, and it is much cleaner than parsing the file in pure Python.
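For the query-language route mentioned at the start of this answer, here is a minimal sketch using the standard-library sqlite3 module with an in-memory database. The table name samples, the column names, and the choice to group by series and tissue are assumptions made for illustration (grouping on both matches the three-row output in the question); only a few rows from the question are loaded.

```python
import sqlite3

# a few rows copied from the question:
# series, species, platform, sample, age, tissue, sex, count label
rows = [
    ("GSE11097", "Rat", "GPL1355", "GSM280267", 4, "Liver", "Male", "Count"),
    ("GSE11097", "Rat", "GPL1355", "GSM280268", 4, "Liver", "Female", "Count"),
    ("GSE11097", "Rat", "GPL1355", "GSM280269", 6, "Liver", "Male", "Count"),
    ("GSE11291", "Mouse", "GPL1261", "GSM284967", 5, "Heart", "Male", "Count"),
    ("GSE11291", "Mouse", "GPL1261", "GSM284975", 10, "Heart", "Male", "Count"),
    ("GSE11291", "Mouse", "GPL1261", "GSM284988", 5, "Muscle", "Female", "Count"),
    ("GSE11291", "Mouse", "GPL1261", "GSM284989", 30, "Muscle", "Male", "Count"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE samples (series TEXT, species TEXT, platform TEXT, "
    "sample TEXT, age INTEGER, tissue TEXT, sex TEXT, count_label TEXT)"
)
conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# one output row per (series, tissue); 'Mixed' when both sexes occur in a group
query = """
    SELECT series, MIN(species), MIN(platform),
           MIN(age) || '-' || MAX(age) AS age_range,
           tissue,
           CASE WHEN COUNT(DISTINCT sex) > 1 THEN 'Mixed' ELSE MIN(sex) END AS sex,
           MIN(count_label)
    FROM samples
    GROUP BY series, tissue
    ORDER BY series, tissue
"""
summary = conn.execute(query).fetchall()
for record in summary:
    print(record)
conn.close()
```

The whole file is read once when it is loaded into the table, and the "Mixed" rule becomes a single CASE expression instead of per-row bookkeeping.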

Answer 2

import sys

def main():
    with open(sys.argv[1]) as infile:
        data = read_data(infile)
    result = process_rows(data)
    format_and_print(result, sys.argv[2])

def read_data(file):
    # split every non-empty line of the already opened file into fields
    data = [line.split() for line in file if line.strip()]
    data.pop(0) # remove header
    return data


def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict

def process_row(row, data_dict):
    composite_key = row[0] + row[1] + row[5] #assuming this is how you are grouping the entries
    if composite_key in data_dict:
        # store ages as ints so that min/max compare numerically ("10" < "5" as strings)
        data_dict[composite_key]['age_range'].add(int(row[4]))
        # row[6] is the sex column; compare it with the value stored for this group
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'

        #do you need to accumulate the counts? data_dict[composite_key]['count']+=row[7]

    else:
        data_dict[composite_key] = {
           'series': row[0],
           'species': row[1],
           'platform': row[2],
           'age_range': set([int(row[4])]),
           'tissue': row[5],
           'sex': row[6],
           'count': row[7]
        }

def format_and_print(data_dict, outfile):
    pass
    #you can implement this one :)


if __name__ == "__main__":
    main()
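format_and_print is left as an exercise above. One possible way to fill it in is sketched below, assuming tab-separated output with the column order from the question and the dictionary layout built by process_row. The helper format_rows and the example dictionary are hypothetical and only there to make the sketch easy to try on its own.

```python
def format_rows(data_dict):
    """Return one tab-separated summary line per group, header line first."""
    header = ["Series", "Species", "Platform", "AgeRange", "Tissue", "Sex", "Count"]
    lines = ["\t".join(header)]
    for key in sorted(data_dict):
        entry = data_dict[key]
        # int() keeps the range numeric even if the ages were stored as strings
        ages = [int(age) for age in entry["age_range"]]
        age_range = "%d-%d" % (min(ages), max(ages))
        lines.append("\t".join([entry["series"], entry["species"], entry["platform"],
                                age_range, entry["tissue"], entry["sex"],
                                entry["count"]]))
    return lines


def format_and_print(data_dict, outfile):
    # write the formatted lines to the path given as outfile
    with open(outfile, "w") as handle:
        handle.write("\n".join(format_rows(data_dict)) + "\n")


# tiny hand-made example in the shape process_row() produces
example = {
    "GSE11097RatLiver": {"series": "GSE11097", "species": "Rat",
                         "platform": "GPL1355", "age_range": {"4", "6"},
                         "tissue": "Liver", "sex": "Mixed", "count": "Count"},
}
print("\n".join(format_rows(example)))
```

Keeping the formatting in a function that returns lines makes it easy to check the output without touching the file system.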

