
python: splitting a file based on a key word

I have this file:

GSENumber   Species  Platform  Sample  Age  Tissue   Sex       Count
GSE11097    Rat     GPL1355 GSM280267   4   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280268   4   Liver   Female  Count
GSE11097    Rat     GPL1355 GSM280269   6   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280409   6   Liver   Female  Count
GSE11291    Mouse   GPL1261 GSM284967   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284968   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284969   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284970   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284975   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284976   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284987   5   Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284988   5   Muscle  Female  Count
GSE11291    Mouse   GPL1261 GSM284989   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284990   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284991   30  Muscle  Male    Count

You can see two series here (GSE11097 and GSE11291), and I would like a summary for each series. For each "GSE" number, the output should be a dictionary like this:

Series      Species  Platform AgeRange Tissue   Sex   Count
GSE11097    Rat     GPL1355     4-6    Liver    Mixed    Count
GSE11291    Mouse   GPL1261     5-10   Heart    Male     Count
GSE11291    Mouse   GPL1261     5-30   Muscle   Mixed    Count

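So, I know that one way to do this would be:

  1. Read in the file and make a list of all the GSE numbers.
  2. Then read in the file again and parse it based on the GSE number.

For example: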

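import sys

# Collect the distinct series IDs from the first column, skipping the header row.
list_of_series = list(set(line.split()[0] for line in open(sys.argv[1]).readlines()[1:]))

list_of_dicts = []
for each_series in list_of_series:
    # Keys are lower-case throughout so that the assignments below match them.
    temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
    # The whole file is read again for every series.
    for line in open(sys.argv[1]).readlines()[1:]:
        line = line.strip().split()
        if line[0] == each_series:
            temp_dict["species"] = line[1]
            temp_dict["platform"] = line[2]
            temp_dict["age"].append(line[4])
            temp_dict["tissue"] = line[5]
            temp_dict["sex"].append(line[6])
            temp_dict["count"] = line[7]
    list_of_dicts.append(temp_dict)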

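I think this is cumbersome in two ways:

  1. I have to read the whole file twice (in reality, the file is much larger than the example here).

  2. This approach keeps overwriting the same dictionary entries with the same values.

Also, there is a problem with sex. I want to say: if both Male and Female are present, put "Mixed" in the dictionary; otherwise, put "Male" or "Female".

I can get this code to work, but I would like some quick tips for making it cleaner and more Pythonic.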
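The rule for the Sex column described above can be sketched as a small helper. This is only an illustration: `summarize_sex` is a hypothetical name, and it assumes the sex values for one group have already been collected into a list.

```python
def summarize_sex(values):
    """Return the only sex seen in `values`, or "Mixed" if several occur.

    `summarize_sex` is a hypothetical helper name used for illustration.
    """
    distinct = set(values)
    if not distinct:
        raise ValueError("no sex values were collected for this group")
    if len(distinct) == 1:
        return distinct.pop()
    return "Mixed"


mixed_group = summarize_sex(["Male", "Female", "Male"])
single_group = summarize_sex(["Male", "Male"])
print(mixed_group, single_group)
```

Using a set means the rule still works if other labels ever appear in the column, since any group with more than one distinct value is reported as "Mixed".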

I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.

import pandas as pd

## column widths of the fixed-width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]

## read the fixed-width format into a pandas data frame;
## header=0 makes `names` replace the file's own header row
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])

## drop the "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)

## group by GSENumber
grouped = df.groupby("GSENumber")

## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                         'Platform': lambda x: list(x.unique()),
                         'Age': lambda x: "%d-%d" % (min(x), max(x)),
                         'Tissue': lambda x: list(x.unique()),
                         'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                         'Count': lambda x: list(x.unique())})

print(aggregated)

This produces almost exactly what you asked for, and it is much cleaner than parsing the file in pure Python.
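For comparison, the query-language route mentioned at the top of this answer can also be tried from Python with the standard-library `sqlite3` module. The following is a minimal sketch under stated assumptions, not a definitive implementation: the table name `samples`, the column names, and the `summarize` helper are made up for illustration, and only a handful of rows from the sample file are loaded. It groups by series and tissue, which is what the desired output in the question appears to need.

```python
import sqlite3

# A few rows copied from the sample file in the question.
ROWS = [
    ("GSE11097", "Rat", "GPL1355", "GSM280267", 4, "Liver", "Male", "Count"),
    ("GSE11097", "Rat", "GPL1355", "GSM280268", 4, "Liver", "Female", "Count"),
    ("GSE11097", "Rat", "GPL1355", "GSM280269", 6, "Liver", "Male", "Count"),
    ("GSE11291", "Mouse", "GPL1261", "GSM284967", 5, "Heart", "Male", "Count"),
    ("GSE11291", "Mouse", "GPL1261", "GSM284975", 10, "Heart", "Male", "Count"),
    ("GSE11291", "Mouse", "GPL1261", "GSM284988", 5, "Muscle", "Female", "Count"),
    ("GSE11291", "Mouse", "GPL1261", "GSM284989", 30, "Muscle", "Male", "Count"),
]

# One summary row per (series, tissue); the names here are illustrative.
QUERY = """
SELECT series, species, platform,
       MIN(age) || '-' || MAX(age) AS age_range,
       tissue,
       CASE WHEN COUNT(DISTINCT sex) > 1 THEN 'Mixed' ELSE MIN(sex) END AS sex,
       MIN(count_label) AS count_label
FROM samples
GROUP BY series, species, platform, tissue
ORDER BY series, tissue
"""


def summarize(rows):
    """Load the rows into an in-memory database and run the summary query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (series TEXT, species TEXT, platform TEXT, "
        "sample TEXT, age INTEGER, tissue TEXT, sex TEXT, count_label TEXT)"
    )
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


summary = summarize(ROWS)
for row in summary:
    print(row)
```

The pandas code above groups by GSENumber only, so the Heart and Muscle samples of GSE11291 end up in one row. Grouping on both columns, for example `df.groupby(["GSENumber", "Tissue"])`, would split them in the same way as the `GROUP BY` clause here.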

import sys

def main():
    with open(sys.argv[1]) as infile:
        data = read_data(infile)
    result = process_rows(data)
    format_and_print(result, sys.argv[2])

def read_data(file):
    # Split every non-empty line on whitespace.
    data = [line.strip().split() for line in file if line.strip()]
    data.pop(0)  # remove header
    return data


def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict

def process_row(row, data_dict):
    composite_key = row[0] + row[1] + row[5]  # assuming this is how you are grouping the entries
    if composite_key in data_dict:
        data_dict[composite_key]['age_range'].add(row[4])
        # row[6] is the Sex column; a different value means the group is mixed.
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'

        # do you need to accumulate the counts? data_dict[composite_key]['count'] += row[7]

    else:
        data_dict[composite_key] = {
           'series': row[0],
           'species': row[1],
           'platform': row[2],
           'age_range': set([row[4]]),
           'tissue': row[5],
           'sex': row[6],
           'count': row[7]
        }

def format_and_print(data_dict, outfile):
    pass
    # you can implement this one :)


if __name__ == "__main__":
    main()
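The `format_and_print` stub above is left as an exercise. One possible way to fill it in is sketched below, assuming the dictionary layout built by `process_row` (with `age_range` held as a set of age strings) and a tab-separated output file. The helpers `format_age_range` and `format_rows` are hypothetical names introduced only for this sketch.

```python
HEADER = ["Series", "Species", "Platform", "AgeRange", "Tissue", "Sex", "Count"]


def format_age_range(ages):
    """Turn a set of age strings such as {"5", "10"} into "5-10"."""
    # Compare as integers: as strings, "10" would sort before "5".
    numbers = sorted(int(age) for age in ages)
    return "%d-%d" % (numbers[0], numbers[-1])


def format_rows(data_dict):
    """Build the output lines: a header followed by one line per group."""
    lines = ["\t".join(HEADER)]
    for key in sorted(data_dict):
        entry = data_dict[key]
        lines.append("\t".join([
            entry["series"],
            entry["species"],
            entry["platform"],
            format_age_range(entry["age_range"]),
            entry["tissue"],
            entry["sex"],
            entry["count"],
        ]))
    return lines


def format_and_print(data_dict, outfile):
    """Write the summary to `outfile`, one tab-separated line per group."""
    with open(outfile, "w") as handle:
        handle.write("\n".join(format_rows(data_dict)) + "\n")


# A single hand-built entry in the shape that process_row produces.
example = {
    "GSE11291MouseHeart": {
        "series": "GSE11291", "species": "Mouse", "platform": "GPL1261",
        "age_range": {"5", "10"}, "tissue": "Heart",
        "sex": "Male", "count": "Count",
    }
}
rows = format_rows(example)
print("\n".join(rows))
```

A group containing only one age is written as a degenerate range such as "5-5" in this sketch; whether that is acceptable depends on how the summary is consumed.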
