Python: splitting a file based on a keyword

I have this file:

GSENumber   Species  Platform  Sample  Age  Tissue   Sex       Count
GSE11097    Rat     GPL1355 GSM280267   4   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280268   4   Liver   Female  Count
GSE11097    Rat     GPL1355 GSM280269   6   Liver   Male    Count
GSE11097    Rat     GPL1355 GSM280409   6   Liver   Female  Count
GSE11291    Mouse   GPL1261 GSM284967   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284968   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284969   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284970   5   Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284975   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284976   10  Heart   Male    Count
GSE11291    Mouse   GPL1261 GSM284987   5   Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284988   5   Muscle  Female  Count
GSE11291    Mouse   GPL1261 GSM284989   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284990   30  Muscle  Male    Count
GSE11291    Mouse   GPL1261 GSM284991   30  Muscle  Male    Count

You can see that there are two series here (GSE11097 and GSE11291), and I want a summary for each series. The output should be a dictionary like this, for each "GSE" number:

Series      Species  Platform AgeRange Tissue   Sex   Count
GSE11097    Rat     GPL1355     4-6    Liver    Mixed    Count
GSE11291    Mouse   GPL1261     5-10   Heart    Male     Count
GSE11291    Mouse   GPL1261     5-30   Muscle   Mixed    Count

So I know one way to do this would be:

  1. Read in the file and make a list of all the GSE numbers.
  2. Then read in the file again and parse based on GSE number.

For example:

import sys

# First pass: collect the distinct GSE numbers, skipping the header row.
list_of_series = list(set([line.strip().split()[0] for line in open(sys.argv[1]).readlines()[1:]]))

list_of_dicts = []
for each_series in list_of_series:
    temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
    # Second pass: re-read the whole file for every series.
    for line in open(sys.argv[1]).readlines()[1:]:
        line = line.strip().split()
        if line[0] == each_series:
            temp_dict["species"] = line[1]
            temp_dict["platform"] = line[2]
            temp_dict["age"].append(line[4])
            temp_dict["tissue"] = line[5]
            temp_dict["sex"].append(line[6])
            temp_dict["count"] = line[7]
    list_of_dicts.append(temp_dict)

I think this is messy in two ways:

  1. I have to read the whole file twice (in reality, the file is much bigger than the example here).

  2. This method keeps overwriting the same dictionary entry with the same value.

Also, there's a problem with the sex column: if a group contains both male and female samples, I want to put "Mixed" in the dict; otherwise, I want to put "Male" or "Female".
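The sex rule described above could be sketched as a small helper. The name `summarize_sex` is hypothetical (it is not part of the code in the question), and returning an empty string for an empty group is an assumption:

```python
def summarize_sex(sexes):
    """Collapse a group's sex values into a single label.

    Hypothetical helper: returns "Mixed" when more than one distinct
    value occurs, otherwise the single value that was seen.
    """
    distinct = set(sexes)
    if not distinct:
        # Assumption: an empty group gets an empty label.
        return ""
    if len(distinct) > 1:
        return "Mixed"
    return next(iter(distinct))


print(summarize_sex(["Male", "Female", "Male"]))
print(summarize_sex(["Male", "Male"]))
```

Collecting the values in a set first means the check does not depend on the order in which the rows appear.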

I can make this code work, but are there any quick tips to make it cleaner and more Pythonic?

I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.

import pandas as pd

## column widths of the fixed width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]

## read fixed width format into pandas data frame
## (header=0 replaces the file's own header row with the names given below)
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])

## drop "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)

## group by GSENumber
grouped = df.groupby(df.GSENumber)

## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                         'Platform': lambda x: list(x.unique()),
                         'Age': lambda x: "%d-%d" % (min(x), max(x)),
                         'Tissue': lambda x: list(x.unique()),
                         'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                         'Count': lambda x: list(x.unique())})

print(aggregated)

This produces pretty much the result you asked for and is much cleaner than parsing the file in pure Python.
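The expected output in the question has one row per series and tissue, so a variant that groups on both columns may be closer to it. The following is only a sketch under some assumptions: a few rows from the question are inlined as whitespace-separated text so it runs on its own, and the helper names `age_range` and `sex_label` are made up for illustration.

```python
import io

import pandas as pd

# A few rows from the question, inlined so the sketch is self-contained.
SAMPLE = """\
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
"""

# Assumption: columns are separated by runs of whitespace.
df = pd.read_csv(io.StringIO(SAMPLE), sep=r"\s+")


def age_range(ages):
    """Format the youngest and oldest age in a group as 'min-max'."""
    return "%d-%d" % (ages.min(), ages.max())


def sex_label(sexes):
    """Return 'Mixed' if a group has several sexes, else the single value."""
    return "Mixed" if sexes.nunique() > 1 else sexes.iloc[0]


# Group on series and tissue together, then aggregate each column.
summary = (
    df.groupby(["GSENumber", "Tissue"])
      .agg({"Species": "first", "Platform": "first",
            "Age": age_range, "Sex": sex_label, "Count": "first"})
      .reset_index()
)

print(summary)
```

Taking `"first"` for Species, Platform and Count assumes those values are constant within a group, as they are in the sample data.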

import sys

def main():
    data = read_data(open(sys.argv[1]))
    result = process_rows(data)
    format_and_print(result, sys.argv[2])

def read_data(file):
    # use the file object passed in instead of re-opening sys.argv[1]
    data = [line.strip().split() for line in file]
    data.pop(0) # remove header
    return data


def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict

def process_row(row, data_dict):
    composite_key = row[0] + row[1] + row[5] #assuming this is how you are grouping the entries
    if composite_key in data_dict:
        data_dict[composite_key]['age_range'].add(row[4])
        # row[6] is the sex column; compare it with the sex stored for this group
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'

        #do you need to accumulate the counts? data_dict[composite_key]['count']+=row[7]

    else:
        data_dict[composite_key] = {
           'series': row[0],
           'species': row[1],
           'platform': row[2],
           'age_range': set([row[4]]),
           'tissue': row[5],
           'sex': row[6],
           'count': row[7]
        }

def format_and_print(data_dict, outfile):
    pass
    #you can implement this one :)


if __name__ == "__main__":
    main()
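One possible way to fill in `format_and_print` is sketched below. It assumes the dictionary layout built by `process_row` above (a set of age strings under `'age_range'`) and tab-separated output; the helper `format_rows` and the sample entry are made up for illustration.

```python
def format_rows(data_dict):
    """Build one tab-separated line per group (hypothetical helper)."""
    header = ["Series", "Species", "Platform", "AgeRange", "Tissue", "Sex", "Count"]
    lines = ["\t".join(header)]
    for key in sorted(data_dict):
        entry = data_dict[key]
        # Ages were read from the file as strings, so convert before comparing.
        ages = [int(age) for age in entry["age_range"]]
        age_range = "%d-%d" % (min(ages), max(ages))
        lines.append("\t".join([entry["series"], entry["species"],
                                entry["platform"], age_range,
                                entry["tissue"], entry["sex"], entry["count"]]))
    return lines


def format_and_print(data_dict, outfile):
    """Write the summary lines to the path given in outfile."""
    with open(outfile, "w") as handle:
        handle.write("\n".join(format_rows(data_dict)) + "\n")


# Made-up entry in the shape that process_row produces.
example = {
    "GSE11097RatLiver": {
        "series": "GSE11097", "species": "Rat", "platform": "GPL1355",
        "age_range": {"4", "6"}, "tissue": "Liver",
        "sex": "Mixed", "count": "Count",
    },
}
for line in format_rows(example):
    print(line)
```

Keeping the formatting in `format_rows` separate from the file writing makes it easy to check the lines without touching the disk. A group with a single age prints as a range such as `5-5`, which may or may not be the format you want.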
