I have this file:
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284968 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284969 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284970 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284976 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284990 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284991 30 Muscle Male Count
You can see here that there are two series (GSE11097 and GSE11291), and I want a summary for each series. The output should be a dictionary like this, for each "GSE" number:
Series Species Platform AgeRange Tissue Sex Count
GSE11097 Rat GPL1355 4-6 Liver Mixed Count
GSE11291 Mouse GPL1261 5-10 Heart Male Count
GSE11291 Mouse GPL1261 5-30 Muscle Mixed Count
I know one way to do this would be, e.g.:
import sys

# first pass: collect the distinct series names from column 0
list_of_series = list(set([line.strip().split()[0] for line in open(sys.argv[1])]))
list_of_dicts = []
for each_list in list_of_series:
    temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
    # second pass (repeated for every series): re-read the file, skipping the header
    for line in open(sys.argv[1]).readlines()[1:]:
        line = line.strip().split()
        if line[0] == each_list:
            temp_dict["species"] = line[1]
            temp_dict["platform"] = line[2]
            temp_dict["age"].append(line[4])
            temp_dict["tissue"] = line[5]
            temp_dict["sex"].append(line[6])
            temp_dict["count"] = line[7]
    list_of_dicts.append(temp_dict)
I think this is messy in two ways:
I have to read the whole file twice (in reality, the file is much bigger than the example here).
This method keeps overwriting the same dictionary entry with the same word.
Also, there's a problem with the sex: I want to say "if both male and female, put 'Mixed' in the dict; otherwise, put 'Male' or 'Female'".
I can make this code work, but I'm wondering about quick tips to make the code cleaner/more Pythonic.
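The "Mixed" rule and the age range described above can each be reduced to a small helper that works on the values collected for one group. This is only a minimal sketch: the names `summarize_sex` and `summarize_age` are hypothetical, and both assume a non-empty list of values taken from the Sex and Age columns.

```python
def summarize_sex(values):
    """Return 'Mixed' if more than one distinct sex appears, else the single value."""
    distinct = set(values)
    # assumes values is non-empty
    return "Mixed" if len(distinct) > 1 else next(iter(distinct))


def summarize_age(values):
    """Return 'min-max', comparing the ages as integers rather than strings."""
    ages = [int(v) for v in values]
    return "%d-%d" % (min(ages), max(ages))


print(summarize_sex(["Male", "Female", "Male"]))
print(summarize_age(["5", "10", "5"]))
```

Converting the ages to `int` matters because, compared as strings, "10" sorts before "5".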
I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.
import pandas as pd

## column widths of the fixed width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]

## read fixed width format into pandas data frame
## (header=0 marks the first line as the header row that `names` replaces)
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])

## drop "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)

## group by GSENumber
grouped = df.groupby(df.GSENumber)

## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                          'Platform': lambda x: list(x.unique()),
                          'Age': lambda x: "%d-%d" % (min(x), max(x)),
                          'Tissue': lambda x: list(x.unique()),
                          'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                          'Count': lambda x: list(x.unique())})

print(aggregated)
This produces pretty much the result you asked for and is much cleaner than parsing the file in pure Python.
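The desired output in the question has one row per series and tissue (GSE11291 appears once for Heart and once for Muscle), whereas grouping on GSENumber alone gives one row per series. The following is a minimal sketch of a variant that groups on both, under some assumptions: the file is whitespace-delimited rather than fixed width, the data is an abbreviated inline copy of the question's sample, and named aggregation is available (pandas 0.25 or later).

```python
import io

import pandas as pd

# abbreviated copy of the sample data from the question, inlined for illustration
sample = """GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
"""

# split columns on runs of whitespace; the first line supplies the column names
df = pd.read_csv(io.StringIO(sample), sep=r"\s+")

# one output row per (series, species, platform, tissue) combination
summary = (
    df.groupby(["GSENumber", "Species", "Platform", "Tissue"], sort=False)
      .agg(
          AgeRange=("Age", lambda x: "%d-%d" % (x.min(), x.max())),
          Sex=("Sex", lambda x: "Mixed" if x.nunique() > 1 else x.iloc[0]),
          Count=("Count", "first"),
      )
      .reset_index()
)

print(summary)
```

For a real file, `io.StringIO(sample)` would be replaced by the file path.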
import sys

def main():
    with open(sys.argv[1]) as infile:
        data = read_data(infile)
    result = process_rows(data)
    format_and_print(result, sys.argv[2])

def read_data(file):
    # split every non-blank line on whitespace; the file is read only once
    data = [line.split() for line in file if line.strip()]
    data.pop(0)  # remove header
    return data

def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict

def process_row(row, data_dict):
    composite_key = row[0] + row[1] + row[5]  # assuming this is how you are grouping the entries
    if composite_key in data_dict:
        data_dict[composite_key]['age_range'].add(int(row[4]))
        # a second, different sex within the same group makes it 'Mixed'
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'
        # do you need to accumulate the counts? data_dict[composite_key]['count'] += row[7]
    else:
        data_dict[composite_key] = {
            'series': row[0],
            'species': row[1],
            'platform': row[2],
            'age_range': {int(row[4])},  # ints, so min/max compare numerically
            'tissue': row[5],
            'sex': row[6],
            'count': row[7]
        }

def format_and_print(data_dict, outfile):
    pass
    # you can implement this one :)

if __name__ == "__main__":
    main()
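One possible way to fill in `format_and_print` is sketched below. It assumes each entry has the keys built in `process_row` and that the output should be tab-separated with the header from the question. The helper `format_row`, the example entry, and the temporary output path are all hypothetical additions for illustration.

```python
import os
import tempfile


def format_row(entry):
    """Build one tab-separated output line from a grouped entry."""
    # int() works whether the ages were stored as strings or as integers
    ages = [int(a) for a in entry['age_range']]
    age_range = "%d-%d" % (min(ages), max(ages))
    fields = [entry['series'], entry['species'], entry['platform'],
              age_range, entry['tissue'], entry['sex'], entry['count']]
    return "\t".join(fields)


def format_and_print(data_dict, outfile):
    """Write a header line plus one line per grouped entry to outfile."""
    header = ["Series", "Species", "Platform", "AgeRange", "Tissue", "Sex", "Count"]
    with open(outfile, "w") as out:
        out.write("\t".join(header) + "\n")
        for key in sorted(data_dict):
            out.write(format_row(data_dict[key]) + "\n")


# hypothetical entry in the shape produced by process_row
example = {
    "GSE11291MouseMuscle": {
        'series': "GSE11291", 'species': "Mouse", 'platform': "GPL1261",
        'age_range': {"5", "30"}, 'tissue': "Muscle",
        'sex': "Mixed", 'count': "Count",
    }
}

path = os.path.join(tempfile.mkdtemp(), "summary.tsv")
format_and_print(example, path)
with open(path) as fh:
    print(fh.read())
```

Sorting the keys only makes the output order deterministic; any other ordering would work as well.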