Python: splitting a file based on a keyword
I have this file:
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284968 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284969 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284970 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284976 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284990 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284991 30 Muscle Male Count
Here you can see two series (GSE11097 and GSE11291), and I want a summary for each series. For each "GSE" number, the output should be a dictionary like this:
Series Species Platform AgeRange Tissue Sex Count
GSE11097 Rat GPL1355 4-6 Liver Mixed Count
GSE11291 Mouse GPL1261 5-10 Heart Male Count
GSE11291 Mouse GPL1261 5-30 Muscle Mixed Count
So, I know one way to do this would be, for example:
import sys

# First pass: collect the distinct series IDs (the header line is skipped)
list_of_series = list(set(line.split()[0] for line in open(sys.argv[1]).readlines()[1:]))
list_of_dicts = []
for each_list in list_of_series:
    temp_dict = {"species": "", "platform": "", "age": [], "tissue": "", "sex": [], "count": ""}
    # Second pass: the whole file is re-read once per series
    for line in open(sys.argv[1]).readlines()[1:]:
        line = line.strip().split()
        if line[0] == each_list:
            temp_dict["species"] = line[1]
            temp_dict["platform"] = line[2]
            temp_dict["age"].append(line[4])
            temp_dict["tissue"] = line[5]
            temp_dict["sex"].append(line[6])
            temp_dict["count"] = line[7]
    list_of_dicts.append(temp_dict)
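I think this is cumbersome in two ways:
I have to read the whole file twice (in reality, the file is much larger than the example here).
The approach overwrites the same dictionary entries with the same values again and again.
There is also a problem with sex: what I want is "if both male and female are present, put 'Mixed' in the dictionary; otherwise put 'Male' or 'Female'".
I can get this code to work, but I would like some quick tips on making it cleaner and more pythonic.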
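The Sex rule described in the question can be written down in a few lines. This is only a minimal sketch, assuming the values for one group have already been collected into a non-empty list; the helper names `summarize_sex` and `summarize_age` are hypothetical and do not appear in the original code.

```python
def summarize_sex(values):
    """Return the single sex if all samples agree, otherwise "Mixed".

    Assumes `values` is non-empty.
    """
    distinct = set(values)
    if len(distinct) == 1:
        return distinct.pop()
    return "Mixed"


def summarize_age(values):
    """Return "min-max"; ages are converted to int so "10" sorts after "5"."""
    ages = [int(v) for v in values]
    return "%d-%d" % (min(ages), max(ages))


print(summarize_sex(["Male", "Female", "Male"]))  # Mixed
print(summarize_sex(["Male", "Male"]))            # Male
print(summarize_age(["5", "30", "10"]))           # 5-30
```

Converting the ages to integers matters because the file is read as text, and comparing the strings "10" and "5" would put "10" first.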
I agree with Max Paymar that this should be done in a query language. If you really want to do it in Python, the pandas module will help a lot.
import pandas as pd

## column widths of the fixed-width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]

## read the fixed-width file into a pandas data frame;
## header=0 assumes the column names are on the first line (as in the
## sample above) and replaces them with `names`
df = pd.read_fwf("your_file.txt", widths=fwidths, header=0,
                 names=["GSENumber", "Species", "Platform", "Sample",
                        "Age", "Tissue", "Sex", "Count"])

## drop the "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)

## group by GSENumber
grouped = df.groupby(df.GSENumber)

## aggregate the columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
                          'Platform': lambda x: list(x.unique()),
                          'Age': lambda x: "%d-%d" % (min(x), max(x)),
                          'Tissue': lambda x: list(x.unique()),
                          'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
                          'Count': lambda x: list(x.unique())})

print(aggregated)
This produces almost the result you asked for, and is much cleaner than parsing the file in pure Python.
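For the query-language route mentioned at the start of this answer, here is a rough sketch using the standard-library `sqlite3` module. It is an illustration under assumptions, not something taken from the original thread: the table name `samples`, the column names, and the `summarize` function are all made up for the example, the input is assumed to be whitespace-separated, and grouping by series and tissue is a guess based on the desired output in the question.

```python
import sqlite3

# A subset of the sample rows from the question (whitespace-separated).
SAMPLE = """\
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
"""


def summarize(text):
    """Load the rows into an in-memory table and aggregate with GROUP BY."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema; `age` is INTEGER so MIN/MAX compare numerically.
    conn.execute(
        "CREATE TABLE samples (series TEXT, species TEXT, platform TEXT, "
        "sample TEXT, age INTEGER, tissue TEXT, sex TEXT, cnt TEXT)"
    )
    rows = [line.split() for line in text.splitlines() if line.strip()]
    conn.executemany("INSERT INTO samples VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    query = """
        SELECT series,
               MIN(species),
               MIN(platform),
               MIN(age) || '-' || MAX(age),
               tissue,
               CASE WHEN COUNT(DISTINCT sex) > 1 THEN 'Mixed' ELSE MIN(sex) END,
               MIN(cnt)
        FROM samples
        GROUP BY series, tissue
        ORDER BY series, tissue
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in summarize(SAMPLE):
    print(" ".join(row))
```

With this subset the query returns three rows (Liver, Heart, Muscle), each with the age range and the Male/Female/Mixed value computed per group. For a large file, the same table could live in an on-disk database instead of `:memory:`.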
import sys


def main():
    with open(sys.argv[1]) as infile:
        data = read_data(infile)
    result = process_rows(data)
    format_and_print(result, sys.argv[2])


def read_data(file):
    # split every line of the already opened file into its columns
    data = [line.strip().split() for line in file]
    data.pop(0)  # remove header
    return data


def process_rows(data):
    data_dict = {}
    for row in data:
        process_row(row, data_dict)
    return data_dict


def process_row(row, data_dict):
    # assuming this is how you are grouping the entries: series + species + tissue
    composite_key = row[0] + row[1] + row[5]
    if composite_key in data_dict:
        data_dict[composite_key]['age_range'].add(row[4])
        # a different sex than the one stored so far means the group is mixed
        if row[6] != data_dict[composite_key]['sex']:
            data_dict[composite_key]['sex'] = 'Mixed'
        # do you need to accumulate the counts? data_dict[composite_key]['count'] += row[7]
    else:
        data_dict[composite_key] = {
            'series': row[0],
            'species': row[1],
            'platform': row[2],
            'age_range': set([row[4]]),
            'tissue': row[5],
            'sex': row[6],
            'count': row[7]
        }


def format_and_print(data_dict, outfile):
    pass
    # you can implement this one :)


if __name__ == "__main__":
    main()
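One possible way to fill in the `format_and_print` stub is sketched below. It assumes the `data_dict` layout built by `process_row` (with `age_range` as a set of age strings). Unlike the stub, which receives a file name, this version writes to an already opened file-like object; that change, the tab separator, and the header line are choices made for the sketch, not part of the original answer.

```python
import sys


def format_and_print(data_dict, out=sys.stdout):
    """Write one tab-separated summary line per grouped entry."""
    header = ["Series", "Species", "Platform", "AgeRange", "Tissue", "Sex", "Count"]
    out.write("\t".join(header) + "\n")
    for key in sorted(data_dict):
        entry = data_dict[key]
        # Ages are stored as strings, so convert before taking min and max.
        ages = sorted(int(age) for age in entry["age_range"])
        age_range = "%d-%d" % (ages[0], ages[-1])
        fields = [entry["series"], entry["species"], entry["platform"],
                  age_range, entry["tissue"], entry["sex"], entry["count"]]
        out.write("\t".join(fields) + "\n")


# Hand-built example entry in the shape produced by process_row.
example = {
    "GSE11291MouseHeart": {
        "series": "GSE11291", "species": "Mouse", "platform": "GPL1261",
        "age_range": {"5", "10"}, "tissue": "Heart", "sex": "Male",
        "count": "Count",
    },
}
format_and_print(example)
```

To write to the path given on the command line, the caller could open it first, for example `with open(sys.argv[2], "w") as out: format_and_print(result, out)`.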