[英]python: splitting a file based on a key word
我有这个文件:
GSENumber Species Platform Sample Age Tissue Sex Count
GSE11097 Rat GPL1355 GSM280267 4 Liver Male Count
GSE11097 Rat GPL1355 GSM280268 4 Liver Female Count
GSE11097 Rat GPL1355 GSM280269 6 Liver Male Count
GSE11097 Rat GPL1355 GSM280409 6 Liver Female Count
GSE11291 Mouse GPL1261 GSM284967 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284968 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284969 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284970 5 Heart Male Count
GSE11291 Mouse GPL1261 GSM284975 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284976 10 Heart Male Count
GSE11291 Mouse GPL1261 GSM284987 5 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284988 5 Muscle Female Count
GSE11291 Mouse GPL1261 GSM284989 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284990 30 Muscle Male Count
GSE11291 Mouse GPL1261 GSM284991 30 Muscle Male Count
您可以在此处看到两个系列(GSE11097和GSE11291),我希望每个系列都有一个摘要。 对于每个“ GSE”编号,输出应该是这样的字典:
Series Species Platform AgeRange Tissue Sex Count
GSE11097 Rat GPL1355 4-6 Liver Mixed Count
GSE11291 Mouse GPL1261 5-10 Heart Male Count
GSE11291 Mouse GPL1261 5-30 Muscle Mixed Count
因此,我知道一种实现方法是:
例如
import sys
list_of_series = list(set([line.strip().split()[0] for line in open(sys.argv[1])]))
list_of_dicts = []
for each_list in list_of_series:
temp_dict={"species":"","platform":"","age":[],"tissue":"","Sex":[],"Count":""}
for line in open(sys.argv[1]).readlines()[1:]:
line = line.strip().split()
if line[0] == each_list:
temp_dict["species"] = line[1]
temp_dict["platform"] = line[2]
temp_dict["age"].append(line[4])
temp_dict["tissue"] = line[5]
temp_dict["sex"].append(line[6])
temp_dict["count"] = line[7]
我认为这在两个方面都很麻烦:
我必须读取整个文件两次(实际上,文件比此处的示例大得多)
该方法可以用相同的单词重写相同的字典条目。
另外,性别存在问题,我想说“如果男性和女性都在字典中放入“混合”,否则,放入“男性”或“女性”。
我可以使此代码正常工作,但是我想知道使代码更简洁/更pythonic的快速提示吗?
我同意Max Paymar的看法,这应该以查询语言完成。 如果您真的想用Python做到这一点,pandas模块将大有帮助。
import pandas as pd
## columns widths of the fixed width format
fwidths = [12, 8, 8, 12, 4, 8, 8, 5]
## read fixed width format into pandas data frame
df = pd.read_fwf("your_file.txt", widths=fwidths, header=1,
names=["GSENumber", "Species", "Platform", "Sample",
"Age", "Tissue", "Sex", "Count"])
## drop "Sample" column as it is not needed in the output
df = df.drop("Sample", axis=1)
## group by GSENumber
grouped = df.groupby(df.GSENumber)
## aggregate columns for each group
aggregated = grouped.agg({'Species': lambda x: list(x.unique()),
'Platform': lambda x: list(x.unique()),
'Age': lambda x: "%d-%d" % (min(x), max(x)),
'Tissue': lambda x: list(x.unique()),
'Sex': lambda x: "Mixed" if x.nunique() > 1 else list(x.unique()),
'Count': lambda x: list(x.unique())})
print aggregated
这几乎产生了您所要求的结果,并且比纯Python解析文件要干净得多。
import sys
def main():
data = read_data(open(sys.argv[1]))
result = process_rows(data)
format_and_print(result, sys.argv[2])
def read_data(file):
data = [line.strip().split() for line in open(sys.argv[1])]
data.pop(0) # remove header
return data
def process_rows(data):
data_dict = {}
for row in data:
process_row(row, data_dict)
return data_dict
def process_row(row, data_dict):
composite_key = row[0] + row[1] + row[5] #assuming this is how you are grouping the entries
if composite_key in data_dict:
data_dict[composite_key]['age_range'].add(row[4])
if row[5] != data_dict[composite_key]:
data_dict[composite_key]['sex'] = 'Mixed'
#do you need to accumulate the counts? data_dict[composite_key]['count']+=row[6]
else:
data_dict[composite_key] = {
'series': row[0],
'species': row[1],
'platform': row[2],
'age_range': set([row[4]]),
'tissue': row[5],
'sex': row[6],
'count': row[7]
}
def format_and_print(data_dict, outfile):
pass
#you can implement this one :)
if __name__ == "__main__":
main()
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.