Let's say I have a file that looks like this:
Gene.name | Experiment.1 | Experiment.2 |
---|---|---|
A1BG | 0.031474 | 0.05776 |
ZNF621 | 0.091025 | 0.33516 |
ARHGAP12 | 0.97852 | 0.14098 |
and so on…
And another file that looks like this:
Gene Name | Gene description | Chromosome number | Chromosome location |
---|---|---|---|
A1BG | alpha-1B-glycoprotein | 19 | 19q13.43 |
A2M | alpha-2-macroglobulin | 12 | 12p13.31 |
A3M | alpha-3-macroglobulin | 12 | 12p13.33 |
and so on…
I have made 2 dictionaries, one that matches the gene name (key) with the gene annotation/description (value), and another that matches the gene name (key) with gene chromosome number (value).
My goal is to make an output file where I take the first table (the one with Experiment.1 and Experiment.2 as the columns) and append the gene description and chromosome number to the table for each gene, using the dictionaries I have created. So essentially, this would produce an output file in the following format, with one row for every gene in the first file. If a gene is not present in the annotation file, the last 2 fields should be N/A (like the second row in the example below).
Gene.name | Experiment.1 | Experiment.2 | Gene description | Chromosome number |
---|---|---|---|---|
A1BG | 0.031474 | 0.05776 | alpha-1B-glycoprotein | 19 |
ZNF621 | 0.091025 | 0.33516 | N/A | N/A |
I have set my dictionaries up in the following manner:
infile = open("human_gene_annotations.txt", "rt")
#separate header
gene_header = infile.readline()
#gene annotation dict
gene_annotations = {}
#use for loop to fill
for line in infile:
    line = line.rstrip()
    information = line.split("\t")
    gene_annotations[information[0]] = {"Gene Description": information[1]}
#close infile
infile.close()
#open infile again for second dictionary
infile = open("human_gene_annotations.txt", "rt")
#separate header
gene_header = infile.readline()
#gene chroms dict
gene_chroms = {}
#use for loop to fill
for line in infile:
    line = line.rstrip()
    info_chrom = line.split("\t")
    gene_chroms[info_chrom[0]] = {"Chromosome Number": info_chrom[2]}
#close infile
infile.close()
I have parsed the data from the first table (the one from the experiments) into lists like so:
genes = []
exp1values = []
exp2values = []
for line in infile:
    line = line.rstrip()
    fields = line.split("\t") # this will split the line we read by tab, thus by "column"
    genes.append(fields[0])
    exp1values.append(fields[1])
    exp2values.append(fields[2])
Why not create a dictionary for the first table as well?
I am reusing the code block you wrote for the second table, with one exception: as the value in the gene description dictionary and in the chromosome number dictionary, I store just the description or the number itself rather than a nested dictionary with a label.
infile = open("human_gene_annotations.txt", "rt")
#separate header
gene_header = infile.readline()
#gene annotation dict
gene_annotations = {}
#use for loop to fill
for line in infile:
    line = line.rstrip()
    information = line.split("\t")
    gene_annotations[information[0]] = information[1]
#close infile
infile.close()
#open infile again for second dictionary
infile = open("human_gene_annotations.txt", "rt")
#separate header
gene_header = infile.readline()
#gene chroms dict
gene_chroms = {}
#use for loop to fill
for line in infile:
    line = line.rstrip()
    info_chrom = line.split("\t")
    gene_chroms[info_chrom[0]] = info_chrom[2]
#close infile
infile.close()
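Since both dictionaries come from the same file, they can also be filled in a single pass. The following is only a sketch under a few assumptions: the file is tab-separated with one header row, the columns are in the order shown in the question, and `load_annotations` is a helper name made up for this example. It reads from any file-like object, so an in-memory sample stands in for `human_gene_annotations.txt` here.

```python
import csv
import io

def load_annotations(handle):
    """Return (descriptions, chromosomes), two dicts keyed by gene name.

    Hypothetical helper: assumes tab-separated input with a header row and
    the column order gene name, description, chromosome number, location.
    """
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    descriptions = {}
    chromosomes = {}
    for row in reader:
        if len(row) < 3:
            continue  # skip blank or truncated lines
        descriptions[row[0]] = row[1]
        chromosomes[row[0]] = row[2]
    return descriptions, chromosomes

# Small in-memory stand-in for the annotation file
sample = io.StringIO(
    "Gene Name\tGene description\tChromosome number\tChromosome location\n"
    "A1BG\talpha-1B-glycoprotein\t19\t19q13.43\n"
    "A2M\talpha-2-macroglobulin\t12\t12p13.31\n"
)
gene_annotations, gene_chroms = load_annotations(sample)
print(gene_annotations["A1BG"], gene_chroms["A1BG"])
```

With a real file you would pass an open handle instead, for example `with open("human_gene_annotations.txt", newline="") as fh: load_annotations(fh)`.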
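Now for the first table I will make another 2 dictionaries for the two experiments:
#open the experiment file ("experiments.txt" is a placeholder, use your own file name)
infile = open("experiments.txt", "rt")
#separate header so it does not end up in the dictionaries
exp_header = infile.readline()
exp1map = {}
exp2map = {}
for line in infile:
    line = line.rstrip()
    fields = line.split("\t") # this will split the line we read by tab, thus by "column"
    exp1map[fields[0]] = fields[1]
    exp2map[fields[0]] = fields[2]
#close infile
infile.close()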
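An alternative to two parallel dictionaries is a single dictionary that maps each gene to both experiment values. This is a sketch, not a drop-in replacement: `load_experiments` is a hypothetical helper name, and it assumes a tab-separated file with one header row and at least three columns per data row. Because dictionaries keep insertion order in Python 3.7 and later, the genes stay in the order they appear in the file.

```python
import io

def load_experiments(handle):
    """Map each gene name to an (exp1, exp2) tuple, in file order.

    Hypothetical helper: assumes tab-separated input with one header row.
    """
    handle.readline()  # skip the header row
    experiments = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or truncated lines
        experiments[fields[0]] = (fields[1], fields[2])
    return experiments

# Small in-memory stand-in for the experiment file
sample = io.StringIO(
    "Gene.name\tExperiment.1\tExperiment.2\n"
    "A1BG\t0.031474\t0.05776\n"
    "ZNF621\t0.091025\t0.33516\n"
)
experiments = load_experiments(sample)
print(experiments["ZNF621"])
```

The values are kept as strings, which is enough for copying them into an output file; convert with `float()` only if you need to compute with them.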
I do not know exactly how you want the output table formatted, but I am assuming you want to write the data to a tab-separated file:
#Create a unique set of all genes from both table 1 and table 2
all_genes = set().union(*[exp1map.keys(),gene_annotations.keys()])
with open('output', 'w') as f:
    f.write('\t'.join(['Gene.name','Experiment.1','Experiment.2','Gene description','Chromosome number']) + '\n')
    #sorted() gives a reproducible row order, since sets are unordered
    for gene in sorted(all_genes):
        #fall back to 'N/A' when a gene is missing from one of the tables
        exp1=exp1map.get(gene,'N/A')
        exp2=exp2map.get(gene,'N/A')
        desc=gene_annotations.get(gene,'N/A')
        chrom=gene_chroms.get(gene,'N/A')
        f.write('\t'.join([gene,exp1,exp2,desc,chrom]) + '\n')
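If you only want rows for the genes in the experiment table (which is how I read the example output in the question), you can loop over that table instead of over the union of both. Below is a self-contained sketch of that variant. The function name `write_merged`, its parameters, and the `missing` default are my own choices for illustration, and the sample data is just the few rows from the question.

```python
import csv
import io

def write_merged(experiments, descriptions, chromosomes, out, missing="N/A"):
    """Write one tab-separated row per gene in `experiments`.

    experiments:  dict mapping gene -> (exp1, exp2)
    descriptions: dict mapping gene -> description
    chromosomes:  dict mapping gene -> chromosome number
    Genes without an annotation get `missing` in the last two columns.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene.name", "Experiment.1", "Experiment.2",
                     "Gene description", "Chromosome number"])
    for gene, (exp1, exp2) in experiments.items():
        writer.writerow([gene, exp1, exp2,
                         descriptions.get(gene, missing),
                         chromosomes.get(gene, missing)])

# Sample rows taken from the tables in the question
experiments = {"A1BG": ("0.031474", "0.05776"),
               "ZNF621": ("0.091025", "0.33516")}
descriptions = {"A1BG": "alpha-1B-glycoprotein", "A2M": "alpha-2-macroglobulin"}
chromosomes = {"A1BG": "19", "A2M": "12"}

buffer = io.StringIO()
write_merged(experiments, descriptions, chromosomes, buffer)
print(buffer.getvalue())
```

To write to disk, pass a real file handle, for example `with open("output.tsv", "w", newline="") as f: write_merged(..., f)`; the file name is a placeholder.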
I could not test my code as I did not have the dataset, but I think it solves your problem. Let me know if you need more help or if I have made a mistake.