I am trying to build a list of lists. I loop through a file and want to update the list whenever one sublist element (the UTR length) is greater than the value already stored. I have written this code:
targets = open(file)
longest_UTR = []
for line in targets:
    chromosome, locus, mir, gene, transcript, UTR_length = line.strip("\n").split("\t")
    length_as_integer = int(UTR_length)
    if not any(x[:3] == [locus, mir, gene] for x in longest_UTR):
        longest_UTR.append([locus, mir, gene, transcript, length_as_integer])
    elif length_as_integer > [int(x[4]) for x in longest_UTR]:  ## x[4] = previous length_as_integer
        longest_UTR.append([locus, mir, gene, transcript, length_as_integer])
print(longest_UTR)
However, I get this error:
elif len_as_int > (int(x[4]) for x in longest_UTR):
TypeError: '>' not supported between instances of 'int' and 'generator'
How can I convert x[4] to an integer so that I can compare it to length_as_integer?
Thank you
If I get this right, try replacing the elif branch with the following:
    else:
        longest_UTR = [[locus, mir, gene, transcript, length_as_integer]
                       if x[:3] == [locus, mir, gene] and length_as_integer > int(x[4])
                       else x
                       for x in longest_UTR]
This passes over the whole list, replacing the entries that match the condition and leaving the others unchanged.
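To see that replacement in context, here is a minimal, self-contained sketch of the whole loop. It assumes the goal is to keep exactly one entry (the longest transcript) per locus, mir, gene combination. The sample lines and the function name `longest_per_key` are made up for illustration and do not come from the question.

```python
def longest_per_key(lines):
    """Keep one [locus, mir, gene, transcript, length] entry per (locus, mir, gene)."""
    longest_UTR = []
    for line in lines:
        chromosome, locus, mir, gene, transcript, UTR_length = line.strip("\n").split("\t")
        length_as_integer = int(UTR_length)
        key = [locus, mir, gene]
        if not any(x[:3] == key for x in longest_UTR):
            # First time this combination is seen: store it
            longest_UTR.append(key + [transcript, length_as_integer])
        else:
            # Compare against each stored entry individually,
            # not against the whole list at once
            longest_UTR = [
                key + [transcript, length_as_integer]
                if x[:3] == key and length_as_integer > x[4]
                else x
                for x in longest_UTR
            ]
    return longest_UTR


# Hypothetical tab-separated sample data standing in for the input file
sample_lines = [
    "chr1\tlocusA\tmir-1\tgeneX\ttx1\t100\n",
    "chr1\tlocusA\tmir-1\tgeneX\ttx2\t250\n",
    "chr2\tlocusB\tmir-2\tgeneY\ttx3\t80\n",
    "chr1\tlocusA\tmir-1\tgeneX\ttx4\t120\n",
]
result = longest_per_key(sample_lines)
print(result)
```

The original TypeError arises because an int is compared to a whole list (or generator) of ints. Comparing against one stored element at a time, as above, avoids it.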
So, there's been a bit of back and forth regarding your requirements, but my final understanding is this: you are looping over a data set. Each target in this data set has a locus, mir, and gene, as well as a UTR_length attribute. For every unique combination of locus, mir, and gene, you are trying to find all targets that have the maximum UTR_length?
Given that you want to find the maximum value in the dataset, there are two approaches.
1) You could simply convert your input file to a pandas DataFrame, group by your locus, mir, and gene values, and return all rows with max(UTR_length). For ease of implementation this is probably your best bet. However, pandas is not always the right tool and carries a lot of overhead, especially if you want to Dockerise your project.
2) If you want to stick to base Python, I would recommend taking advantage of sets and dictionaries:
list_of_targets = []
with open(file) as targets:
    for line in targets:
        chromosome, locus, mir, gene, transcript, UTR_length = line.strip("\n").split("\t")
        length_as_integer = int(UTR_length)
        # Store the integer length so that max() compares numbers, not strings
        list_of_targets.append((chromosome, locus, mir, gene, transcript, length_as_integer))

# Generate set of unique locus, mir, gene (lmg) combinations
set_of_locus_mri_gene = {(i[1], i[2], i[3]) for i in list_of_targets}

# Generate dictionary of maximum lengths for each distinct lmg combo
dict_of_max_lengths = {lmg: max(target[5] for target in list_of_targets
                                if (target[1], target[2], target[3]) == lmg)
                       for lmg in set_of_locus_mri_gene}

# Generate dictionary with lmg keys and all targets of that lmg with the max length
final_output = {lmg: [target for target in list_of_targets
                      if (target[1], target[2], target[3]) == lmg and target[5] == max_length]
                for lmg, max_length in dict_of_max_lengths.items()}
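The same grouping can also be done in a single pass with one dictionary, which avoids rescanning the full list for every combination. This is only a sketch of that idea, not the code from the answer above. The function name `targets_with_max_length` and the sample records are hypothetical.

```python
def targets_with_max_length(records):
    """Map each (locus, mir, gene) key to all records tied for the maximum length.

    Each record is a tuple:
    (chromosome, locus, mir, gene, transcript, length_as_integer)
    """
    best = {}
    for record in records:
        key = (record[1], record[2], record[3])
        length = record[5]
        if key not in best or length > best[key][0][5]:
            # New key, or a strictly longer UTR: start a fresh list
            best[key] = [record]
        elif length == best[key][0][5]:
            # Tie with the current maximum: keep both records
            best[key].append(record)
    return best


# Hypothetical records standing in for the parsed input file
sample_records = [
    ("chr1", "locusA", "mir-1", "geneX", "tx1", 100),
    ("chr1", "locusA", "mir-1", "geneX", "tx2", 250),
    ("chr1", "locusA", "mir-1", "geneX", "tx3", 250),
    ("chr2", "locusB", "mir-2", "geneY", "tx4", 80),
]
final_output = targets_with_max_length(sample_records)
```

Every list stored in the dictionary only ever holds records of the current maximum length, so the first element is enough to read that maximum.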
Since you want to replace the longest_UTR variable and keep things nicely named, you could use a dictionary instead of a list:
targets = open(file)
longest_UTR = {}
for line in targets:
    chromosome, locus, mir, gene, transcript, UTR_length = line.strip("\n").split("\t")
    length_as_integer = int(UTR_length)
    # Your condition works for initializing the dictionary because of the default value.
    if length_as_integer > longest_UTR.get("Length", -1):
        longest_UTR["Chromosome"] = chromosome
        longest_UTR["Locus"] = locus
        longest_UTR["Mir"] = mir
        longest_UTR["Gene"] = gene
        longest_UTR["Transcript"] = transcript
        longest_UTR["Length"] = length_as_integer
print(longest_UTR)
Edit: here is also the version of the code using a list, in case you are interested in seeing the difference. Personally, I find the dictionary approach cleaner to read.
targets = open(file)
longest_UTR = [None, None, None, None, None, -1]
for line in targets:
    chromosome, locus, mir, gene, transcript, UTR_length = line.strip("\n").split("\t")
    length_as_integer = int(UTR_length)
    # Your condition works for initializing the list because of the default value.
    if length_as_integer > longest_UTR[5]:
        longest_UTR[0] = chromosome
        longest_UTR[1] = locus
        longest_UTR[2] = mir
        longest_UTR[3] = gene
        longest_UTR[4] = transcript
        longest_UTR[5] = length_as_integer
print(longest_UTR)
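If only the single longest record in the whole file is needed, as in both versions above, the built-in max() with a key function is another option. A small sketch with made-up sample lines (the names `parse_line` and `sample_lines` are illustrative and not from the question):

```python
def parse_line(line):
    """Split one tab-separated line and convert the length field to int."""
    chromosome, locus, mir, gene, transcript, UTR_length = line.strip("\n").split("\t")
    return [chromosome, locus, mir, gene, transcript, int(UTR_length)]


# Hypothetical tab-separated sample data standing in for the input file
sample_lines = [
    "chr1\tlocusA\tmir-1\tgeneX\ttx1\t100\n",
    "chr1\tlocusA\tmir-1\tgeneX\ttx2\t250\n",
    "chr2\tlocusB\tmir-2\tgeneY\ttx3\t80\n",
]

# max() compares records by their length field (index 5)
longest_UTR = max((parse_line(line) for line in sample_lines),
                  key=lambda record: record[5])
print(longest_UTR)
```

Note that max() raises a ValueError on empty input unless a default is supplied, whereas the loop versions simply leave the initial placeholder in place.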