I have a file with a few thousand lines, and I'd like to populate a dictionary from it line by line. The gene (the first field) can work as the key. If the gene is already in the dictionary, only the "rest" of the line should be appended to its values. I'd like to join the values, for example with commas. This is where I am now:
listfile = {}
with open("Desktop/testfile", "r") as f:
    for lines in f:
        lines = lines.strip()
        gene = lines.split()[0]
        rest = lines.split()[1:]
        if gene not in listfile:
            listfile[gene] = rest
            #print gene, rest
        else:
            for items in rest:
                listfile[gene].append(items)
for items in listfile.items():
    print items
input:
ACCA 39072094753 D 12
ACCA 983954875454 G 11
ACCA 098540980985 F 22
output:
('ACCA', ['39072094753', 'D', '12', '983954875454', 'G', '11', '098540980985', 'F', '22'])
expected output:
('ACCA', ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22'])
Here is a general solution that works with any number of columns in the input file:
import collections
import itertools

genes_info = collections.defaultdict(list)

with open("testfile") as genes_file:
    for line in genes_file:
        fields = line.split()
        genes_info[fields[0]].append(fields[1:])  # Stores each row information

# Conversion of the row-first gene information into column-first information:
for gene_info in genes_info.itervalues():
    gene_info[:] = itertools.chain(*zip(*gene_info))

print genes_info
gives
{'ACCA': ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22']}
(If you need a dictionary instead of a mostly equivalent defaultdict, you can add genes_info = dict(genes_info) at the end.)
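As a small illustration of why the two are only "mostly" equivalent, here is a sketch of the one difference that tends to matter in practice. The keys "MISSING" and "OTHER" are hypothetical and only there for the demonstration.

```python
import collections

genes_info = collections.defaultdict(list)
genes_info["ACCA"].append(["39072094753", "D", "12"])

# Merely reading a missing key on a defaultdict creates it with an empty list.
# ("MISSING" is a made-up key used only for this demonstration.)
genes_info["MISSING"]
keys_after_lookup = sorted(genes_info)

# Converting to a plain dict restores the usual KeyError for unknown keys.
plain = dict(genes_info)
try:
    plain["OTHER"]
    raised = False
except KeyError:
    raised = True

print(keys_after_lookup)
print(raised)
```

So if later code looks up genes that may not exist, converting to a plain dict avoids silently adding empty entries.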
If you want to keep column values together, use the simpler gene_info[:] = zip(*gene_info) instead. This gives:
{'ACCA': [('39072094753', '983954875454', '098540980985'), ('D', 'G', 'F'), ('12', '11', '22')]}
In fact, zip() essentially transforms rows into columns.
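To see the transposition on its own, here is a minimal sketch using the three sample rows from the question, hard-coded instead of read from a file. The names rows, columns and flat are illustrative only. It is written to run on Python 3 as well, where zip() returns an iterator, hence the list() call.

```python
import itertools

# The three sample rows from the question, with the gene name already removed.
rows = [
    ["39072094753", "D", "12"],
    ["983954875454", "G", "11"],
    ["098540980985", "F", "22"],
]

# zip(*rows) groups the first items of every row, then the second items, etc.
columns = list(zip(*rows))

# Chaining the column tuples gives the flat, column-first list.
flat = list(itertools.chain(*columns))

print(columns)
print(flat)
```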
PS: called with no arguments, line.split() splits on runs of whitespace and discards leading and trailing whitespace, so the final newline is in effect removed automatically. I simplified my original line.strip().split(), where strip() was therefore unnecessary.
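A quick check of that claim, as a sketch on one sample line. The last line shows a caveat: it only holds when split() is called without an explicit separator.

```python
line = "ACCA 39072094753 D 12\n"

# With no arguments, split() ignores leading and trailing whitespace,
# including the final newline.
fields = line.split()
fields_stripped = line.strip().split()

print(fields)
print(fields == fields_stripped)

# With an explicit separator the newline stays attached to the last field.
fields_by_space = line.split(" ")
print(fields_by_space)
```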
I'm guessing you have the same number of space-separated values on each line. If not, izip_longest() uses the longest line to decide the number of columns and pads the shorter ones with None.
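from __future__ import print_function
import itertools

listfile = {}
with open("Desktop/testfile", "r") as f:
    for line in f:
        line = line.strip().split()
        gene = line[0]
        rest = line[1:]
        if gene not in listfile:
            listfile[gene] = []
        listfile[gene].append(rest)  # keep each row as its own list

for gene, rows in listfile.items():
    # Transpose the rows into columns, then flatten them into one list.
    # (izip_longest is the Python 2 name; in Python 3 it is zip_longest.)
    print(gene, list(itertools.chain(*itertools.izip_longest(*rows))))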
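The difference between plain zip() and the "longest" variant only shows up with ragged input. Here is a sketch on hypothetical data where the second row is missing its last field. The import fallback is an assumption made so the snippet runs on both Python 2 and Python 3.

```python
import itertools

try:
    from itertools import zip_longest  # Python 3
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2

# Hypothetical ragged input: the second row lacks its last field.
rows = [
    ["39072094753", "D", "12"],
    ["983954875454", "G"],
    ["098540980985", "F", "22"],
]

# zip() stops at the shortest row, so the third column is lost entirely.
truncated = list(zip(*rows))

# zip_longest() keeps every column and pads short rows with None.
padded = list(zip_longest(*rows))

# Flatten the columns and drop the None padding.
flat = [item for item in itertools.chain(*padded) if item is not None]

print(truncated)
print(padded)
print(flat)
```

Whether dropping the padding is right depends on what you do next. If column positions matter later on, keeping the None values (or passing a fillvalue) may be the safer choice.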
Here's how you do it.
largeNumber = []
letter = []
smallNumber = []

# Assumes every line belongs to 'ACCA' and has exactly three value columns.
# The with statement closes the file automatically.
with open('data.txt', 'r') as openedFile:
    for line in openedFile:
        splittedContent = line.split()
        largeNumber.append(splittedContent[1])
        letter.append(splittedContent[2])
        smallNumber.append(splittedContent[3])

print ('ACCA', largeNumber + letter + smallNumber)
Output:
('ACCA', ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22'])
If you just need a comma-separated string for the output, you can join each gene's values:
for gene, values in listfile.items():
    print gene, ",".join(values)
I think for further processing it would be useful to keep the attributes in a list.
Looks like a good use case for a defaultdict:
from collections import defaultdict

listfile = defaultdict(list)
with open("Desktop/testfile", "r") as f:
    all_lines = (l.split() for l in f)
    for line in all_lines:
        first = line[0]
        rest = line[1:]
        listfile[first].extend(rest)
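One thing worth checking with this approach: extend() appends each row's values in the order they are read, so the result is row-first, like the "output" in the question rather than the "expected output". Here is a sketch on the sample data, using io.StringIO as a stand-in for the file (an assumption, so the snippet is self-contained). The name row_order is illustrative.

```python
from collections import defaultdict
import io

# In-memory stand-in for the file on disk (assumption for this sketch).
sample = io.StringIO(
    u"ACCA 39072094753 D 12\n"
    u"ACCA 983954875454 G 11\n"
    u"ACCA 098540980985 F 22\n"
)

row_order = defaultdict(list)
for line in sample:
    fields = line.split()
    # extend() adds the values of this row right after the previous row's.
    row_order[fields[0]].extend(fields[1:])

print(dict(row_order))
```

To get the column-first order, the rows would have to be stored separately with append() and then transposed with zip(), as in the earlier answers.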