Python join elements with same index

Question

I have a file with a few thousand lines, and I'd like to populate a dictionary from it line by line, using the gene (the first field) as the key. If the gene is already in the dictionary, only the rest of the line should be appended to its values. I'd like to join the values, for example with a comma. This is where I am now:

listfile = {}

with open("Desktop/testfile", "r") as f:
    for lines in f:
        lines = lines.strip()
        gene = lines.split()[0]
        rest = lines.split()[1:]

        if gene not in listfile:
            listfile[gene] = rest
        else:
            for items in rest:
                listfile[gene].append(items)

for items in listfile.items():
    print items

input:

ACCA    39072094753 D   12
ACCA    983954875454    G   11
ACCA    098540980985    F   22

output:

('ACCA', ['39072094753', 'D', '12', '983954875454', 'G', '11', '098540980985', 'F', '22'])

expected output:

('ACCA', ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22'])

Answer 1

Here is a general solution that works with any number of columns in the input file:

import collections
import itertools

genes_info = collections.defaultdict(list)

with open("testfile") as genes_file:
    for line in genes_file:
        fields = line.split()
        genes_info[fields[0]].append(fields[1:])  # Stores each row information

# Conversion of the row-first gene information into column-first information:
for gene_info in genes_info.itervalues():
    gene_info[:] = itertools.chain(*zip(*gene_info))

print genes_info

gives

{'ACCA': ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22']}

(If you need a plain dictionary instead of the mostly equivalent defaultdict, you can add genes_info = dict(genes_info) at the end.)
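The code above targets Python 2 (itervalues() and the print statement). For readers on Python 3, the same idea might be sketched roughly as follows. The function name group_columns and the in-memory sample_lines list are hypothetical, chosen only so the example runs without a file on disk:

```python
import collections
import itertools


def group_columns(lines):
    """Group rows by their first field, then reorder the values column by column."""
    rows_by_gene = collections.defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines (an assumption; the original does not handle them)
            continue
        rows_by_gene[fields[0]].append(fields[1:])
    # zip(*rows) turns rows into columns; chain.from_iterable flattens the columns.
    return {
        gene: list(itertools.chain.from_iterable(zip(*rows)))
        for gene, rows in rows_by_gene.items()
    }


# Hypothetical stand-in for the lines of the input file
sample_lines = [
    "ACCA    39072094753 D   12\n",
    "ACCA    983954875454    G   11\n",
    "ACCA    098540980985    F   22\n",
]
result = group_columns(sample_lines)
print(result)
```

Passing an open file object instead of sample_lines should work the same way, since the function only iterates over lines.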

If you want to keep column values together, use the simpler gene_info[:] = zip(*gene_info) instead. This gives:

{'ACCA': [('39072094753', '983954875454', '098540980985'), ('D', 'G', 'F'), ('12', '11', '22')]}

In fact, zip(*gene_info) essentially transposes the data: it transforms rows into columns.
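To see the transposition in isolation, here is a small sketch (written for Python 3, where zip() returns an iterator and so is wrapped in list()). The rows list is just the sample data from the question typed in by hand:

```python
rows = [
    ["39072094753", "D", "12"],
    ["983954875454", "G", "11"],
    ["098540980985", "F", "22"],
]

# Unpacking the rows with * hands each row to zip() as a separate argument,
# so zip() pairs up the first items, then the second items, and so on.
columns = list(zip(*rows))
print(columns)

# Transposing a second time gets back the original rows (as tuples).
round_trip = [list(row) for row in zip(*columns)]
```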

PS: line.split() with no argument splits on runs of whitespace and ignores leading and trailing whitespace, so the final newline is in effect removed automatically. I therefore simplified my original line.strip().split(), where strip() was unnecessary.
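A quick way to check this behaviour of split() yourself (the sample line below is a hand-typed copy of the first input row, with a trailing newline added):

```python
line = "ACCA    39072094753 D   12\n"

no_arg = line.split()                  # splits on runs of whitespace
stripped_first = line.strip().split()  # same result, strip() adds nothing
single_space = line.split(" ")         # explicit separator behaves differently

print(no_arg)
print(no_arg == stripped_first)
print(single_space)  # keeps empty strings and the trailing newline
```

With an explicit separator such as " ", the empty strings between consecutive spaces and the final "\n" are kept, which is why the no-argument form is the convenient one here.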

Answer 2

I'm guessing you have the same number of space-separated values on each line. If not, the longest line determines how many columns are produced, and shorter lines are padded with None.

from __future__ import print_function 
import itertools
listfile = {}

with open("Desktop/testfile", "r") as f:
    for line in f:
        line = line.strip().split()
        gene = line[0]
        rest = line[1:]

        if gene not in listfile:
            listfile[gene] = []
        listfile[gene].append(rest)

# Transpose each gene's rows into columns, then flatten them into one list
for gene, rows in listfile.items():
    print(gene, list(itertools.chain(*itertools.izip_longest(*rows))))
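To illustrate what happens with lines of unequal length, here is a small Python 3 sketch (in Python 3, izip_longest is named itertools.zip_longest). The short second row is made up for the example:

```python
import itertools

rows = [
    ["39072094753", "D", "12"],
    ["983954875454", "G"],  # hypothetical short row, missing its last field
]

# zip_longest pads the short row with None (or with fillvalue if given)
padded = list(itertools.chain.from_iterable(itertools.zip_longest(*rows)))
with_fill = list(
    itertools.chain.from_iterable(itertools.zip_longest(*rows, fillvalue=""))
)

# Plain zip stops at the shortest row, silently dropping the extra '12'
truncated = list(itertools.chain.from_iterable(zip(*rows)))

print(padded)
print(with_fill)
print(truncated)
```

Whether padding or truncation is preferable depends on the data; if None values would cause trouble later, fillvalue lets you pick a placeholder.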

Answer 3

Here's one way to do it, assuming the file contains a single gene (ACCA) and exactly three value columns:

largeNumber = []
letter = []
smallNumber = []

# Collect each column in its own list; the with block closes the file
with open('data.txt', 'r') as openedFile:
    for line in openedFile:
        splittedContent = line.split()
        largeNumber.append(splittedContent[1])
        letter.append(splittedContent[2])
        smallNumber.append(splittedContent[3])

print ('ACCA', largeNumber + letter + smallNumber)

Output:

('ACCA', ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22'])

Answer 4

If you just need a comma-separated string for the output, you can do:

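# join() needs strings, so join each gene's list of values separately
for gene, values in listfile.items():
    print gene + "," + ",".join(values)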

For further processing, though, I think it would be useful to keep the attributes in a list.
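As a sketch of how the list can be kept for processing while a comma-separated string is built only for display (Python 3 syntax; the listfile contents are typed in by hand here rather than read from the file):

```python
# Hand-written stand-in for the dictionary built from the file
listfile = {
    "ACCA": ["39072094753", "983954875454", "098540980985",
             "D", "G", "F", "12", "11", "22"],
}

# str.join() only accepts strings, so each gene's list is joined on its own;
# the lists themselves stay untouched for later use.
output_lines = [gene + "," + ",".join(values) for gene, values in listfile.items()]

for output_line in output_lines:
    print(output_line)
```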

Answer 5

Looks like a good use case for a defaultdict:

from collections import defaultdict

listfile = defaultdict(list)  # missing keys start out as an empty list

with open("Desktop/testfile", "r") as f:
    all_lines = (l.split() for l in f)
    for line in all_lines:
        first = line[0]
        rest = line[1:]
        listfile[first].extend(rest)
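For what it's worth, here is a small sketch of how defaultdict(list) behaves with extend() versus append(). The rows list is made-up sample data, and the names flat and nested are only for illustration:

```python
from collections import defaultdict

# Hypothetical pre-split rows: (gene, remaining fields)
rows = [
    ("ACCA", ["39072094753", "D", "12"]),
    ("ACCA", ["983954875454", "G", "11"]),
]

flat = defaultdict(list)
nested = defaultdict(list)
for gene, rest in rows:
    flat[gene].extend(rest)    # one flat list, values in file (row) order
    nested[gene].append(rest)  # one sub-list per row, usable with zip(*...)

print(flat["ACCA"])
print(nested["ACCA"])
print("OTHER" in flat)  # a membership test does not create the key
```

Note that extend() keeps the values in row order; if column order is needed, the per-row sub-lists can be transposed with zip(*nested[gene]) afterwards.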

5 answers

solution1
1 ACCEPTED 2015-03-23 08:58:00

solution2
1 2015-03-23 09:01:55

solution3
0 2015-03-23 09:04:29

solution4
-1 2015-03-23 08:53:02

solution5
-1 2015-03-23 09:04:17
