
Python join elements with same index

I have a file with a few thousand lines. I'd like to populate a dictionary line by line, using the gene as the key. If the gene is already in the dictionary, only the "rest" should be appended to its values. I'd like to join the values, for example with commas. This is where I am now.

listfile = {}

with open("Desktop/testfile", "r") as f:
    for lines in f:
        lines=lines.strip()
        gene=lines.split()[0]
        rest = lines.split()[1:]


        if gene not in listfile:
            listfile[gene] = rest
            #print gene, rest
        else:
            for items in rest:

                listfile[gene].append(items)    


for items in listfile.items():
    print items

input:

ACCA    39072094753 D   12
ACCA    983954875454    G   11
ACCA    098540980985    F   22

output:

('ACCA', ['39072094753', 'D', '12', '983954875454', 'G', '11', '098540980985', 'F', '22'])

expected output:

('ACCA', ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22'])

Here is a general solution that works with any number of columns in the input file:

import collections
import itertools

genes_info = collections.defaultdict(list)

with open("testfile") as genes_file:
    for line in genes_file:
        fields = line.split()
        genes_info[fields[0]].append(fields[1:])  # Stores each row information

# Conversion of the row-first gene information into column-first information:
for gene_info in genes_info.itervalues():
    gene_info[:] = itertools.chain(*zip(*gene_info))

print genes_info

gives

{'ACCA': ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22']}

(If you need a plain dictionary instead of the mostly equivalent defaultdict, you can add genes_info = dict(genes_info) at the end.)

If you want to keep the column values together, use the simpler gene_info[:] = zip(*gene_info) instead. This gives:

{'ACCA': [('39072094753', '983954875454', '098540980985'), ('D', 'G', 'F'), ('12', '11', '22')]}

In fact, zip() essentially transforms rows into columns.
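
As a minimal sketch of that row-to-column transformation, the snippet below applies zip(*rows) to the three sample rows from the question, held in memory rather than read from a file. The names rows, columns and flat are only illustrative, and the list() call is there so the result is also a list on Python 3, where zip() returns an iterator.

```python
# Rows as line.split()[1:] would produce them for the sample input.
rows = [
    ["39072094753", "D", "12"],
    ["983954875454", "G", "11"],
    ["098540980985", "F", "22"],
]

# zip(*rows) groups the first items together, then the second items, and so on.
columns = list(zip(*rows))

# Flatten the columns into one list, column by column.
flat = [value for column in columns for value in column]

print(columns)
print(flat)
```

The flattened list has the order asked for in the question: all large numbers first, then the letters, then the small numbers.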

PS: line.split() with no arguments splits on runs of whitespace and ignores leading and trailing whitespace, so the final newline is removed automatically. I therefore simplified my original line.strip().split(), where strip() was unnecessary.
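
A quick way to check this is to split one sample line that still carries its trailing newline. This is only a sketch on a hard-coded string, and the variable names are illustrative.

```python
# One line as it would be read from the file, newline included.
line = "ACCA    39072094753 D   12\n"

# With no arguments, split() treats any run of whitespace as one separator
# and drops leading and trailing whitespace, including the newline.
fields = line.split()
fields_with_strip = line.strip().split()

print(fields)
print(fields == fields_with_strip)
```

Both calls give the same four fields, so the extra strip() changes nothing here.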

I'm guessing you have the same number of space-separated values in each line. If not, izip_longest uses the length of the longest line and pads the shorter ones with None.

from __future__ import print_function
import itertools
listfile = {}

with open("Desktop/testfile", "r") as f:
    for line in f:
        line = line.strip().split()
        gene = line[0]
        rest = line[1:]

        if gene not in listfile:
            listfile[gene] = []
        listfile[gene].append(rest)  # keep each row as its own list

for gene, rows in listfile.items():
    # izip_longest turns the rows into columns, padding short rows with None
    print(gene, list(itertools.chain(*itertools.izip_longest(*rows))))
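
To see what happens when the rows differ in length, here is a small sketch comparing zip with izip_longest (named zip_longest on Python 3, hence the guarded import). The rows are made-up values, not taken from the question's file.

```python
try:
    from itertools import zip_longest  # Python 3 name
except ImportError:
    from itertools import izip_longest as zip_longest  # Python 2 name

# Hypothetical rows: the second one is missing its last value.
rows = [["1", "A", "10"], ["2", "B"]]

# zip_longest keeps every column and fills the gaps with None.
padded = list(zip_longest(*rows))

# Plain zip stops at the shortest row, so the last column is dropped.
truncated = list(zip(*rows))

print(padded)
print(truncated)
```

If None values are not wanted in the result, zip_longest also accepts a fillvalue keyword argument.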

Here's how you do it.

largeNumber = []
letter = []
smallNumber = []

# Collect each of the three value columns into its own list
with open('data.txt', 'r') as openedFile:
    for line in openedFile:
        splittedContent = line.split()
        largeNumber.append(splittedContent[1])
        letter.append(splittedContent[2])
        smallNumber.append(splittedContent[3])

# Assumes every line in the file belongs to the same gene, 'ACCA'
print ('ACCA', largeNumber + letter + smallNumber)

Output:

('ACCA', ['39072094753', '983954875454', '098540980985', 'D', 'G', 'F', '12', '11', '22'])

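If you just need a comma-separated string for the output, you can join the values of each gene (joining listfile.items() directly fails, because the items are tuples, not strings):

for gene, values in listfile.items():
    print gene, ",".join(values)

I think for further processing it would be useful to keep the attributes in a list.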
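
One way to have both is to keep lists in the dictionary and build the comma-separated string only at output time. The sketch below assumes the dictionary already holds the values in the desired order. format_gene is a hypothetical helper, and the tab between gene and values is an arbitrary choice.

```python
def format_gene(gene, values, sep=","):
    """Return one output line: the gene, a tab, then the joined values."""
    return gene + "\t" + sep.join(values)

# Sample dictionary with the values already grouped column by column.
listfile = {
    "ACCA": ["39072094753", "983954875454", "098540980985",
             "D", "G", "F", "12", "11", "22"],
}

lines = [format_gene(gene, values) for gene, values in listfile.items()]
for line in lines:
    print(line)
```

The lists stay available for further processing, and only the printed text is joined.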

Looks like a good use case for a defaultdict:

from collections import defaultdict
listfile = defaultdict(list)

with open("Desktop/testfile", "r") as f:
    all_lines = (l.split() for l in f)
    for line in all_lines:
        first = line[0]
        rest = line[1:]
        listfile[first].extend(rest)  # values are appended row by row
