Python - How to add column elements to a list with a counter

In a text file with two columns, I have rows like the following:

N     20
CA    20
C     20
O     20
CB    20
CG    20
CD    20
CE    20
NZ    20
N     21
CA    21
C     21
O     21
CB    21
SG    21

I created a nested dictionary in this way:

r_list = ['20', '21']
dictionary = {'C': {}}
for r in r_list:
    # each key gets its own, separate empty dictionary
    dictionary['C'][r] = {}

print dictionary

"""output:

{'C': {'20': {}, '21': {}}}

equal to:

dictionary = {'C': {
                    '20': {},
                    '21': {}
                }
            }
"""

Now, how can I split the first column of the text file according to the value in the second column? I would like to append the elements of the first column to a new list for as long as the second column reads '20'; when the second column changes to '21', the elements of the first column that belong to '21' should go into another new list, and so on. I could then use these sublists in the same way as "r_list", with further nested dictionaries, to obtain a final structure such as the following:

sublist_1 = ['N', 'CA', 'C', 'O', 'CB', 'CG', 'CD', 'CE', 'NZ']
sublist_2 = ['N', 'CA', 'C', 'O', 'CB', 'SG']

dictionary =    {'C' : {
                    '20': {
                        'N': {},
                        'CA': {},
                        'C': {},
                        'O': {},
                        'CB': {},
                        'CG': {},
                        'CD': {},
                        'CE': {},
                        'NZ': {}
                    },
                    '21': {
                        'N': {},
                        'CA': {},
                        'C': {},
                        'O': {},
                        'CB': {},
                        'SG': {}
                    }
                }
            }

How to do that?

Thanks a lot,

Riccardo

EDIT:

I applied all the solutions to an original cif file successfully. However, in some cif files the "label_atom_id" column (the third column, counting from zero) contains quoted values for some atoms, such as "O5'" in the eighth row below, and the quotes remain in the dictionary:

ATOM   588  O  O4    . DT  B 2 10 ? 33.096 42.342 26.554 1.00 4.81  ? ? ? ? ? ? 29  DT  E O4    1 
ATOM   589  C  C5    . DT  B 2 10 ? 32.273 42.719 24.308 1.00 8.22  ? ? ? ? ? ? 29  DT  E C5    1 
ATOM   590  C  C7    . DT  B 2 10 ? 33.654 42.972 23.700 1.00 10.91 ? ? ? ? ? ? 29  DT  E C7    1 
ATOM   591  C  C6    . DT  B 2 10 ? 31.207 42.767 23.502 1.00 2.00  ? ? ? ? ? ? 29  DT  E C6    1 
ATOM   592  P  P     . DG  B 2 11 ? 25.446 44.301 21.417 1.00 28.24 ? ? ? ? ? ? 30  DG  E P     1 
ATOM   593  O  OP1   . DG  B 2 11 ? 24.109 43.692 21.128 1.00 19.20 ? ? ? ? ? ? 30  DG  E OP1   1 
ATOM   594  O  OP2   . DG  B 2 11 ? 26.212 45.060 20.381 1.00 24.94 ? ? ? ? ? ? 30  DG  E OP2   1 
ATOM   595  O  "O5'" . DG  B 2 11 ? 25.303 45.130 22.804 1.00 27.92 ? ? ? ? ? ? 30  DG  E "O5'" 1 
ATOM   596  C  "C5'" . DG  B 2 11 ? 24.694 44.453 23.923 1.00 19.87 ? ? ? ? ? ? 30  DG  E "C5'" 1 
ATOM   597  C  "C4'" . DG  B 2 11 ? 25.160 44.958 25.273 1.00 19.56 ? ? ? ? ? ? 30  DG  E "C4'" 1 
ATOM   598  O  "O4'" . DG  B 2 11 ? 26.506 44.513 25.519 1.00 22.77 ? ? ? ? ? ? 30  DG  E "O4'" 1 
ATOM   599  C  "C3'" . DG  B 2 11 ? 25.135 46.521 25.375 1.00 19.23 ? ? ? ? ? ? 30  DG  E "C3'" 1 
ATOM   600  O  "O3'" . DG  B 2 11 ? 24.620 46.792 26.672 1.00 20.19 ? ? ? ? ? ? 30  DG  E "O3'" 1 
ATOM   601  C  "C2'" . DG  B 2 11 ? 26.605 46.795 25.327 1.00 18.78 ? ? ? ? ? ? 30  DG  E "C2'" 1 
ATOM   602  C  "C1'" . DG  B 2 11 ? 27.116 45.634 26.159 1.00 21.24 ? ? ? ? ? ? 30  DG  E "C1'" 1 
ATOM   603  N  N9    . DG  B 2 11 ? 28.583 45.580 26.153 1.00 21.14 ? ? ? ? ? ? 30  DG  E N9    1

I tried to remove them from the file, so that only O5 remains, but without success. This is what I tried:

with open(filename,"r") as f:
    lines = f.readlines()

for line in lines:
    column = line.split(None)
    atom = column[3]
    #print atom
    no_double_quotes = atom.replace('"', "").strip()
    #print no_double_quotes
    atom_cleaned = no_double_quotes.replace("'", "").strip()
    atom = atom_cleaned
    print atom

# and write everything back
with open(filename, 'w') as f:
    f.writelines(lines)

The console output is correct, but nothing changes in the file that is parsed for the dictionary. Is there a working, more efficient method?
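
A likely reason nothing changes: inside the loop, `atom` is only a local name that gets rebound, while the strings in `lines` are never modified, so `writelines(lines)` writes the original content back. The following is a minimal sketch of one way to carry the cleaned value back into each line. The function name `clean_atom_column` and the shortened sample rows are hypothetical. The sketch assumes whitespace-separated columns, rejoins them with single spaces (so the original column alignment is lost), and strips only the double quotes, leaving the apostrophe in place.

```python
def clean_atom_column(lines, index=3):
    """Return new lines with double quotes stripped from one column.

    Hypothetical helper: assumes whitespace-separated columns.
    """
    cleaned = []
    for line in lines:
        columns = line.split()
        if len(columns) > index:
            # store the cleaned value back into the row itself
            columns[index] = columns[index].replace('"', '')
        # rejoin with single spaces; the original alignment is not kept
        cleaned.append(" ".join(columns) + "\n")
    return cleaned


# shortened, made-up rows in the style of the excerpt above
sample = ['ATOM 595 O "O5\'" . DG B 2 11\n', 'ATOM 603 N N9 . DG B 2 11\n']
result = clean_atom_column(sample)
```

The returned list could then be passed to `f.writelines(...)` in place of `lines`.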

EDIT 2 (FINAL):

I understood: the double quotation marks (shown in the console as '"O5\'"') enclose the apostrophe character (\') that is used to number the atoms of the sugar (deoxyribose in this case) in the nucleotide, so I cannot delete the apostrophes, because they have a functional meaning. Having understood this, I solved the problem by removing the double quotes and replacing the apostrophe character with its ASCII character (chr(39)), in this way:

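for x in atom_record_rows_list:
    atom = x[3]
    #print atom
    # drop the enclosing double quotes, e.g. "O5'" becomes O5'
    no_double_quotes = atom.replace('"', "").strip()
    #print no_double_quotes
    # chr(39) is the apostrophe itself, so this replace leaves the string unchanged
    atom_cleaned = no_double_quotes.replace("'", chr(39)).strip()
    x[3] = atom_cleaned
    print x[3]

# named "dictionary" rather than "dict" to avoid shadowing the built-in type
dictionary = {"C": {y:{x[3]:{} for x in atom_record_rows_list if x[8] == y} for y in rlist}}
print dictionary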

It sounds like you are making this more difficult than it needs to be. You can simply iterate over the lines of the file, split each one, and add the parts to the dictionary:

dictionary = { 'C': { r : {} for r in ['20', '21'] }}
with open('<filename>', 'r') as file:
    for line in file:
        words = line.split()
        dictionary['C'][words[1]][words[0]] = {}

You can extract the sublists if you really need them:

sublist_1 = dictionary['C']['20'].keys()
sublist_2 = dictionary['C']['21'].keys()

However, remember that plain dictionaries in Python 2 (and in Python 3 before 3.7) do not guarantee any order, so the keys may come out in a different order from the one in your file.
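
If the original order matters, one option is to collect the atom names in lists, which keep insertion order, and to use `collections.OrderedDict` so that the numbers stay in order of first appearance. This is only a minimal sketch: the variable names and the shortened inline sample are illustrative assumptions, not part of the answer above.

```python
from collections import OrderedDict

# shortened stand-in for the two-column file
text = """N     20
CA    20
C     20
N     21
CA    21
SG    21"""

# number -> list of atom names, both in file order
sublists = OrderedDict()
for line in text.splitlines():
    atom, number = line.split()
    sublists.setdefault(number, []).append(atom)

# nested structure built from the ordered sublists
dictionary = {'C': OrderedDict(
    (number, OrderedDict((atom, {}) for atom in atoms))
    for number, atoms in sublists.items())}
```

Here `sublists['20']` and `sublists['21']` play the role of `sublist_1` and `sublist_2` from the question.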

You can use a dict comprehension to do this for you:

inp = """N     20
CA    20
C     20
O     20
CB    20
CG    20
CD    20
CE    20
NZ    20
N     21
CA    21
C     21
O     21
CB    21
SG    21"""

mappings = [i.split() for i in inp.split("\n")]
rlist = set(x[1] for x in mappings)
dicts = {"C": {y:{x[0]:{} for x in mappings if  x[1] == y} for y in rlist}}

>>> print dicts
{'C': 
 {'20': {'C': {},
   'CA': {},
   'CB': {},
   'CD': {},
   'CE': {},
   'CG': {},
   'N': {},
   'NZ': {},
   'O': {}},
  '21': {'C': {}, 
   'CA': {}, 
   'CB': {}, 
   'N': {}, 
   'O': {}, 
   'SG': {}}
 }
}
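
The comprehension above scans `mappings` once for every distinct number. As a hedged alternative, when rows with the same number sit next to each other, as they do in the sample, `itertools.groupby` can do the grouping in a single pass. The sketch below uses a shortened sample, and the name `groups` is an assumption; `setdefault` is there so that a number reappearing later in the file would be merged rather than overwritten.

```python
from itertools import groupby

# shortened stand-in for the sample input
inp = """N     20
CA    20
C     20
N     21
CA    21
SG    21"""

mappings = [line.split() for line in inp.splitlines()]

groups = {"C": {}}
# groupby yields one (number, rows) pair per run of consecutive equal numbers
for number, rows in groupby(mappings, key=lambda row: row[1]):
    groups["C"].setdefault(number, {}).update((row[0], {}) for row in rows)
```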

  1. Read the file with the read method.
  2. Create the result dictionary.
  3. Split the file content on newlines, i.e. split('\n').
  4. Iterate over every line from step 3 with a for loop.
  5. Get the two column values from each line with split(" ").
  6. If the counter key is missing from the result dictionary, add it in the except block.
  7. Add the element name dictionary to the counter dictionary.

code:

import pprint

with open("/home/infogrid/Desktop/Work/stack/input.txt", "r") as fp:
    data = fp.read()

result = {'C':{}}
for i in data.strip().split('\n'):
    # split on spaces and drop the empty strings left by repeated spaces
    val_count = [j for j in i.split(' ') if j]
    try:
        result['C'][val_count[1]][val_count[0]] = {}
    except KeyError:
        # first time this counter value is seen: create its dictionary
        result['C'][val_count[1]] = {}
        result['C'][val_count[1]][val_count[0]] = {}

pprint.pprint(result)

output:

{'C': {'20': {'C': {},
              'CA': {},
              'CB': {},
              'CD': {},
              'CE': {},
              'CG': {},
              'N': {},
              'NZ': {},
              'O': {}},
       '21': {'C': {}, 'CA': {}, 'CB': {}, 'N': {}, 'O': {}, 'SG': {}}}}

Use collections.defaultdict to remove the try-except block from the code:

>>> from collections import defaultdict
>>> result = {'C':defaultdict(dict)}
>>> result['C']['20']['CB'] = {}
>>> result['C']['20']
{'CB': {}}
>>> result['C']['21']
{}
>>> 
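
Applied to the parsing loop, this could look like the following minimal sketch. The inline `data` string stands in for the file contents and is an assumption for illustration. Note that merely reading a missing key, as `result['C']['21']` does in the session above, also creates that key.

```python
from collections import defaultdict

# shortened stand-in for the contents of the input file
data = """N     20
CA    20
N     21
SG    21"""

result = {'C': defaultdict(dict)}
for line in data.splitlines():
    name, counter = line.split()
    # a missing counter key is created automatically as an empty dict
    result['C'][counter][name] = {}

# convert back to a plain dict, e.g. for printing
plain = {'C': dict(result['C'])}
```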
