
Extracting columns from textfile in python: dealing with blank entries

I have a .txt file of the following shape. Impractically, unknown values are simply blank:

----Header---
Description, 
a few lines of description
Still description

  #  RESIDUE AA STRUCTURE BP1 BP2  
 1    79 A G              0    0    97      
 2    80 A A        -     0    0    28    
 3    81 A V  E     -A  134    0A   53    
 4    82 A F  E     -A  133    0A    6    
 5    83 A K  E     -A  132    0A   52    
11         !              0    0     0
12   101 A D  H           0    0   137

I want to extract the 2nd, 4th and 5th columns, and missing values should be taken into account rather than skipped. So, what I want would be:

function(textfile,1,3,4)
>[79,80,81,82,83,"",101]
>["G","A","V","F","K","!","D"]
>["","","E","E","E","","H"]

The exact shape of the output does not matter; it could, e.g., be an n x 3 array or something similar. Because unknown values were left blank, I cannot use np.loadtxt: it splits on whitespace, so a blank field is skipped and the next value is read into the wrong column.
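To make the problem concrete, here is a minimal sketch of what any whitespace-based split does with two rows of the sample table. The variable names are only illustrative.

```python
# Two rows copied from the sample table (trailing spaces removed).
full_row = " 3    81 A V  E     -A  134    0A   53"
sparse_row = "11         !              0    0     0"

# Splitting on whitespace drops the blank fields entirely.
full_tokens = full_row.split()
sparse_tokens = sparse_row.split()

print(full_tokens)    # ['3', '81', 'A', 'V', 'E', '-A', '134', '0A', '53']
print(sparse_tokens)  # ['11', '!', '0', '0', '0']

# The AA value is token 3 in the full row but token 1 in the sparse row,
# so a fixed token index cannot identify the column.
print(full_tokens.index("V"), sparse_tokens.index("!"))  # 3 1
```

Only the character position of a value identifies its column here, which is why a fixed-width approach is needed.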

Have you tried using pandas.read_csv with the delimiter set to whitespace?

e.g.

pandas.read_csv('filename.txt', sep=r'\s+')

It also looks like you are missing a column name.

You could investigate using Pandas as follows:

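import pandas as pd

# read_fwf parses fixed-width columns, so blank fields stay in their own column
print(pd.read_fwf('input.txt', widths=(4, 5, 2, 2, 3, 7, 5, 6, 5), usecols=[1, 3, 4], skiprows=6, header=None))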

This would display:

      1  3    4
0   79.0  G  NaN
1   80.0  A  NaN
2   81.0  V    E
3   82.0  F    E
4   83.0  K    E
5    NaN  !  NaN
6  101.0  D    H

Alternatively you could just extract the necessary columns manually as follows:

import itertools

col_locations = [(3,8), (11, 12), (13,15)]

with open('input.txt') as f_input:
    # Skip over initial lines until the header row
    next(itertools.dropwhile(lambda x: "RESIDUE" not in x, f_input))
    lines = [row.rstrip() for row in f_input]

data = []    
for row in lines:
    data.append([row[start:end].strip() for start, end in col_locations])

data = list(zip(*data))       # Transpose the data (rows -> columns)
print(data)

This would give you a list as follows:

[('79', '80', '81', '82', '83', '', '101'), ('G', 'A', 'V', 'F', 'K', '!', 'D'), ('', '', 'E', 'E', 'E', '', 'H')]

If you really want the first column converted to numbers, you could apply a per column conversion function as follows:

import itertools

def num_convert(x):
    try:
        return int(x)
    except ValueError:        # blank or non-numeric field
        return ''

col_locations = [(3, 8, num_convert), (11, 12, str.strip), (13, 15, str.strip)]

with open('input.txt') as f_input:
    # Skip over initial lines until the header row
    next(itertools.dropwhile(lambda x: "RESIDUE" not in x, f_input))
    lines = [row.rstrip() for row in f_input]

data = []    
for row in lines:
    data.append([conversion(row[start:end]) for start, end, conversion in col_locations])

data = list(zip(*data))       # Transpose the data (rows -> columns)
print(data)

Giving you:

[(79, 80, 81, 82, 83, '', 101), ('G', 'A', 'V', 'F', 'K', '!', 'D'), ('', '', 'E', 'E', 'E', '', 'H')]
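The slicing approach can also be wrapped in a helper that resembles the function(textfile,1,3,4) call from the question. This is only a sketch: extract_columns and COLUMN_SLICES are hypothetical names, the slices for columns 1, 3 and 4 are the ones used above, and the slices for columns 0 and 2 are assumptions read off the sample table.

```python
# (start, end) character positions per column index. Columns 1, 3 and 4 reuse
# the slices from the answer above; columns 0 and 2 are assumed.
COLUMN_SLICES = [(0, 3), (3, 8), (8, 10), (11, 12), (13, 15)]

def extract_columns(lines, *indices):
    """Return one list per requested column; blank fields become ''."""
    rows = iter(lines)
    # Skip everything up to and including the header row.
    for line in rows:
        if "RESIDUE" in line:
            break
    columns = [[] for _ in indices]
    for line in rows:
        if not line.strip():
            continue  # ignore empty lines after the table
        for column, index in zip(columns, indices):
            start, end = COLUMN_SLICES[index]
            column.append(line[start:end].strip())
    return columns

# A shortened copy of the sample file, given as a list of lines.
sample_lines = [
    "----Header---",
    "Description, ",
    "",
    "  #  RESIDUE AA STRUCTURE BP1 BP2  ",
    " 1    79 A G              0    0    97      ",
    " 3    81 A V  E     -A  134    0A   53    ",
    "11         !              0    0     0",
    "12   101 A D  H           0    0   137",
]

result = extract_columns(sample_lines, 1, 3, 4)
print(result)
# [['79', '81', '', '101'], ['G', 'V', '!', 'D'], ['', 'E', '', 'H']]
```

Because the helper accepts any iterable of lines, an open file object can be passed in place of sample_lines.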

You can use the struct module:

import struct
line = ' 5    83 A K  E     -A  132    0A   52    '
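# struct works on bytes in Python 3, so encode before unpacking and decode afterwards
fields = struct.unpack("6s3s2s3s6s4s7s5s6s", line[:42].encode("ascii"))
extracted_line = [field.decode("ascii").strip() for field in fields]
print(extracted_line)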

This will probably need some adjustment, because I don't know whether the values shift left or right as they grow. But it is one way to do it.
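One adjustment that is likely needed: struct.unpack raises an error when the input is shorter than the format, and some rows in the sample have no trailing spaces. A minimal sketch, assuming the same field widths as above, pads each row to the record size first. The name unpack_row is hypothetical.

```python
import struct

FIELD_FORMAT = "6s3s2s3s6s4s7s5s6s"          # same widths as in the snippet above
RECORD_SIZE = struct.calcsize(FIELD_FORMAT)  # 42 bytes in total

def unpack_row(line):
    """Split one fixed-width row into stripped fields, padding short rows."""
    # Pad with spaces (or truncate) so the row is exactly RECORD_SIZE long.
    padded = line.rstrip("\n").ljust(RECORD_SIZE)[:RECORD_SIZE]
    fields = struct.unpack(FIELD_FORMAT, padded.encode("ascii"))
    return [field.decode("ascii").strip() for field in fields]

# This row from the sample table is shorter than 42 characters.
short_row = "11         !              0    0     0"
fields = unpack_row(short_row)
print(fields[0], repr(fields[1]), fields[3])  # 11 '' !
```

Blank fields come back as empty strings, so every row yields the same number of fields.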


 