
How do I extract floats from a file in Python?

So, I have a file that looks like this:

# 3e98.mtz MR_AUTO with model 200la_.pdb
SPACegroup HALL P 2yb #P 1 21 1
SOLU SET RFZ=3.0 TFZ=4.7 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 321.997 124.066 234.744 FRAC -0.14681 0.50245 -0.05722
SOLU SET RFZ=3.3 TFZ=4.2 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 329.492 34.325 209.775 FRAC 0.70297 0.00106 -0.24023
SOLU SET RFZ=3.6 TFZ=3.6 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 177.344 78.287 187.356 FRAC 0.04890 0.00090 -0.57497

What's the best way to iterate through this file and extract only the floating point numbers?

The 'best' scenario in this case would be extracting only the numbers similar to "321.997" (which are virus cell structure coordinates) and adding them to a list. In each file that I am looking at, there is 6 numbers like that in each line. After I pull those numbers, I can use the list in a method I've already written to calculate new coordinates for rotating the cell structure to match others in a data set.

Here's one way.

def floats( aList ):
    for v in aList:
        try:
            yield float(v)
        except ValueError:
            pass

a = list( floats( [....] ) )

floats = []
# Named "tokens" rather than "all" so the built-in all() is not shadowed.
tokens = ['#', '3e98.mtz', 'MR_AUTO', 'with', 'model', '200la_.pdb', 'SPACegroup', 'HALL', 'P', '2yb', '#P', '1', '21', '1', 'SOLU', 'SET', 'RFZ=3.0', 'TFZ=4.7', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_', 'EULER', '321.997', '124.066', '234.744', 'FRAC', '-0.14681', '0.50245', '-0.05722', 'SOLU', 'SET', 'RFZ=3.3', 'TFZ=4.2', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_', 'EULER', '329.492', '34.325', '209.775', 'FRAC', '0.70297', '0.00106', '-0.24023', 'SOLU', 'SET', 'RFZ=3.6', 'TFZ=3.6', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_', 'EULER', '177.344', '78.287', '187.356', 'FRAC', '0.04890', '0.00090', '-0.57497']
for element in tokens:
    try:
        floats.append(float(element))
    except ValueError:
        pass

def is_float(i):
    try:
        float(i)
        return True
    except ValueError:
        return False


L=['#', '3e98.mtz', 'MR_AUTO', 'with', 'model', '200la_.pdb', 'SPACegroup', 'HALL', 'P', '2yb', '#P', '1', '21', '1', 'SOLU', 'SET', 'RFZ=3.0', 'TFZ=4.7', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_', 'EULER', '321.997', '124.066', '234.744', 'FRAC', '-0.14681', '0.50245', '-0.05722', 'SOLU', 'SET', 'RFZ=3.3', 'TFZ=4.2', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_', 'EULER', '329.492', '34.325', '209.775', 'FRAC', '0.70297', '0.00106', '-0.24023', 'SOLU', 'SET', 'RFZ=3.6', 'TFZ=3.6', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_', 'EULER', '177.344', '78.287', '187.356', 'FRAC', '0.04890', '0.00090', '-0.57497']
print(list(filter(is_float, L)))

If you display your input in a manner that discourages answerers from examining its structure, and you ask questions like "how do I extract only the floating point numbers", and bury useful information like "In each file that I am looking at, there is 6 numbers like that in each line" in comments, you will get knee-jerk answers providing exactly what you asked for: a list of "floats" that includes 3 spurious numbers (1.0, 21.0, and 1.0) at the front of the list.
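The spurious values are easy to reproduce. Here is a minimal sketch, assuming the whole file is simply split on whitespace; the `sample` string copies the lines from the question, and `naive_floats` is just an illustrative name:

```python
# The sample text from the question, reproduced verbatim.
sample = """\
# 3e98.mtz MR_AUTO with model 200la_.pdb
SPACegroup HALL P 2yb #P 1 21 1
SOLU SET RFZ=3.0 TFZ=4.7 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 321.997 124.066 234.744 FRAC -0.14681 0.50245 -0.05722
SOLU SET RFZ=3.3 TFZ=4.2 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 329.492 34.325 209.775 FRAC 0.70297 0.00106 -0.24023
SOLU SET RFZ=3.6 TFZ=3.6 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 177.344 78.287 187.356 FRAC 0.04890 0.00090 -0.57497
"""

def naive_floats(text):
    """Return every whitespace-separated token that float() accepts."""
    result = []
    for token in text.split():
        try:
            result.append(float(token))
        except ValueError:
            pass  # not a number, e.g. 'SOLU' or 'RFZ=3.0'
    return result

values = naive_floats(sample)
print(values[:3])   # [1.0, 21.0, 1.0] -- from "#P 1 21 1", not coordinates
print(len(values))  # 21: the 18 wanted numbers plus the 3 spurious ones
```

Anything that consumes this list six values at a time will be misaligned from the very first record.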

If you display your data in a slightly more congenial fashion, like:

alist = [
    '#', '3e98.mtz', 'MR_AUTO', 'with', 'model', '200la_.pdb', 'SPACegroup', 'HALL', 'P', '2yb',
    '#P', '1', '21', '1', 
    'SOLU', 'SET', 'RFZ=3.0', 'TFZ=4.7', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_',
        'EULER', '321.997', '124.066', '234.744', 'FRAC', '-0.14681', '0.50245', '-0.05722',
    'SOLU', 'SET', 'RFZ=3.3', 'TFZ=4.2', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_',
        'EULER', '329.492', '34.325', '209.775', 'FRAC', '0.70297', '0.00106', '-0.24023',
    'SOLU', 'SET', 'RFZ=3.6', 'TFZ=3.6', 'PAK=0', 'LLG=30', 'SOLU', '6DIM', 'ENSE', '200la_', 
        'EULER', '177.344', '78.287', '187.356', 'FRAC', '0.04890', '0.00090', '-0.57497'
    ]

there is some chance that people will notice the structure (EULER followed by three numbers then FRAC followed by three numbers) repeated and go "Oho, six numbers per line in his file" and come back with some more useful advice, like:

Start at the beginning: tell us what your file structure is. There is likely to be a better way of getting your information than smashing your file into a list of strings and then attempting to recover from that.

Update: In the meantime, here is an answer that uses the structure that is evident in your data and comments and will be more debuggable if there are variations in the structure:

TAG0 = 'EULER'
TAG1 = 'FRAC'

def extract_rows(tokens):
    pos = 0
    while True:
        try:
            pos = tokens.index(TAG0, pos)
        except ValueError:
            return
        assert pos + 8 <= len(tokens)
        assert tokens[pos+4] == TAG1
        yield (
            tuple(map(float, tokens[pos+1:pos+4])),
            tuple(map(float, tokens[pos+5:pos+8])),
            )
        pos += 8

for rowx, row in enumerate(extract_rows(alist)):
    print(rowx, 'TAG0', row[0])
    print(rowx, 'TAG1', row[1])

Results (under Python 3):

0 TAG0 (321.997, 124.066, 234.744)
0 TAG1 (-0.14681, 0.50245, -0.05722)
1 TAG0 (329.492, 34.325, 209.775)
1 TAG1 (0.70297, 0.00106, -0.24023)
2 TAG0 (177.344, 78.287, 187.356)
2 TAG1 (0.0489, 0.0009, -0.57497)
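To illustrate the "more debuggable" point, here is a small sketch that feeds the same generator a record whose FRAC tag is missing. The `bad` list is a made-up example, not something from the poster's files; `extract_rows` is repeated only so the snippet runs on its own:

```python
TAG0 = 'EULER'
TAG1 = 'FRAC'

def extract_rows(tokens):
    # Same generator as in the answer, repeated so this snippet is standalone.
    pos = 0
    while True:
        try:
            pos = tokens.index(TAG0, pos)
        except ValueError:
            return
        assert pos + 8 <= len(tokens)
        assert tokens[pos+4] == TAG1
        yield (
            tuple(map(float, tokens[pos+1:pos+4])),
            tuple(map(float, tokens[pos+5:pos+8])),
            )
        pos += 8

# Hypothetical malformed record: the FRAC tag has been dropped.
bad = ['SOLU', '6DIM', 'ENSE', '200la_',
       'EULER', '321.997', '124.066', '234.744',
       '-0.14681', '0.50245', '-0.05722', 'SOLU']

try:
    rows = list(extract_rows(bad))
    failed = False
except AssertionError:
    # The structure check trips instead of silently yielding shifted numbers.
    failed = True

print(failed)  # True
```

One caveat: `assert` statements are skipped when Python runs with `-O`, so for production use an explicit `raise ValueError(...)` would probably be safer.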

Update 2: Based on your example file, the following simple code (untested) should do what you want:

with open('my_file.txt') as f:
    for line in f:
        row = line.split()
        # Check the length first so blank or short lines don't raise IndexError.
        if (len(row) >= 12 and row[0] == 'SOLU' and row[1] == '6DIM'
                and row[4] == 'EULER' and row[8] == 'FRAC'):
            euler = list(map(float, row[5:8]))
            frac = list(map(float, row[9:12]))
            do_something_with(euler, frac)

Note: it's only a coincidence that what you are looking for is "all of the floating point numbers" (which ignores the floating point numbers in RFZ=3.0 TFZ=4.7 anyway!). What you have is a file with STRUCTURE: two types of SOLU records, and you want the 3 numbers that appear after EULER and the 3 after FRAC in the SOLU 6DIM records. You DON'T want a list of all of those numbers and have to split them up again into (3 EULER numbers and 3 FRAC numbers) times N.
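The point about structure can be sketched as a function that keeps each record's EULER and FRAC triples together. This is only a sketch under the assumption that every SOLU 6DIM line has the token layout shown in the question; `parse_solutions` is an illustrative name, and `io.StringIO` stands in for the real file:

```python
import io

def parse_solutions(lines):
    """Collect one (euler, frac) pair per 'SOLU 6DIM' record."""
    solutions = []
    for line in lines:
        row = line.split()
        # The length check keeps blank and short lines from raising IndexError.
        if (len(row) >= 12 and row[:2] == ['SOLU', '6DIM']
                and row[4] == 'EULER' and row[8] == 'FRAC'):
            euler = tuple(float(x) for x in row[5:8])
            frac = tuple(float(x) for x in row[9:12])
            solutions.append((euler, frac))
    return solutions

# Stand-in for open('my_file.txt'); the text is the sample from the question.
sample = io.StringIO("""\
# 3e98.mtz MR_AUTO with model 200la_.pdb
SPACegroup HALL P 2yb #P 1 21 1
SOLU SET RFZ=3.0 TFZ=4.7 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 321.997 124.066 234.744 FRAC -0.14681 0.50245 -0.05722
SOLU SET RFZ=3.3 TFZ=4.2 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 329.492 34.325 209.775 FRAC 0.70297 0.00106 -0.24023
SOLU SET RFZ=3.6 TFZ=3.6 PAK=0 LLG=30
SOLU 6DIM ENSE 200la_ EULER 177.344 78.287 187.356 FRAC 0.04890 0.00090 -0.57497
""")

solutions = parse_solutions(sample)
for euler, frac in solutions:
    print(euler, frac)
```

Each element is already a pair of 3-tuples, so nothing has to be regrouped later, and the "1 21 1" from the SPACegroup line never gets in.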
