
Importing data from a text file using python

I have a text file containing data in rows and columns (~17000 rows in total). Each column is a uniform number of characters long, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters long, but the last four characters in that column are always spaces (so that it appears to be a nice column when viewed with a text editor). Sometimes it's more than four if the entry is less than 7 characters.

The columns are not otherwise separated by commas, tabs, or spaces. They are also not all the same number of characters (the first two are 11, the next two are 8 and the last one is 5 - but again, some are spaces).

What I want to do is import the entries (which are numbers) in the last two columns if the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.

Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that will lazily read your file and return tuples of numbers that match your criteria:

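import struct

def parsefile(filename):
    # Open in binary mode: struct.unpack needs bytes, not text.
    with open(filename, 'rb') as myfile:
        for line in myfile:
            line = line.rstrip(b'\r\n')
            # Five fixed-width fields: 11 + 11 + 8 + 8 + 5 = 43 bytes.
            fields = struct.unpack('11s11s8s8s5s', line)
            if b'OW' in fields[1]:
                # int() tolerates the space padding around the digits.
                yield (int(fields[3]), int(fields[4]))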

Usage:

if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print(field)

Test data:

1234567890a1234567890a123456781234567812345
something  maybe OW d 111111118888888855555
aaaaa      bbbbb      1234    1212121233333
other thinganother OW 121212  6666666644444

Output:

(88888888, 55555)
(66666666, 44444)
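
One caveat: struct.unpack raises struct.error unless the line is exactly as long as the format string requires (43 bytes for this layout), so a blank or truncated line will stop the loop. Below is a minimal sketch of a guard, assuming you would rather skip such lines than crash. The names RECORD and split_fixed are made up for illustration and are not part of the answer above.

```python
import struct

# Precompiled layout from the question: 11 + 11 + 8 + 8 + 5 = 43 bytes.
RECORD = struct.Struct('11s11s8s8s5s')

def split_fixed(line):
    """Return the five raw fields of one line, or None if the width is wrong."""
    line = line.rstrip(b'\r\n')
    if len(line) != RECORD.size:
        return None  # assumption: malformed lines are skipped, not reported
    return RECORD.unpack(line)
```

Inside the reading loop you would call split_fixed(line) and continue to the next line whenever it returns None.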

In Python you can extract a substring at known positions using a slice - this is normally done with the list[start:end] syntax. However you can also create slice objects that you can use later to do the indexing.

So you can do something like this:

columns = [slice(11, 22), slice(30, 38), slice(38, 43)]

with open('some/file/path') as myfile:
    for line in myfile:
        # strip() removes the space padding from each field
        fields = [line[column].strip() for column in columns]
        if "OW" in fields[0]:
            value1 = int(fields[1])
            value2 = int(fields[2])
            # ...

Separating out the slices into a list makes it easy to change the code if the data format changes, or you need to do stuff with the other fields.
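
If the format is described by column widths rather than offsets, the slice list can be derived instead of typed by hand. This is only a sketch: the widths are the ones given in the question, and the sample line is taken from the test data in the first answer.

```python
from itertools import accumulate

widths = [11, 11, 8, 8, 5]       # column widths from the question
ends = list(accumulate(widths))  # running totals: [11, 22, 30, 38, 43]
starts = [0] + ends[:-1]         # each column starts where the previous one ends
columns = [slice(start, end) for start, end in zip(starts, ends)]

line = 'other thinganother OW 121212  6666666644444\n'
fields = [line[column].strip() for column in columns]
```

Changing a width then only means editing one number in the widths list.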

# myfile is an already-open file object
entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])

for num1, num2 in entries:
    pass  # whatever

Here's a function which might help you:

def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size: # EOF
                return
            row[key] = value
        yield row

For an example of how it's used:

from io import StringIO

sample = StringIO("""aaabbbccc
d  e  f  
g  h  i  
""")

# The last column is declared one character wider so it absorbs the newline.
for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):
    print(repr(row))

Note that, unlike the other answers, this example is not line-delimited: it uses the file purely as a provider of characters, not as an iterator of lines. Since you specifically mentioned that the fields were not separated, I assumed that the rows might not be either. The newline is accounted for explicitly, by making the last column one character wider.

You can test if one string is a substring of another with the 'in' operator. For example,

>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True

So in this case, you might do

if 'OW' in row['third']:
    stuff()

but you can obviously test any field for any value as you see fit.

entries = []
with open('my_file.txt', 'r') as f:
    for line in f.read().splitlines():
        # Note: split() assumes the columns are separated by whitespace
        line = line.split()
        if line[1].find('OW') >= 0:
            entries.append((int(line[-2]), int(line[-1])))

entries is a list of tuples, each holding the last two values of a matching row.

edit: oops
