
Importing data from a text file using Python

I have a text file containing data in rows and columns (~17000 rows in total). Each column is a fixed number of characters wide, with the 'unused' characters filled in by spaces. For example, the first column is 11 characters wide, but the last four characters in that column are always spaces (so that it appears as a neat column when viewed in a text editor). If an entry is shorter than 7 characters, there are more than four trailing spaces.

The columns are not otherwise separated by commas, tabs, or spaces. They also differ in width: the first two are 11 characters, the next two are 8 and the last one is 5 (but again, some of those characters are spaces).

What I want to do is import the entries (which are numbers) in the last two columns whenever the second column contains the string 'OW' somewhere in it. Any help would be greatly appreciated.

Python's struct.unpack is probably the quickest way to split fixed-length fields. Here's a function that lazily reads your file and yields tuples of numbers that match your criteria:

import struct

def parsefile(filename):
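    # struct.unpack needs bytes in Python 3, so read the file in binary mode
    with open(filename, 'rb') as myfile:
        for line in myfile:
            line = line.rstrip(b'\r\n')
            # field widths: 11, 11, 8, 8 and 5 bytes
            fields = struct.unpack('11s11s8s8s5s', line)
            if b'OW' in fields[1]:
                yield (int(fields[3]), int(fields[4]))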

Usage:

if __name__ == '__main__':
    for field in parsefile('file.txt'):
        print(field)

Test data:

1234567890a1234567890a123456781234567812345
something  maybe OW d 111111118888888855555
aaaaa      bbbbb      1234    1212121233333
other thinganother OW 121212  6666666644444

Output:

(88888888, 55555)
(66666666, 44444)
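A real file may contain blank or truncated lines, and struct.unpack raises struct.error when the input length does not match the format. The following is a minimal sketch of one way to guard against that, assuming Python 3 and assuming malformed lines should simply be skipped. The names RECORD, parse_lines and sample are made up for this illustration.

```python
import struct

# Field widths taken from the question: 11, 11, 8, 8 and 5 characters.
RECORD = struct.Struct('11s11s8s8s5s')

def parse_lines(lines):
    """Yield the last two columns as ints for rows whose second column contains 'OW'."""
    for raw in lines:
        data = raw.rstrip(b'\r\n')
        if len(data) != RECORD.size:
            continue  # assumption: skip blank or malformed lines instead of raising
        fields = RECORD.unpack(data)
        if b'OW' in fields[1]:
            yield int(fields[3]), int(fields[4])

# Same rows as the test data above, plus one deliberately short line.
sample = [
    b'1234567890a1234567890a123456781234567812345\n',
    b'something  maybe OW d 111111118888888855555\n',
    b'aaaaa      bbbbb      1234    1212121233333\n',
    b'short line\n',
    b'other thinganother OW 121212  6666666644444\n',
]
print(list(parse_lines(sample)))  # [(88888888, 55555), (66666666, 44444)]
```

Precompiling the format with struct.Struct also exposes RECORD.size, which makes the length check straightforward.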

In Python you can extract a substring at known positions using a slice. This is normally done with the list[start:end] syntax. However, you can also create slice objects and use them later to do the indexing.

So you can do something like this:

# second column, fourth column and fifth (5-character) column
columns = [slice(11, 22), slice(30, 38), slice(38, 43)]

with open('some/file/path') as myfile:
    for line in myfile:
        fields = [line[column].strip() for column in columns]
        if "OW" in fields[0]:
            value1 = int(fields[1])
            value2 = int(fields[2])
            ...

Separating out the slices into a list makes it easy to change the code if the data format changes, or if you need to do something with the other fields.
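If the format is described by column widths rather than positions, the slice objects can be derived instead of typed by hand. This is a small sketch under that assumption, using the widths from the question (11, 11, 8, 8, 5). The helper name split_fixed is hypothetical.

```python
from itertools import accumulate

# Column widths as described in the question.
widths = [11, 11, 8, 8, 5]

# Running totals give the end offset of each column: [11, 22, 30, 38, 43].
ends = list(accumulate(widths))
starts = [0] + ends[:-1]
columns = [slice(start, end) for start, end in zip(starts, ends)]

def split_fixed(line):
    """Cut one line into its fixed-width fields and strip the padding."""
    return [line[column].strip() for column in columns]

row = split_fixed('other thinganother OW 121212  6666666644444\n')
print(row)  # ['other thing', 'another OW', '121212', '66666666', '44444']
```

With this approach a format change only requires editing the widths list.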

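with open('file.txt') as myfile:
    # lazily pick the last two columns of every row whose second column contains "OW"
    entries = ((float(line[30:38]), float(line[38:43])) for line in myfile if "OW" in line[11:22])

    for num1, num2 in entries:
        pass  # whatever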

Here's a function which might help you:

def rows(f, columnSizes):
    while True:
        row = {}
        for (key, size) in columnSizes:
            value = f.read(size)
            if len(value) < size: # EOF
                return
            row[key] = value
        yield row

Here is an example of how it's used:

from io import StringIO

# each row is three 3-character fields followed by a newline
sample = StringIO("aaabbbccc\nd  e  f  \ng  h  i  \n")

for row in rows(sample, [('first', 3),
                         ('second', 3),
                         ('third', 4)]):  # 3 characters plus the newline
    print(repr(row))

Note that unlike the other answers, this example is not line-delimited: it treats the file purely as a stream of characters, not as an iterator of lines. Since you specifically mentioned that the fields are not separated, I assumed that the rows might not be either. The newline is accounted for explicitly by making the last field one character wider.

You can test whether one string is a substring of another with the 'in' operator. For example:

>>> 'OW' in 'hello'
False
>>> 'OW' in 'helOWlo'
True

So in this case, you might do

if 'OW' in row['third']:
    stuff()

but you can obviously test any field for any value as you see fit.
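To connect this to the layout in the question, here is a sketch that reads two of the test rows as one continuous stream and filters on the second column. It assumes every row ends in a single '\n', so the last field is read as 6 characters (5 plus the newline). The field names 'a' to 'e' and the variable names are placeholders chosen for this illustration.

```python
from io import StringIO

def rows(f, column_sizes):
    """Read fixed-size fields from a stream until it runs out."""
    while True:
        row = {}
        for key, size in column_sizes:
            value = f.read(size)
            if len(value) < size:  # EOF
                return
            row[key] = value
        yield row

# Widths from the question; the last field includes the trailing newline.
layout = [('a', 11), ('b', 11), ('c', 8), ('d', 8), ('e', 6)]

data = StringIO(
    'something  maybe OW d 111111118888888855555\n'
    'aaaaa      bbbbb      1234    1212121233333\n'
)

# int() ignores surrounding whitespace, including the newline kept in field 'e'.
matches = [(int(r['d']), int(r['e'])) for r in rows(data, layout) if 'OW' in r['b']]
print(matches)  # [(88888888, 55555)]
```

If the file uses '\r\n' line endings, the last width would need to be 7 instead.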

entries = []
with open('my_file.txt', 'r') as f:
  for line in f.read().splitlines():
    line = line.split()  # splits on whitespace
    if 'OW' in line[1]:
      entries.append((int(line[-2]), int(line[-1])))

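entries is a list of tuples containing the last two entries of each matching row. Note that split() only works when the columns are separated by whitespace, which the question says is not the case for this file.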

Edit: oops
