Read and write data from text file to numpy column in python

I've been struggling to parse the text file format shown below. My overall goal is to extract the values of one of the variable names across the entire file. For example, I want all the values from the [b] rows and the [d] rows, so that I can put them in a normal numpy array and run calculations on them.

Here is what the data file looks like:

[SECTION1a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION1b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48     204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208   104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255   110 
[END SECTION1]
[SECTION2a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION2b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48   204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208 104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255 110 
[END SECTION2]

That pattern continues for N sections.

Currently I read the file and put it into two columns:

filename_load = fileopenbox(msg=None, title='Load Data File',
                        default="Z:\*",
                        filetypes=None)

col1_data = np.genfromtxt(filename_load, skip_header=1, dtype=None, 
usecols=(0,), usemask=True, invalid_raise=False)

col2_data = np.genfromtxt(filename_load, skip_header=1, dtype=None, 
usecols=(1,), usemask=True, invalid_raise=False)

I was then going to use np.where to find the indices of the value I wanted and build a new array from those values:

arr_index = np.where(col1_data == '[b]')
new_array = col2_data[arr_index]

The problem is that I end up with two arrays of different sizes because of the irregular file format, so the data in the second array do not line up with the right variable names.

I have tried a few other alternatives and got stuck on the odd file format and on how to read it into Python.

I'm not sure whether I should stay on this track (and if so, how to address the problem) or try a totally different approach.

Thanks in advance!

A possible solution is to sort your data into a hierarchy of OrderedDict() dictionaries:

from collections import OrderedDict
import re


ss = """[SECTION1a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION1b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48     204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208   104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255   110
[END SECTION1]
[SECTION2a]
[a] 1424457484310
[b] 5313402937
[c] 873348378938
[d] 882992596992
[e] 14957596088
[SECTION2b]
243 62 184 145 250 180 106 208 248 87 186 137 127 204 18 142 37 67 36 72 48   204 255 30 243 78 44 121 112 139 76 71 131 50 118 10 42 8 67 4 98 110 37 5 208 104 56 55 225 56 0 102 0 21 0 156 0 174 255 171 0 42 0 233 0 50 0 254 0 245 255 110
[END SECTION2]"""

# regular expressions for matching SECTIONs
# (raw strings, so the backslashes reach the regex engine unchanged)
p1 = re.compile(r"^\[SECTION[0-9]+a\]")
p2 = re.compile(r"^\[SECTION[0-9]+b\]")
p3 = re.compile(r"^\[END SECTION[0-9]+\]")

def parse(ss):
    """Make a hierarchical dict from a string."""
    ll, l_cnt = ss.splitlines(), 0
    d = OrderedDict()
    while l_cnt < len(ll):  # iterate through lines
        l = ll[l_cnt].strip()
        if p1.match(l):  # new sub dict for [SECTION*a]
            dd, nn = OrderedDict(), l[1:-1]
            l_cnt += 1
            # read "[key] value" lines until the next section marker (or end of input)
            while (l_cnt < len(ll) and
                   p2.match(ll[l_cnt].strip()) is None and
                   p3.match(ll[l_cnt].strip()) is None):
                ww = ll[l_cnt].split()
                if len(ww) >= 2:  # skip blank or value-less lines
                    dd[ww[0][1:-1]] = int(ww[1])
                l_cnt += 1
            d[nn] = dd
        elif p2.match(l):  # array of ints for [SECTION*b]
            d[l[1:-1]] = [int(w) for w in ll[l_cnt+1].split()]
            l_cnt += 2
        else:  # [END SECTION*], blank or unrecognized line: skip it
            l_cnt += 1
    return d

dd = parse(ss)

Note that you can get much more robust code if you use an existing parsing tool (e.g., Parsley).

To retrieve '[c]' from all sections, do:

print("All entries for [c]: ", end="")
cc = [d['c'] for s,d in dd.items() if s.endswith('a')]
print(", ".join(["{}".format(c) for c in cc]))    
# Gives: All entries for [c]: 873348378938, 873348378938
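
Since the goal in the question is a numpy array of the [b] and [d] values, the same comprehension can feed np.array directly. This is only a minimal sketch: the small `parsed` dict is a hand-written stand-in for the output of parse() so that the snippet runs on its own, and `column` is a hypothetical helper name, not part of the answer above.

```python
from collections import OrderedDict

import numpy as np

# Hand-written stand-in for the dict returned by parse();
# the numbers are copied from the sample file.
parsed = OrderedDict([
    ("SECTION1a", OrderedDict([("b", 5313402937), ("d", 882992596992)])),
    ("SECTION1b", [243, 62, 184]),
    ("SECTION2a", OrderedDict([("b", 5313402937), ("d", 882992596992)])),
    ("SECTION2b", [243, 62, 184]),
])

def column(parsed, key):
    """Collect `key` from every SECTION*a sub-dict into a 1-D integer array."""
    values = [sub[key] for name, sub in parsed.items() if name.endswith("a")]
    # int64 is requested explicitly because the values do not fit in 32 bits
    return np.array(values, dtype=np.int64)

b = column(parsed, "b")
d = column(parsed, "d")
print(b, d, (d - b).mean())
```

A KeyError here would mean that some section lacks the requested key; whether to skip such sections or fill in a placeholder depends on the calculation you want to run.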

Or you could traverse the whole dictionary:

def print_recdicts(d, tbw=0):
    """Print the hierarchical dict."""
    for k,v in d.items():
        if isinstance(v, OrderedDict):
            print(" "*tbw + "* {}:".format(k))
            print_recdicts(v, tbw+2)
        else:
            print(" "*tbw + "* {}: {}".format(k,v))

print_recdicts(dd)
# Gives:
# * SECTION1a:
#   * a: 1424457484310
#   * b: 5313402937
# ...
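
The [SECTION*b] entries are plain lists of ints, so they can be stacked into a 2-D array in the same spirit. A rough sketch, assuming every [SECTION*b] line holds the same number of values (np.array raises a ValueError here otherwise); the `parsed` dict is again a shortened, hand-written stand-in for the parser output:

```python
import numpy as np

# Hand-written stand-in for the parsed dict; only the SECTION*b entries matter here.
parsed = {
    "SECTION1a": {"b": 5313402937},
    "SECTION1b": [243, 62, 184, 145],
    "SECTION2a": {"b": 5313402937},
    "SECTION2b": [243, 62, 184, 145],
}

# One row per section, in file order (dicts preserve insertion order).
rows = [v for k, v in parsed.items() if k.endswith("b")]
samples = np.array(rows, dtype=np.int64)  # assumes all rows have equal length
print(samples.shape)
print(samples.mean(axis=1))  # one mean per section
```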

The following should do it. It uses a running store (tally) to cope with missing values, then writes the state out when it hits the end marker.

import re
import numpy as np

filename = "yourfilenamehere.txt"

# Matches lines such as: [e] 14957596088
match_line_re = re.compile(r"^\[([a-z])\]\s+(\d+)")

result = {
    'b':[],
    'd':[],
    }

# Template for one section: every wanted key starts out as NaN (missing)
tally_empty = dict.fromkeys(result, np.nan)

tally = dict(tally_empty)  # copy, so the template itself is never modified
with open(filename, 'r') as f:
    for line in f:
        if line.startswith('[END SECTION'):
            # Write accumulated data to the lists
            for k, v in tally.items():
                result[k].append(v)

            tally = dict(tally_empty)  # start the next section from a fresh copy

        else:
            # Map the items using regex
            m = match_line_re.search(line)
            if m:
                k, v = m.group(1), m.group(2)
                if k in tally:
                    tally[k] = float(v)  # float, so values can sit alongside NaN

b = np.array(result['b'], dtype=float)
d = np.array(result['d'], dtype=float)

Note that whatever keys are in the result dict definition will appear in the output.
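
To see the missing-value handling at work, the tally idea can be tried on a small in-memory sample in which the second section has no [d] line. This is a sketch under assumptions rather than a drop-in replacement: `extract`, `LINE_RE` and the shortened `sample` text are names and data made up for the illustration, and values are stored as floats so that NaN can mark a missing entry.

```python
import re

import numpy as np

# Matches lines such as "[b] 5313402937"; section markers do not match
# because they have more than one character between the brackets.
LINE_RE = re.compile(r"^\[([a-z])\]\s+(\d+)")

def extract(lines, keys=("b", "d")):
    """Return {key: float array}, one entry per section, NaN where a key is missing."""
    result = {k: [] for k in keys}
    tally = dict.fromkeys(keys, np.nan)
    for line in lines:
        if line.startswith("[END SECTION"):
            for k in keys:
                result[k].append(tally[k])
            tally = dict.fromkeys(keys, np.nan)  # fresh dict for the next section
            continue
        m = LINE_RE.match(line)
        if m and m.group(1) in tally:
            tally[m.group(1)] = float(m.group(2))
    return {k: np.array(v, dtype=float) for k, v in result.items()}

# Shortened sample: the second section deliberately lacks a [d] line.
sample = """[SECTION1a]
[b] 5313402937
[d] 882992596992
[SECTION1b]
243 62 184
[END SECTION1]
[SECTION2a]
[b] 5313402937
[SECTION2b]
243 62 184
[END SECTION2]"""

cols = extract(sample.splitlines())
print(cols["b"])
print(cols["d"])
```

Because both arrays always get exactly one entry per section, they stay aligned, and functions such as np.nanmean can be used to ignore the gaps. For a real file, passing the open file object instead of `sample.splitlines()` should work the same way, since the function only iterates over lines.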
