How to extract numbers from a text file with appropriate labels in Python

Question

boundary
        layer 2
        datatype 0
        xy  15   525270 8663518   525400 8663518   525400 8664818   525660 8664818
                 525660 8663518   525790 8663518   525790 8664818   526050 8664818
                 526050 8663518   526180 8663518   526180 8665398   525980 8665598
                 525470 8665598   525270 8665398   525270 8663518
        endel

I have coordinates of polygons in the format shown above. Each polygon starts with "boundary" and ends with "endel". I am having trouble extracting the layer number, the number of points, and the coordinates into either a numpy array or a pandas dataframe.

To be specific to this example, I need the layer number (2), the number of points (15), and the xy coordinate pairs.

with open('source1.txt', encoding="utf-8") as f:
    for line in f:
        line = f.readline()
        srs= line.split("\t")
        print(srs)

Doing this doesn't split the numbers, even though they are separated by tabs.

['        layer 255\n']
['        xy   5   0 0   22800000 0   22800000 22800000   0 22800000\n']
['        endel\n']

This is the result I got with that.

with open('source1.txt', encoding="utf-8") as f:
    for line in f:
        line = f.readline()
        srs= line.split(" ")
        print(srs)

This isn't what I wanted, but I tried it too and still got a bad split.

['', '', '', '', '', '', '', '', 'layer', '255\n']
['', '', '', '', '', '', '', '', 'xy', '', '', '5', '', '', '0', '0', '', '', '22800000', '0', '', '', '22800000', '22800000', '', '', '0', '22800000\n']
['', '', '', '', '', '', '', '', 'endel\n']

I couldn't get to the numpy part, as I'm stuck processing the strings from the file.

Edited as per request.

Answer 1

You could use some simple code such as:

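res = []      # one [layer, npoints, coords] entry per polygon
coords = []   # coordinate tokens of the polygon currently being read
xy = False    # True while inside a (possibly multi-line) xy record
with open('data.txt') as f:
    for line in f:
        if 'layer' in line:
            layer = int(line.split()[-1])
        elif 'xy' in line:
            arr = line.split()
            npoints = int(arr[1])
            coords = arr[2:]
            xy = True
        elif 'endel' in line:
            # npoints counts (x, y) pairs, so keep 2 * npoints tokens
            res.append([layer, npoints, coords[:2 * npoints]])
            xy = False
            coords = []
        elif xy:
            # continuation line of the xy record
            coords.extend(line.split())
print(res)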

Then you can convert the resulting list to a numpy array, or whatever you like, but note that the coords are still strings in the code above.
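As a rough sketch of that last step: calling split() with no argument splits on any run of whitespace, which is why it avoids the empty strings that split(" ") produced in the question. The helper name to_pairs below is made up for illustration, and the sample line is copied from the question's output.

```python
def to_pairs(tokens):
    """Convert a flat list of numeric strings into a list of (x, y) integer tuples."""
    values = [int(tok) for tok in tokens]
    # even positions are x values, odd positions are y values
    return list(zip(values[0::2], values[1::2]))


# Sample xy line taken from the question's output
line = '        xy   5   0 0   22800000 0   22800000 22800000   0 22800000\n'
tokens = line.split()          # no argument: split on any whitespace run
pairs = to_pairs(tokens[2:])   # skip the 'xy' label and the point count
print(pairs)
```

If NumPy is installed, passing pairs to numpy.array should give an array of shape (n, 2), one row per point.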

Answer 2

You can use a regex to split the file into blocks of the relevant data and then parse each block:

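import re

with open('source1.txt') as f:   # file name taken from the question
    # re.M lets ^ match at the start of every line, not only at the start of the file
    for block in re.findall(r'^boundary([\s\S]+?)endel', f.read(), re.M):
        m1 = re.search(r'^\s*layer\s+(\d+)', block, re.M)
        m2 = re.search(r'^\s*datatype\s+(\d+)', block, re.M)
        m3 = re.search(r'^\s*xy\s+(\d+)\s+([\s\d]+)', block, re.M)
        if m1 and m2 and m3:
            layer = int(m1.group(1))
            datatype = int(m2.group(1))
            xy = int(m3.group(1))   # declared number of points
            # pair up consecutive tokens as (x, y)
            coordinates = [(int(x), int(y)) for x, y in zip(*[iter(m3.group(2).split())] * 2)]
        else:
            print("can't parse {}".format(block))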

Any number of coordinates may follow xy, and it is trivial to check that the number of coordinate pairs parsed matches the expected count with len(coordinates) == xy.
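One way that check might look, as a minimal sketch: the function name parse_block and the choice to raise ValueError on a mismatch are assumptions for illustration, not part of the answer above. The sample block is a shortened, made-up polygon in the same layout as the question's data.

```python
import re


def parse_block(block):
    """Parse the text between 'boundary' and 'endel'; return (layer, npoints, coordinates)."""
    m_layer = re.search(r'^\s*layer\s+(\d+)', block, re.M)
    m_xy = re.search(r'^\s*xy\s+(\d+)\s+([\s\d]+)', block, re.M)
    if not (m_layer and m_xy):
        raise ValueError("can't parse {!r}".format(block))
    npoints = int(m_xy.group(1))
    numbers = [int(tok) for tok in m_xy.group(2).split()]
    coordinates = list(zip(numbers[0::2], numbers[1::2]))
    # the declared point count must match the number of (x, y) pairs found
    if len(coordinates) != npoints:
        raise ValueError('expected {} points, found {}'.format(npoints, len(coordinates)))
    return int(m_layer.group(1)), npoints, coordinates


# Shortened, made-up block in the same layout as the question's data
sample = """
        layer 2
        datatype 0
        xy  3   0 0   10 0
                 10 10
"""
print(parse_block(sample))
```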

As written, this requires reading the entire file into memory. If size is an issue (and it usually is not for small to moderate-sized files), you can use mmap to make the file appear to be in memory.
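A hedged sketch of the mmap variant follows. The function name iter_blocks is hypothetical, and the sketch assumes the file is plain ASCII. Because a memory-mapped file exposes bytes, the pattern has to be a bytes pattern, and each matched block is decoded before further parsing.

```python
import mmap
import re

# Bytes pattern: re can search an mmap object directly, but only with bytes patterns
BLOCK_RE = re.compile(rb'^boundary([\s\S]+?)endel', re.M)


def iter_blocks(path):
    """Yield the text between each 'boundary' and 'endel' without calling f.read()."""
    with open(path, 'rb') as f:
        # length 0 maps the whole file; note that mmap raises ValueError on an empty file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for match in BLOCK_RE.finditer(mm):
                # assumes ASCII content, as in the sample data
                yield match.group(1).decode('ascii')


# Example use (the file name is the one from the question):
# for block in iter_blocks('source1.txt'):
#     print(block)
```

Each yielded block can then be parsed with the same per-block regexes as before.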

Answer 1 (accepted): score 1, 2018-01-06 15:32:24

Answer 2: score 0, 2018-01-06 16:22:46
