How to extract numbers from a text file with appropriate labels in Python
boundary
layer 2
datatype 0
xy 15 525270 8663518 525400 8663518 525400 8664818 525660 8664818
525660 8663518 525790 8663518 525790 8664818 526050 8664818
526050 8663518 526180 8663518 526180 8665398 525980 8665598
525470 8665598 525270 8665398 525270 8663518
endel
I have coordinates of polygons in the format shown above. Each polygon starts with "boundary" and ends with "endel". I am having trouble extracting the layer number, the number of points, and the coordinates into either a NumPy array or a pandas DataFrame.
For this specific example, I need the layer number (2), the number of points (15), and the xy coordinate pairs.
with open('source1.txt', encoding="utf-8") as f:
    for line in f:
        line = f.readline()
        srs = line.split("\t")
        print(srs)
Doing this doesn't split the numbers, even though they are separated by tabs.
[' layer 255\n']
[' xy 5 0 0 22800000 0 22800000 22800000 0 22800000\n']
[' endel\n']
This is the result I got with that.
with open('source1.txt', encoding="utf-8") as f:
    for line in f:
        line = f.readline()
        srs = line.split(" ")
        print(srs)
This isn't what I wanted, but I tried that too and still got a bad split.
['', '', '', '', '', '', '', '', 'layer', '255\n']
['', '', '', '', '', '', '', '', 'xy', '', '', '5', '', '', '0', '0', '', '', '22800000', '0', '', '', '22800000', '22800000', '', '', '0', '22800000\n']
['', '', '', '', '', '', '', '', 'endel\n']
I couldn't get to the NumPy part, as I'm stuck processing the strings from the file.
Edited as per request.
You could use some simple code such as:
res = []
coords = []
xy = False
with open('data.txt') as f:
    for line in f:
        if 'layer' in line:
            arr = line.split()
            layer = int(arr[-1])
        elif 'xy' in line:
            arr = line.split()
            npoints = int(arr[1])
            coords = arr[2:]
            xy = True  # following lines continue this coordinate list
        elif 'endel' in line:
            # Each point is an x, y pair, so keep 2 * npoints values.
            res.append([layer, npoints, coords[:2 * npoints]])
            xy = False
            coords = []
        elif xy:
            coords.extend(line.split())
print(res)
Then you can convert the resulting list to a NumPy array, or whatever you like, but note that coords are still strings in the code above.
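As a minimal sketch of that conversion step, assuming each entry of `res` has the form `[layer, npoints, coords]` and that `coords` holds all `2 * npoints` numeric strings in x, y order: the helper name `to_rows` and the sample entry below are made up for illustration.

```python
def to_rows(res):
    """Flatten [layer, npoints, coords] entries into (layer, x, y) rows."""
    rows = []
    for layer, npoints, coords in res:
        values = [int(v) for v in coords]
        # Even positions are x values, odd positions are y values.
        pairs = list(zip(values[0::2], values[1::2]))
        if len(pairs) != npoints:
            raise ValueError(
                f"layer {layer}: expected {npoints} points, got {len(pairs)}"
            )
        rows.extend((layer, x, y) for x, y in pairs)
    return rows


# Hypothetical entry: a closed square with 5 points on layer 255.
sample = [[255, 5, "0 0 10 0 10 10 0 10 0 0".split()]]
rows = to_rows(sample)
```

The resulting list of integer tuples can then be passed to `numpy.array(rows)` or to `pandas.DataFrame(rows, columns=['layer', 'x', 'y'])`.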
You can use a regex to parse that file into blocks of the relevant data, then parse each block:
import re

with open('source1.txt', encoding='utf-8') as f:
    text = f.read()

# Each block runs from a "boundary" line to the next "endel".
for block in re.findall(r'^\s*boundary([\s\S]+?)endel', text, re.M):
    m1 = re.search(r'^\s*layer\s+(\d+)', block, re.M)
    m2 = re.search(r'^\s*datatype\s+(\d+)', block, re.M)
    m3 = re.search(r'^\s*xy\s+(\d+)\s+([\s\d]+)', block, re.M)
    if m1 and m2 and m3:
        layer = int(m1.group(1))
        datatype = int(m2.group(1))
        xy = int(m3.group(1))
        # Pair up consecutive values as (x, y) tuples.
        coordinates = [(int(x), int(y)) for x, y in zip(*[iter(m3.group(2).split())] * 2)]
    else:
        print("can't parse {}".format(block))
A variable number of coordinates is supported after the xy keyword, and it is trivial to test whether the number of coordinates parsed is the number expected with len(coordinates) == xy.
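To illustrate that check, here is a small sketch that pairs up the values of the first xy record from the question and compares the count with the declared number. The function name `pair_up` and the variable names are hypothetical, not part of the answer's code.

```python
def pair_up(tokens):
    """Group a flat token list into (x, y) integer tuples."""
    it = iter(tokens)
    # zip pulls from the same iterator twice per tuple,
    # so consecutive tokens end up paired.
    return [(int(x), int(y)) for x, y in zip(it, it)]


record = """xy 15 525270 8663518 525400 8663518 525400 8664818 525660 8664818
525660 8663518 525790 8663518 525790 8664818 526050 8664818
526050 8663518 526180 8663518 526180 8665398 525980 8665598
525470 8665598 525270 8665398 525270 8663518"""

# First token is the keyword, second is the declared point count.
_, declared, *tokens = record.split()
coordinates = pair_up(tokens)
count_matches = len(coordinates) == int(declared)
```

For this record `count_matches` is true, and the first and last pairs are identical because the polygon is closed.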
As written, this requires reading the entire file into memory. If size is an issue (and it usually is not for small to moderate size files), you can use mmap to make the file appear to be in memory.
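A rough sketch of the mmap variant, under the assumption that only the layer numbers are needed: the function name `layers_in_file` is made up for illustration, and the patterns are bytes patterns because a memory-mapped file is exposed as bytes rather than text.

```python
import mmap
import re

# Bytes patterns, because an mmap object exposes the file as bytes.
BLOCK = re.compile(rb'^\s*boundary([\s\S]+?)endel', re.M)
LAYER = re.compile(rb'^\s*layer\s+(\d+)', re.M)


def layers_in_file(path):
    """Return the layer number of every boundary block in the file at path."""
    layers = []
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; mapping an empty file raises an error.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for block in BLOCK.finditer(mm):
                m = LAYER.search(block.group(1))
                if m:
                    layers.append(int(m.group(1)))
    return layers
```

Using `finditer` here means blocks are handled one at a time instead of being collected into a list first.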