
How to do a multi-line string search in a file and get start line, end line info in Python?

I want to search for a multi-line string in a file in Python. If there is a match, I want to get the start line number, end line number, start column and end column of the match. For example, in the file below:

[image]

I want to match the following multi-line string:

pattern = """b'0100000001685c7c35aabe690cc99f947a8172ad075d4401448a212b9f26607d6ec5530915010000006a4730'
           b'440220337117278ee2fc7ae222ec1547b3a40fa39a05f91c1e19db60060541c4b3d6e4022020188e1d5d843c'"""

The result of the match should be: start_line: 2, end_line: 3, start_column: 23 and end_column: 114.

The start column is the index, within its line, of the first matched character of the pattern, and the end column is the index, within its line, of the last matched character. The end column is shown below:

[image]

I tried the re package of Python, but it returns None because it could not find any match.

import re

pattern = """b'0100000001685c7c35aabe690cc99f947a8172ad075d4401448a212b9f26607d6ec5530915010000006a4730'
           b'440220337117278ee2fc7ae222ec1547b3a40fa39a05f91c1e19db60060541c4b3d6e4022020188e1d5d843c'"""

with open("test.py") as f:
  content = f.read()
  print(re.search(pattern, content))

I can find the location metadata for a single-line string match in a file using:

with open("test.py") as f:
  # iterate over the file line by line; looping over f.read() would yield single characters
  for n, line in enumerate(f):
    match_index = line.find(pattern)
    if match_index != -1:
      print("Start Line:", n + 1)
      print("End Line", n + 1)
      print("Start Column:", match_index)
      print("End Column:", match_index + len(pattern) + 1)
      break

But I am struggling to make it work for multi-line strings. How can I match multi-line strings in a file and get the location metadata of the match in Python?

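You can search across multiple lines with a pattern that contains \n. The example below also passes the re.MULTILINE flag; note that this flag only changes how ^ and $ match, so it is not strictly required for a pattern like this one.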

import re
pattern = r"(c\nd)"
string = """
a
b
c
d
e
f
"""

match = re.search(pattern, string, flags=re.MULTILINE)
print(match)

To get the start line, you could count the newline characters before the match as follows (this gives a zero-based line index):

start, stop = match.span()
start_line = string[:start].count('\n')

You could do the same for the end_line, or, if you know how many lines your pattern spans, you can just add that to the start line to avoid counting twice.
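The newline-counting idea can be wrapped in a small helper that turns a match span into line and column numbers. This is only a minimal sketch: the name span_to_position and the conventions used (1-based lines, 0-based columns, end position pointing at the last matched character) are assumptions, not something stated in the answer, so adjust them to the numbering you need.

```python
import re


def span_to_position(text, start, stop):
    """Convert a half-open character span [start, stop) into line/column info.

    Assumed conventions (hypothetical, not from the original post):
    lines are 1-based, columns are 0-based, and the end position refers
    to the last matched character rather than one past it.
    """
    last = stop - 1  # index of the last matched character
    start_line = text.count("\n", 0, start) + 1
    end_line = text.count("\n", 0, last) + 1
    # rfind returns -1 when there is no earlier newline, so on the first
    # line the column is simply the character index.
    start_column = start - (text.rfind("\n", 0, start) + 1)
    end_column = last - (text.rfind("\n", 0, last) + 1)
    return start_line, end_line, start_column, end_column


string = "\na\nb\nc\nd\ne\nf\n"
match = re.search(r"c\nd", string)
print(span_to_position(string, *match.span()))
```

Because both line numbers are derived from the same span, the pattern does not need to know in advance how many lines it covers.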

To also get the start column, you can inspect the line itself, or a pure regex solution could look like this:

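# raw string so that \s and \n reach the regex engine unchanged
pattern = r"(?:.*\n)*(\s*(c\s*\n\s*d)\s*)"
match = re.match(pattern, string, flags=re.MULTILINE)
# group 1 begins at the start of a line, group 2 at the first matched character
start_column = match.start(2) - match.start(1)
start_line = string[:match.start(1)].count('\n')
print(start_line, start_column)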
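When the text to find is fixed rather than a real pattern, the whitespace-tolerant regex above can be generated instead of written by hand. The sketch below assumes that the search in the question fails because the indentation in the file differs from the indentation in the pattern string; that is a guess, since the file is only shown as an image. The helper name build_block_regex and the sample content are hypothetical.

```python
import re


def build_block_regex(block):
    """Compile a regex that matches the lines of `block` literally.

    Each line is stripped and escaped; consecutive lines may be separated
    by one newline plus any amount of indentation (spaces or tabs).
    """
    lines = [re.escape(line.strip()) for line in block.splitlines()]
    return re.compile(r"[ \t]*\n[ \t]*".join(lines))


# Hypothetical file content and search text, for illustration only.
content = "x = 1\ndata = (b'aa'\n        b'bb')\ny = 2\n"
needle = """b'aa'
           b'bb'"""

match = build_block_regex(needle).search(content)
print(match.span())
```

The resulting span can then be converted to line and column numbers by counting newlines, as described earlier in this answer.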

However, I think difflib could be more useful here.
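The answer does not say how difflib would be used, so the following is only one possible reading: use SequenceMatcher.find_longest_match to locate the largest block that the file content and the search text have in common, and treat it as a hit when it covers the whole search text. The sample content and needle are made up for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical sample data, not taken from the question.
content = "header\nabc xyz\n  12345678\n  abcdefgh\nfooter\n"
needle = "12345678\n  abcdefgh"

# autojunk=False turns off the heuristic that ignores very frequent characters.
matcher = SequenceMatcher(None, content, needle, autojunk=False)
block = matcher.find_longest_match(0, len(content), 0, len(needle))

# The needle occurs verbatim only if the longest common block covers all of it.
if block.size == len(needle):
    start = block.a
    line = content.count("\n", 0, start) + 1              # 1-based line
    column = start - (content.rfind("\n", 0, start) + 1)  # 0-based column
    print(line, column)
else:
    print("no exact match; best partial overlap has length", block.size)
```

For strictly exact matches, str.find on the whole content would be simpler and faster; difflib mainly pays off when a partial or approximate overlap is still useful to report.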

Alternative Solution

Below is a more creative solution to your problem. You are interested in the row and column position of some sample text (not a pattern, but a fixed text) in a larger text. This problem reminds me a lot of image registration; see https://en.wikipedia.org/wiki/Digital_image_correlation_and_tracking for a short introduction or https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.correlate2d.html for a more sophisticated example.

import os
from itertools import zip_longest

import numpy as np

text = """Some Title
abc xyz ijk
  12345678
  abcdefgh
xxxxxxxxxxx
012345678
abcabcabc
yyyyyyyyyyy
"""
template = (
    "12345678",
    "abcdefgh"
)
moving = np.array([
    [ord(char) for char in line]
    for line in template
])
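# splitlines() works regardless of platform; os.linesep is "\r\n" on Windows and would not split this literal
lines = text.splitlines()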
values = [
    [ord(char) for char in line]
    for line in lines
]
# use zip longest, to pad array with fill value
reference = np.array(list(zip_longest(*values, fillvalue=0))).T
windows = np.lib.stride_tricks.sliding_window_view(reference, moving.shape)
# get a distance matrix
distance = np.linalg.norm(windows - moving, axis=(2, 3))
# find the minimum and return its index location
row, column = np.unravel_index(np.argmin(distance), distance.shape)
print(row, column)
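For comparison, the same "template at a common column" search can be sketched without NumPy when only exact matches matter. Like the array version, it assumes every template line starts at the same column, which is not the case for the indented pattern in the question. The function name find_block is hypothetical, and rows and columns are zero-based, as in the NumPy output.

```python
def find_block(lines, template):
    """Return (row, column) of the first place where every template line
    occurs at the same column on consecutive lines, or None if absent."""
    height = len(template)
    for row in range(len(lines) - height + 1):
        column = lines[row].find(template[0])
        while column != -1:
            if all(
                lines[row + k].startswith(template[k], column)
                for k in range(1, height)
            ):
                return row, column
            # try the next occurrence of the first template line on this row
            column = lines[row].find(template[0], column + 1)
    return None


text = """Some Title
abc xyz ijk
  12345678
  abcdefgh
xxxxxxxxxxx
"""
template = ("12345678", "abcdefgh")

row, column = find_block(text.splitlines(), template)
end_row = row + len(template) - 1
end_column = column + len(template[-1]) - 1  # index of the last matched character
print(row, column, end_row, end_column)
```

Unlike the distance-based version, this returns None when the block is missing instead of reporting the closest window, so it cannot be used for approximate matching.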
