
Regex: Start of string does not behave as expected

I am parsing, with a regex, a .map file generated by a linker for ARM. I have extracted pretty much everything, but this one section is resisting.

Here is an excerpt of the part I want to parse:

COMMON         0x20002b18        0x1 ./2_Programa/source/board.o
               0x20002b18                BOARD_ctx
COMMON         0x20002b19       0x87 ./2_Programa/source/interface_objects.o
               0x20002b19                GLB_appIntObjPropChangeFlags
               0x20002b1a                GLB_aioBLCommand
               0x20002b65                GLB_aioDateTime
COMMON         0x20002ba0       0x31 ./2_Programa/source/objects.o
               0x20002ba0                GLB_goFlags
*fill*         0x20002bd1        0x3 

and this is my best regex attempt:

^ COMMON\s+(0x\S+)\s+(0x\S+).*(?:\s+(0x\S+)\s+(\S+)[\r\n])*(?:\s+\*fill\*\s+0x\S+\s+(0x\S+))?

The result can be checked here. The result I get only matches the last line of each block (I consider a block to be everything from one COMMON line up to the next).

What I need to extract is something similar to this:

[{
    'name': 'GLB_appIntObjPropChangeFlags',
    'size': 0x01,
    'path': './2_Programa/source/interface_objects.o',
    'origin': 0x20002b19
},
{
    'name': 'GLB_aioBLCommand',
    'size': 0x87,
    'path': './2_Programa/source/interface_objects.o',
    'origin': 0x20002b1a
},
...
]

My main problem here is that I am not able to separate the first line

COMMON 0x20002b19 0x87 ./2_Programa/source/interface_objects.o

from the lines that belong to it

        0x20002b19                GLB_appIntObjPropChangeFlags
        0x20002b1a                GLB_aioBLCommand
        0x20002b65                GLB_aioDateTime

Could anyone give me some hints on how to tackle this?

UPDATE

What I would like to do is split every block (those that start with COMMON) into two parts. Group 1:

COMMON 0x20002b19 0x87 ./2_Programa/source/interface_objects.o

and Group 2:

        0x20002b19                GLB_appIntObjPropChangeFlags
        0x20002b1a                GLB_aioBLCommand
        0x20002b65                GLB_aioDateTime

Then I could run a regex on each group separately:

Regex for Group 1:

^ COMMON\s+(0x\S+)\s+(\S+)\s+(\S+)

and this one for Group 2 (with the multiline flag set):

^\s+(0x\S+)\s+(\S+)

As a result I would get three groups from the first regex and another six from the second (2 per line, for 3 lines), which could easily be converted into a list of dicts like the one I showed above.
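The two-step idea above can be sketched in Python roughly as follows. This is only a sketch under a few assumptions: the names `HEADER_RE`, `SYMBOL_RE` and `parse_common_blocks` are made up for illustration, the patterns use `[ \t]` instead of `\s` so that a match cannot run across a line break, and `size` holds the size of the whole COMMON block because the excerpt gives no per-symbol size.

```python
import re

MAP_TEXT = """\
COMMON         0x20002b18        0x1 ./2_Programa/source/board.o
               0x20002b18                BOARD_ctx
COMMON         0x20002b19       0x87 ./2_Programa/source/interface_objects.o
               0x20002b19                GLB_appIntObjPropChangeFlags
               0x20002b1a                GLB_aioBLCommand
               0x20002b65                GLB_aioDateTime
COMMON         0x20002ba0       0x31 ./2_Programa/source/objects.o
               0x20002ba0                GLB_goFlags
*fill*         0x20002bd1        0x3
"""

# Group 1: the COMMON header line (block origin, block size, object path).
HEADER_RE = re.compile(
    r"^[ \t]*COMMON[ \t]+(0x[0-9a-fA-F]+)[ \t]+(0x[0-9a-fA-F]+)[ \t]+(\S+)",
    re.MULTILINE,
)
# Group 2: one indented symbol line (address, symbol name).
SYMBOL_RE = re.compile(
    r"^[ \t]+(0x[0-9a-fA-F]+)[ \t]+(\S+)[ \t]*$",
    re.MULTILINE,
)


def parse_common_blocks(text):
    """Return one dict per symbol found under a COMMON header."""
    headers = list(HEADER_RE.finditer(text))
    entries = []
    for i, header in enumerate(headers):
        # The block body runs from the end of this header line
        # to the start of the next header (or the end of the text).
        body_end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[header.end():body_end]
        _block_origin, block_size, path = header.groups()
        for address, name in SYMBOL_RE.findall(body):
            entries.append({
                "name": name,
                "size": int(block_size, 16),  # assumption: size of the whole block
                "path": path,
                "origin": int(address, 16),
            })
    return entries


for entry in parse_common_blocks(MAP_TEXT):
    print(entry)
```

The `*fill*` line is not indented, so the symbol pattern skips it without any special handling.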

Brief

Realistically, you should grab each COMMON block, as Wiktor Stribiżew mentioned in the comments under your question. Link to Wiktor's regex here. Regex is not designed to loop over a sub-pattern and hand back the captures of every iteration (that's not its purpose).

Impractically, you can use this regex to grab each COMMON header and the lines that follow it, and then map the matches.


Code

See regex in use here

(?:COMMON\s+0x[0-9a-f]+\s+(0x[0-9a-f]+)\s+(\S+)|\s*(0x[0-9a-f]+)\s+(\S+))(?=\s*[\r\n])

Explanation

  • COMMON\s+0x[0-9a-f]+\s+(0x[0-9a-f]+)\s+(\S+) Option 1
    • COMMON\s+0x[0-9a-f]+\s+
      • COMMON The characters COMMON literally
      • \s+ One or more whitespace characters
      • 0x The characters 0x literally
      • [0-9a-f]+ One or more of the characters in the set 0-9a-f
      • \s+ One or more whitespace characters
    • (0x[0-9a-f]+) Capture the following into capture group 1
      • 0x The characters 0x literally
      • [0-9a-f]+ One or more of the characters in the set 0-9a-f
    • \s+ One or more whitespace characters
    • (\S+) Capture one or more non-whitespace characters into capture group 2
  • \s*(0x[0-9a-f]+)\s+(\S+) Option 2
    • \s* Any number of whitespace characters
    • (0x[0-9a-f]+) Capture the following into capture group 3
      • 0x The characters 0x literally
      • [0-9a-f]+ One or more of the characters in the set 0-9a-f
    • \s+ One or more whitespace characters
    • (\S+) Capture one or more non-whitespace characters into capture group 4
  • (?=\s*[\r\n]) Ensure what follows is any number of whitespace characters, followed by a line-break character (\r or \n)
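For readers using Python, the breakdown above can be mirrored with the `re.VERBOSE` flag, which lets each comment sit next to the piece it describes. This is just a sketch: the `PATTERN` and `SAMPLE` names are illustrative, and the sample is two lines taken from the question's excerpt.

```python
import re

PATTERN = re.compile(
    r"""
    (?:
        COMMON\s+0x[0-9a-f]+\s+   # literal COMMON, then the block address, not captured
        (0x[0-9a-f]+)             # group 1: block size
        \s+
        (\S+)                     # group 2: object file path
      |
        \s*
        (0x[0-9a-f]+)             # group 3: symbol address
        \s+
        (\S+)                     # group 4: symbol name
    )
    (?=\s*[\r\n])                 # only optional whitespace may follow before the line break
    """,
    re.VERBOSE,
)

SAMPLE = (
    "COMMON         0x20002b19       0x87 ./2_Programa/source/interface_objects.o\n"
    "               0x20002b19                GLB_appIntObjPropChangeFlags\n"
)

# Each match fills either groups 1-2 (header line) or groups 3-4 (symbol line).
for match in PATTERN.finditer(SAMPLE):
    print(match.groups())
```

Because the lookahead requires a line break, the last line of the input needs a trailing newline to be matched.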

Usage

Based on the order of the matches and the groups they fill, you can map them to an array like the one you've presented.

For example (in match order):

  • First set
    • Group 1 0x1
    • Group 2 ./2_Programa/source/board.o
    • Group 3 0x20002b18
    • Group 4 BOARD_ctx
  • Second set
    • Group 1 0x87
    • Group 2 ./2_Programa/source/interface_objects.o
    • Group 3 0x20002b19
    • Group 4 GLB_appIntObjPropChangeFlags
    • Group 3 0x20002b1a
    • Group 4 GLB_aioBLCommand
    • Group 3 0x20002b65
    • Group 4 GLB_aioDateTime
  • etc.

Always associate the most recent match for groups 1 and 2 with the current match for groups 3 and 4.
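A minimal Python sketch of that mapping, assuming the excerpt from the question as input. The function name `map_matches` is made up for illustration, and `size` is the size of the enclosing COMMON block. One caveat I noticed while tracing the pattern: the `*fill*` line also satisfies the second alternative (its address lands in group 3 and its size in group 4), so the sketch drops any match whose group 4 starts with `0x`.

```python
import re

MAP_TEXT = """\
COMMON         0x20002b18        0x1 ./2_Programa/source/board.o
               0x20002b18                BOARD_ctx
COMMON         0x20002b19       0x87 ./2_Programa/source/interface_objects.o
               0x20002b19                GLB_appIntObjPropChangeFlags
               0x20002b1a                GLB_aioBLCommand
               0x20002b65                GLB_aioDateTime
COMMON         0x20002ba0       0x31 ./2_Programa/source/objects.o
               0x20002ba0                GLB_goFlags
*fill*         0x20002bd1        0x3
"""

PATTERN = re.compile(
    r"(?:COMMON\s+0x[0-9a-f]+\s+(0x[0-9a-f]+)\s+(\S+)"
    r"|\s*(0x[0-9a-f]+)\s+(\S+))(?=\s*[\r\n])"
)


def map_matches(text):
    """Pair each symbol match with the most recent COMMON header match."""
    entries = []
    size = path = None  # last seen values of group 1 and group 2
    for match in PATTERN.finditer(text):
        block_size, block_path, address, name = match.groups()
        if block_size is not None:
            # Header line: remember it for the symbol lines that follow.
            size, path = int(block_size, 16), block_path
        elif path is not None and not name.startswith("0x"):
            # Symbol line; the startswith check skips the *fill* line.
            entries.append({
                "name": name,
                "size": size,
                "path": path,
                "origin": int(address, 16),
            })
    return entries


for entry in map_matches(MAP_TEXT):
    print(entry)
```

Keeping the "last header" in two local variables is the whole trick; the regex itself never relates a symbol line to its header.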
