
How to validate a formatSpec string in MATLAB?

Many functions in MATLAB (and in other languages with a C-derived scanf / printf) that write or read strings (to name a few: sscanf, sprintf, textscan) rely on the user supplying a valid formatSpec string, which tells the function the structure of the string to build or the string to parse. I'm looking for a way to validate such a formatSpec string before using it in a call to sprintf.

In the case of sprintf, the structure of formatSpec is described in the documentation and is as follows:

[Image: structure of the MATLAB sprintf formatSpec]

Specifically, I'd like to point out two aspects of formatSpec:

  • (✓) A formatting operator starts with a percent sign, %, and ends with a conversion character.
  • (x) formatSpec can also include additional text before a percent sign, %, or after a conversion character.

The solution I was thinking about involves using a regular expression to test the passed-in string. What I have so far is an expression that seems to match everything between the initial % and the conversion character, but not the "additional text" that may appear.

(%{1}(\d+\$)?[-+\s0#]*(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+

I also wanted to add the ability to capture "any printable text characters besides %, ' and \, unless these characters appear exactly twice". This needs to be captured both before the initial % and after the conversion character.

  • any printable character: [ -~]
  • besides %, ' and \: (?![\\%'])
  • these characters appear exactly twice: ( §§§§ |'{2}|\\{2}|%{2}) (§ = placeholder)

I am having a problem with the "unless" part, that is, getting the negative look-ahead to discard single occurrences but allow double occurrences of the specified characters.

I have two questions:

  1. Is there a better way to validate formatSpec strings (i.e. without regex, or using a better one)?
  2. How do I fix my regex so that it works as described?

Disambiguation note: in the case of a formatSpec string that has "free text" on both sides of formatting operators, the text should be considered a part of the next formatting operator, unless there are none left. Below is an example of how a formatSpec string should be split using the regex (where | marks the first character of each match):

Color %s, number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.
|       |           |             |        |            |                   |

I've spent a bit of time on this, and I think I'm close, so I'll write up my current progress in an answer. I'm fairly certain it could still be improved.

First, the code, using the nice example string by Ro Yo Mi:

% valid input
sample_good = 'Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';
% invalid input: "%02 droids" has a single percent sign which is not part of an operator
sample_bad = 'Color %s, we are looking for %02 droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';

group_from = '(';
group_to = ')';
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';

% pattern for full validation
full_patt = ['^' group_from printable atomic_op group_to '*' printable '$'];
% pattern for splitting valid strings
part_patt = [printable atomic_op];

% examples
matches_full_bad = regexp(sample_bad,full_patt);            % no match
matches_full_good = regexp(sample_good,full_patt);          % match
matches_parts_good = regexp(sample_good,part_patt,'match'); % sliced matches

The first example string is valid; the second is broken because it contains %02 droids. I defined a few auxiliary patterns; note that most of these already have groups in them. The printable pattern accepts every printable ASCII character except % and \, plus the doubled forms %% and \\. Note that in order to match a double backslash, we need four backslashes (two escaped backslashes) in the search expression.
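To see what the printable pattern accepts on its own, it can be anchored at both ends and tried on a few short strings. This is only a small sketch: the helper name is_free_text is made up for illustration, and the pattern is copied unchanged from the code above.

```matlab
% printable is copied from the answer's code; is_free_text is a
% hypothetical helper introduced only for this illustration.
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
is_free_text = @(s) ~isempty(regexp(s, ['^' printable '$'], 'once'));

disp(is_free_text('100%% sure'))   % doubled percent sign: accepted
disp(is_free_text('100% sure'))    % lone percent sign: rejected
disp(is_free_text('C:\\temp'))     % doubled backslash: accepted
disp(is_free_text('C:\temp'))      % lone backslash: rejected
```

Because plain characters and the doubled forms are simply alternatives inside one repeated group, no look-ahead is needed to express "only if doubled".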

What I call atomic_op is a pattern that matches format operators starting with a percent sign and ending with a conversion character. It uses a negative lookbehind to avoid matching fake format operators starting with %%. I took some shortcuts out of laziness (for example, te is accepted by my version). It should be quite functional for not-too-evil inputs.
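As a quick illustration of atomic_op in isolation, the sketch below extracts the operators from a shortened variant of the sample string (the variant is made up for this example; the pattern is copied from the code above).

```matlab
% atomic_op is copied from the answer's code; the input string is a
% shortened, hypothetical variant of the sample used in this answer.
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';

ops = regexp('Color %s, %%02droids, number2 %05d, float %5.2f', ...
             atomic_op, 'match');
disp(ops)     % the %%02droids part contributes no operator

% One of the shortcuts mentioned above: subtype t followed by e is accepted.
shortcut = regexp('%te', atomic_op, 'match');
disp(shortcut)
```

The first % of %%02droids is not followed by a conversion character and the second one is rejected by the lookbehind, so only %s, %05d and %5.2f are returned.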

The most important parts are full_patt and part_patt. The former tries to match a full format spec in order to determine whether it's valid. Unfortunately, in the case of nested groups MATLAB only stores tokens for the outermost level; in our case this would not be useful. This is where part_patt comes into play: it only matches "printable_string format_operator". Used together with full_patt, it can slice your full string into meaningful pieces. Note that part_patt will often match an invalid string too, at its locally valid positions, so the two really have to be used together.

Consider the specific example above:

>> matches_full_bad

matches_full_bad =

     []

>> matches_full_good

matches_full_good =

     1

>> matches_parts_good{:}

ans =

Color %s


ans =

, we are looking for %%02droids %% number1 %d


ans =

, number2 %05d


ans =

, hex %#x


ans =

, float %5.2f


ans =

, unsigned value %u

Let's analyse the results. For the "bad" string, the full pattern returns a (falsy) empty array, while for the "good" string it returns a (truthy) 1. The partial pattern then correctly returns each relevant piece of the input. Note, however, that the final period at the end of the sentence is missing from the result, since we matched blocks of printable atomic_op. Since we know that we're working with a valid string, the rest of the string (after the final match) should be assigned either to a new match or to the final one, depending on your preference.

Just for clarity, here's how I imagine this working:

for sample={sample_bad,sample_good},
    if regexp(sample{1},full_patt)
        disp('Match found!');
        matches = regexp(sample,part_patt,'match');
        matches = matches{1};   % strip outermost singleton cell dimension
        for k=1:length(matches)
            fprintf('Format substring #%d: %s\n',k, matches{k});
        end
        %TODO: treat final printable part of the string
    else
        disp('Uh-oh, no match!')
    end
end
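One possible way to handle the TODO in the loop above is to also ask regexp for the end index of each match and treat whatever follows the last match as the trailing text. The sketch below is only one option: the variable names are made up, the sample string is shortened, and gluing the tail onto the final piece (rather than making it a new piece) is an arbitrary choice.

```matlab
% Patterns copied from the answer's code.
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';
full_patt = ['^(' printable atomic_op ')*' printable '$'];
part_patt = [printable atomic_op];

% Shortened, hypothetical sample string ending in free text (the period).
spec = 'Color %s, number2 %05d, float %5.2f, unsigned value %u.';

parts = {};
if ~isempty(regexp(spec, full_patt, 'once'))
    % Outputs come back in the order the options are listed.
    [parts, stops] = regexp(spec, part_patt, 'match', 'end');
    if isempty(stops)
        parts = {spec};                   % no operators: all free text
    else
        tail = spec(stops(end)+1:end);    % text after the last operator
        parts{end} = [parts{end} tail];   % assumption: append to last piece
    end
end

for k = 1:numel(parts)
    fprintf('Format substring #%d: %s\n', k, parts{k});
end
```

With this choice, concatenating all pieces reproduces the original formatSpec string, which is a handy sanity check.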

Description

((?:[ -$&(-[\]-~]|([%'\\])\2)*(%(\d+\$)?[-+\s0#]?(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+(?:(?!(?:[ -$&(-[\]-~]|([%'\\])\7)*(?:%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+)+)(?:[ -$&(-[\]-~]|([%'\\])\8)*)?)

[Image: regular expression visualization]

This regular expression will do the following:

  • (?:[ -$&(-[\]-~]|([%'\\])\2)* matches all printable characters from space to ~, except %, \ and ', unless they appear exactly twice
  • (%(\d+\$)?[-+\s0#]?(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+ is your expression
  • (?: starts the non-capture group
    • (?!(?:[ -$&(-[\]-~]|([%'\\])\7)*(?:%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+)+) looks ahead to see if there are more format strings
    • (?:[ -$&(-[\]-~]|([%'\\])\8)*)? if there weren't more format strings above, then this captures the remaining printable characters
    • ) ends the outer capture group

Example

Live Demo

https://regex101.com/r/sV4eX3/2

Sample text

Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.

Sample Matches

MATCH 1
1.  [0-8]   `Color %s`
3.  [6-8]   `%s`

MATCH 2
1.  [8-53]  `, we are looking for %%02droids %% number1 %d`
2.  [40-41] `%`
3.  [51-53] `%d`

MATCH 3
1.  [53-67] `, number2 %05d`
3.  [63-67] `%05d`
5.  [65-66] `5`

MATCH 4
1.  [67-76] `, hex %#x`
3.  [73-76] `%#x`

MATCH 5
1.  [76-89] `, float %5.2f`
3.  [84-89] `%5.2f`
5.  [85-86] `5`
6.  [86-88] `.2`

MATCH 6
1.  [89-109]    `, unsigned value %u.`
3.  [106-108]   `%u`
