How to validate a formatSpec string in MATLAB?
Many of the functions in MATLAB (and in other languages that have a C-derived `scanf`/`printf`) used for writing or reading strings (to name a few: `sscanf`, `sprintf`, `textscan`) rely on the user supplying a valid `formatSpec` string, which tells the function the structure of the string to build or the string to parse. I'm looking for a way to validate such a `formatSpec` string before using it in a call to `sprintf`.
In the case of `sprintf`, the structure of `formatSpec` is described in the documentation. Specifically, I'd like to point out two aspects of `formatSpec`:
- (✓) A formatting operator starts with a percent sign, `%`, and ends with a conversion character.
- (x) `formatSpec` can also include additional text before a percent sign, `%`, or after a conversion character.
The solution I was thinking about involves using a regular expression to test the passed-in string. What I have so far is an expression that seems to be able to match everything between the initial `%` and the conversion character, but not the "additional text" that may appear:
(%{1}(\d+\$)?[-+\s0#]*(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+
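To make the gap concrete, here is a minimal sketch that applies the expression above to a short string. The sample string and variable names are made up for illustration; the `'match'` and `'split'` options of `regexp` return the matched operators and the text between them, respectively.

```matlab
% Operator-only pattern from above, applied to a hypothetical sample string.
op_patt = '(%{1}(\d+\$)?[-+\s0#]*(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+';
sample  = 'Color %s, number1 %d, number2 %05d.';

% 'match' returns the matched substrings, 'split' the text between them.
[ops, free_text] = regexp(sample, op_patt, 'match', 'split');

disp(ops);        % the formatting operators only
disp(free_text);  % the "additional text" that the pattern does not capture
```

Here `ops` should contain only `%s`, `%d` and `%05d`, while everything else ends up in `free_text`, which is exactly the part the expression cannot validate yet.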
I also wanted to add the ability to capture "any printable text characters besides `%`, `'` and `\`, unless these characters appear exactly twice". This needs to be captured both before the initial `%` and after the conversion character.
Here is what I have so far:
- Printable ASCII characters: `[ -~]`
- Excluding `%`, `'` and `\`: `(?![\\%'])`
- Allowing doubled occurrences: `( §§§§ |'{2}|\\{2}|%{2})` (§ = placeholder)

I am having a problem with the "unless", that is, getting the negative lookahead to discard single occurrences but allow double occurrences of the specified characters.
Is there a better way to validate `formatSpec` strings (i.e. without a regex, or using a better one)?

Disambiguation note: in the case of a `formatSpec` string that has "free text" on both sides of formatting operators, the text should be considered a part of the next formatting operator, unless there are none left. Below is an example of how a `formatSpec` string should be split using the regex (where `|` is the first char of each match):
Color %s, number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.
|       |           |             |        |            |                  |
I've spent a bit of time on this, and I think I'm close, so I'll write up my current progress in an answer. I'm fairly certain it could still be improved.

First, the code, using the nice example string by Ro Yo Mi:
% valid input
sample_good = 'Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';
% invalid input: "%02 droids" has a single percent sign which is not part of an operator
sample_bad = 'Color %s, we are looking for %02 droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';
group_from = '(';
group_to = ')';
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';
% pattern for full validation
full_patt = ['^' group_from printable atomic_op group_to '*' printable '$'];
% pattern for splitting valid strings
part_patt = [printable atomic_op];
% examples
matches_full_bad = regexp(sample_bad,full_patt); % no match
matches_full_good = regexp(sample_good,full_patt); % match
matches_parts_good = regexp(sample_good,part_patt,'match'); % sliced matches
The first example string is valid; the second is broken because `%02 droids` is part of the string. I defined a few auxiliary patterns; note that most of these already contain groups. The `printable` pattern matches every printable ASCII character except `%` and `\`, plus the pairs `%%` and `\\`. Note that in order to match a double backslash, we need four backslashes (two escaped backslashes) in the search expression.
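As a quick check of the `printable` pattern on its own, the following sketch anchors it to the whole input and tries it on a few strings. The helper `is_plain` and the test strings are hypothetical names introduced only for this illustration.

```matlab
printable      = '([ -$&-\[\]-~]|%%|\\\\)*';   % pattern from the code above
only_printable = ['^' printable '$'];          % anchor it to the whole input

% Hypothetical helper: true if s consists solely of "printable" text.
is_plain = @(s) ~isempty(regexp(s, only_printable, 'once'));

disp(is_plain('100%% sure'));   % doubled percent sign
disp(is_plain('100% sure'));    % single percent sign
disp(is_plain('C:\\temp'));     % doubled backslash
disp(is_plain('C:\temp'));      % single backslash
```

The first and third calls should report true and the other two false, since MATLAB char literals do not process backslash escapes and the strings reach `regexp` exactly as typed.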
What I call `atomic_op` is a pattern that matches format operators starting with a percent sign and ending with a conversion character. It uses a negative lookbehind to avoid matching fake format operators starting with `%%`. I took some shortcuts due to laziness (such as `te` being valid in my version). It should be quite functional for not-too-evil inputs.
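A small sketch of how `atomic_op` behaves by itself may help here. The test strings are made up for illustration; the last one probes a possible side effect of the lookbehind that is worth checking against `sprintf`, because there the real operator `%d` directly follows an escaped percent sign.

```matlab
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';

% "%%02d" is a literal percent sign followed by "02d"; only the second
% "%02d" is a real operator, and the lookbehind skips the first one.
ops = regexp('%%02d and %02d', atomic_op, 'match');
disp(ops);

% The shortcut mentioned above: a subtype followed by "e" is accepted.
lazy = regexp('%te', atomic_op, 'match');
disp(lazy);

% Possible side effect: in "%%%d" the "%d" is preceded by "%", so the
% lookbehind rejects it even though it follows an escaped percent sign.
missed = isempty(regexp('%%%d', atomic_op, 'match'));
disp(missed);
```

If that last case matters for your inputs, one option might be to drop the lookbehind inside `full_patt`, where `printable` already consumes `%%` in pairs; I have not tested that variant.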
The most important parts are `full_patt` and `part_patt`. The former tries to match a full format spec in order to determine whether it's valid. Unfortunately, in the case of nested groups MATLAB only stores tokens for the outermost level; in our case this would not be useful. This is where `part_patt` comes into play: it only matches "printable_string format_operator". Used together with `full_patt`, it can slice up your full string into meaningful pieces. Note that `part_patt` will often match an invalid string too, at its locally valid positions, so the two really have to be used together.
Consider the specific example above:
>> matches_full_bad
matches_full_bad =
[]
>> matches_full_good
matches_full_good =
1
>> matches_parts_good{:}
ans =
Color %s
ans =
, we are looking for %%02droids %% number1 %d
ans =
, number2 %05d
ans =
, hex %#x
ans =
, float %5.2f
ans =
, unsigned value %u
Let's analyse the results. The "bad" sample returns a (falsy) empty vector, while the "good" sample returns a (truthy) `1` for the full pattern. The partial pattern then correctly returns each relevant substring of the input. Note, however, that the final period at the end of the sentence is missing from the result, since we matched blocks of `printable atomic_op`. Since we know that we're working with a valid string, the rest of the string (after the final match) should be assigned either to a new match or to the final one, depending on your preference.
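One possible way to pick up that trailing text is sketched below, assuming the string has already passed the `full_patt` check. It asks `regexp` for the end index of each match in addition to the match itself; the variable names `last_idx` and `tail` are my own.

```matlab
sample_good = 'Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';
printable = '([ -$&-\[\]-~]|%%|\\\\)*';
atomic_op = '(?<!%)%(\d+\$)?[ +#-]*(\d+|\*)?(\.\d*)?[bt]?[diuoxXfeEgGcs]';
part_patt = [printable atomic_op];

% Ask for the matches and the index of the last character of each match.
[parts, last_idx] = regexp(sample_good, part_patt, 'match', 'end');

% Whatever follows the final match is plain trailing text.
if isempty(last_idx)
    tail = sample_good;                        % no operators at all
else
    tail = sample_good(last_idx(end)+1:end);
end

% Here the tail becomes its own element; appending it to parts{end}
% instead would attach it to the final operator.
if ~isempty(tail)
    parts{end+1} = tail;
end

fprintf('%d parts, last one: "%s"\n', numel(parts), parts{end});
```

For the sample string this should yield seven parts, the last being the lone period, and concatenating all parts should give back the original string.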
Just for clarity, here's how I imagine this working:
for sample={sample_bad,sample_good},
    if regexp(sample{1},full_patt)
        disp('Match found!');
        matches = regexp(sample,part_patt,'match');
        matches = matches{1}; % strip outermost singleton cell dimension
        for k=1:length(matches)
            fprintf('Format substring #%d: %s\n',k, matches{k});
        end
        %TODO: treat final printable part of the string
    else
        disp('Uh-oh, no match!')
    end
end
((?:[ -$&(-[\]-~]|([%'\\])\2)*(%(\d+\$)?[-+\s0#]?(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+(?:(?!(?:[ -$&(-[\]-~]|([%'\\])\7)*(?:%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+)+)(?:[ -$&(-[\]-~]|([%'\\])\8)*)?)
This regular expression will do the following:
- `(?:[ -$&(-[\]-~]|([%'\\])\2)*` will match all printable characters from space to `~`, except `%`, `\` and `'`, unless they appear exactly twice
- `(%(\d+\$)?[-+\s0#]?(\d+|\*)?(\.\d+)?[bt]?[diuoxXfeEgGcs]+)+` is your expression
- `(?:` starts the non-capture group
- `(?!(?:[ -$&(-[\]-~]|([%'\\])\7)*(?:%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+)+)` looks ahead to see if there are more format strings
- `(?:[ -$&(-[\]-~]|([%'\\])\8)*)?` if there weren't more format strings above, then this will capture the remaining printable characters
- `)` end of the capture group

Live Demo
https://regex101.com/r/sV4eX3/2
Sample text
Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.
Sample Matches
MATCH 1
1. [0-8] `Color %s`
3. [6-8] `%s`
MATCH 2
1. [8-53] `, we are looking for %%02droids %% number1 %d`
2. [40-41] `%`
3. [51-53] `%d`
MATCH 3
1. [53-67] `, number2 %05d`
3. [63-67] `%05d`
5. [65-66] `5`
MATCH 4
1. [67-76] `, hex %#x`
3. [73-76] `%#x`
MATCH 5
1. [76-89] `, float %5.2f`
3. [84-89] `%5.2f`
5. [85-86] `5`
6. [86-88] `.2`
MATCH 6
1. [89-109] `, unsigned value %u.`
3. [106-108] `%u`
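The matches above come from the regex101 demo. I have not checked how MATLAB numbers the nested groups that the backreferences `\2`, `\7` and `\8` rely on, so the sketch below is an adaptation rather than the expression itself: it spells out the doubled characters (`%%`, `''`, `\\`) explicitly, escapes `[` inside the character class, and uses only non-capturing groups. The variable names are mine.

```matlab
sample = 'Color %s, we are looking for %%02droids %% number1 %d, number2 %05d, hex %#x, float %5.2f, unsigned value %u.';

% Free text: printable ASCII except %, ' and \, or any of those doubled.
% ('''' inside a MATLAB char literal yields two quote characters.)
free_text = '(?:[ -$&(-\[\]-~]|%%|''''|\\\\)*';
% One formatting operator, as in the expression above but non-capturing.
operator  = '%(?:\d+\$)?[-+\s0#]?(?:\d+|\*)?(?:\.\d+)?[bt]?[diuoxXfeEgGcs]+';
% One or more operators in a row.
operators = ['(?:' operator ')+'];
% Trailing text is taken only when no further operator follows.
trailing  = ['(?:(?!' free_text operators ')' free_text ')?'];

chunks = regexp(sample, [free_text operators trailing], 'match');

for k = 1:numel(chunks)
    fprintf('MATCH %d: %s\n', k, chunks{k});
end
```

With the sample text this should produce six chunks that correspond to group 1 of each match listed above, with the final period attached to the last chunk. The numbered sub-groups of the demo are not reproduced, because nothing is captured here.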