
XSD regular expressions: empty string OR something else

I am trying to enforce the XSD regular expressions that I find in the SEC's EDGAR schemas, via either C# or JavaScript.

I have the following XSD simple type:

<xs:simpleType name="ACCESSION_NUMBER_TYPE">
    <xs:restriction base="xs:token">
        <xs:pattern value="[*]{0}|[0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6}"/>
    </xs:restriction>
</xs:simpleType>

It comes from eis_Common.xsd, included in the zip file you can download from the SEC's EDGARLink Online page. A near-duplicate definition can be found in eis_ABS_15GFiler.xsd, but the base for that type's restriction is xs:string:

<xs:simpleType name="ACCESSION_NUMBER_TYPE">
    <xs:restriction base="xs:string">
        <xs:pattern value="[*]{0}|[0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6}"/>
    </xs:restriction>
</xs:simpleType>

For the above pattern, I would think that a blank or null value would be allowed. I read the pattern as two clauses, OR'd together. The first clause ( [*]{0} ) matches...

the character class whose sole member is asterisk – CM Sperberg-McQueen

...zero times, which would mean an empty string or a null XML node value. The second clause ( [0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6} ) matches "one to ten digits, hyphen, one or two digits, hyphen, one to six digits".

But the SEC rejects an XML node of the above simple type when it has a null or empty value.

This one pattern is the exception in my approach. For every other simple type I've tested that the SEC's EDGAR schemas define by a regex pattern, including types with multiple patterns and unions of simple regex types, my approach works. Only for this one expression do I generate XML that I would say is valid but that the SEC rejects.

So this is a sanity check. If I wrap the above pattern expression as ^(<expr>)$ and test it against a null or empty string, it matches in both C# and js, due to the first clause. Correct? Am I missing something about XSD regex?


For a js sample, using regex101.com:

Flavor: javascript

Regular Expression: ^([*]{0}|[0-9]{1,10}-[0-9]{1,2}-[0-9]{1,6})$

Modifiers: gm

Test String:

1-1-1

3

5
6-6-6

Matches: lines 1, 2, 4, 6

But the SEC essentially tells me that the expression should only match lines 1 and 6.
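The regex101 test can also be reproduced as a short script. This is only a minimal sketch: it assumes each test line is checked as a separate string (so the m modifier is not needed), and the names accessionPattern, testLines, and matchingLines are illustrative, not taken from the EDGAR schemas.

```javascript
// The XSD pattern, wrapped in explicit anchors for JavaScript.
const accessionPattern = /^(?:[*]{0}|[0-9]{1,10}-[0-9]{1,2}-[0-9]{1,6})$/;

// The six test lines from the sample; lines 2 and 4 are empty strings.
const testLines = ["1-1-1", "", "3", "", "5", "6-6-6"];

// Collect the 1-based numbers of the lines that match.
const matchingLines = testLines
  .map((line, index) => (accessionPattern.test(line) ? index + 1 : null))
  .filter((lineNumber) => lineNumber !== null);

console.log(matchingLines); // [ 1, 2, 4, 6 ]
```

The empty lines match through the first clause, which is the behavior in question.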


@kjhughes

No, a blank (single whitespace character) would not be allowed.

To alleviate confusion I cleaned up some verbiage and replaced "blank" with "empty". What I meant was a string that would be empty in C# ( == "" ) or js ( === "" ). I would expect that to be treated the same as a null value, and to be matched by ^([*]{0}|...)$ ( js: /^([*]{0}|...)$/ ). The XML snippet being tested would ultimately be:

...
<ns:ACCESSION_NUMBER_TYPE></ns:ACCESSION_NUMBER_TYPE>
...

Regular expressions in XSD are implicitly anchored at start and end with ^ and $.

I believe I understand the section of the XSD spec on implicit anchoring, which is why I have been translating it into C# or js regex validation by explicitly wrapping the XSD pattern in start-of-line, capture group, and end-of-line anchors ( ^(...)$ ), as in the example above. For js it would additionally be wrapped in /.../ .

Is this not a safe assumption? It works for every other pattern in the EDGAR schemas, patterns that many end users have exercised over many months and in several different contexts. That is approximately 60 patterns with which I have seen no issues.

That is why I am confident in my reading of what the pattern means within the scope of XSD regex, and I agree with your answer regarding the treatment of null values. Would you extend it to a C#/js empty string, which would result in an XML node like the one I have illustrated above? Perhaps I have crept beyond the scope of my own question :D

For the above simple type, I would think that blank or null value would be allowed.

Yes, a null value (zero-length string) would be allowed.

No, a blank (single whitespace character) would not be allowed.

If I wrap the above pattern expression, ^(<expr>)$, and test against a null or blank string, it matches in both C# and js, due to the first clause. Correct? Am I missing something about XSD regex?

Regular expressions in XSD are implicitly anchored at start and end with ^ and $.

Per the spec:

Note: Unlike some popular regular expression languages (including those defined by Perl and standard Unix utilities), the regular expression language defined here implicitly anchors all regular expressions at the head and tail, as the most common use of regular expressions in ·pattern· is to match entire literals.
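To mirror that implicit anchoring outside an XSD validator, the pattern can be wrapped in a group before the anchors are added. The following is a minimal sketch, not a general XSD-to-JavaScript converter: the helper name xsdPatternToRegExp is hypothetical, and it assumes the pattern uses only syntax that XSD and JavaScript regexes share (XSD-only constructs such as \i, \c, or character class subtraction would need separate handling).

```javascript
// Hypothetical helper: anchor an XSD pattern the way XSD does implicitly.
// The non-capturing group makes the anchors apply to the whole alternation.
function xsdPatternToRegExp(xsdPattern) {
  return new RegExp(`^(?:${xsdPattern})$`);
}

// The pattern value exactly as written in the schema.
const xsdPattern = String.raw`[*]{0}|[0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6}`;
const accessionNumber = xsdPatternToRegExp(xsdPattern);

console.log(accessionNumber.test(""));                     // true  (first clause)
console.log(accessionNumber.test(" "));                    // false (a single space)
console.log(accessionNumber.test("0001234567-24-000123")); // true  (second clause)

// Without the group, ^ binds only to the first alternative and $ only to
// the second, so the empty first clause matches at the start of any input.
const ungrouped = new RegExp(`^${xsdPattern}$`);
console.log(ungrouped.test("not an accession number"));    // true
```

In .NET, $ also matches just before a trailing newline, so a C# translation that needs strict end-of-input matching would use \z in place of $.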


Update per further OP question edits

Yes, to be very concrete, this XML:

<a></a>

would be valid against this XSD:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="ACCESSION_NUMBER_TYPE">
    <xs:restriction base="xs:string">
      <xs:pattern value="[*]{0}|[0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6}"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="a" type="ACCESSION_NUMBER_TYPE"/>

</xs:schema>

Would you extend it to a C#/js empty string, which would result in an XML node like I have illustrated above?

The string value of an empty element, such as the a element shown above, would be an empty string in C#, JavaScript, Java, Python, or any other language.
