
XSD regular expressions: empty string OR something else

I am trying to enforce the XSD regular expressions that I find in the SEC's EDGAR schemas, either via C# or js.

I have the following XSD simple type:

<xs:simpleType name="ACCESSION_NUMBER_TYPE">
    <xs:restriction base="xs:token">
        <xs:pattern value="[*]{0}|[0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6}"/>
    </xs:restriction>
</xs:simpleType>

It comes from eis_Common.xsd, included in the zip file you can download from the SEC's EDGARLink Online page. A near-duplicate definition can be found in eis_ABS_15GFiler.xsd, but the base for that type's restriction is xs:string.

<xs:simpleType name="ACCESSION_NUMBER_TYPE">
    <xs:restriction base="xs:string">
        <xs:pattern value="[*]{0}|[0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6}"/>
    </xs:restriction>
</xs:simpleType>

For the above pattern, I would think that a blank or null value would be allowed. I translate the above pattern as two clauses, OR'd together. The first clause ([*]{0}) matches...

the character class whose sole member is asterisk – CM Sperberg-McQueen

...zero times, which would mean an empty string or a null XML node value. The second clause ([0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6}) matches "one to ten digits, hyphen, one to two digits, hyphen, one to six digits".
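The reading of the two clauses can be checked one clause at a time. The following is only a minimal sketch in JavaScript: the names `emptyClause`, `numberClause`, and `classifyAccessionValue` are hypothetical, and the explicit `^(?:...)$` wrapping is an assumption standing in for XSD's implicit anchoring.

```javascript
// Each alternative of the XSD pattern, anchored on its own.
// Assumption: ^(?:...)$ without the "m" flag stands in for XSD's implicit anchoring.
const emptyClause = /^(?:[*]{0})$/; // an asterisk repeated zero times: only ""
const numberClause = /^(?:[0-9]{1,10}-[0-9]{1,2}-[0-9]{1,6})$/;

// Hypothetical helper: reports which clause (if any) accepts the value.
function classifyAccessionValue(value) {
  if (emptyClause.test(value)) return "empty clause";
  if (numberClause.test(value)) return "number clause";
  return "no match";
}

console.log(classifyAccessionValue(""));                     // empty clause
console.log(classifyAccessionValue("0001234567-24-000001")); // number clause
console.log(classifyAccessionValue("*"));                    // no match
```

Note that a literal asterisk is rejected: `[*]{0}` allows the asterisk zero times, so the only string the first clause accepts is the empty one.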

But the SEC rejects an XML node corresponding to the above simple type which has a null or empty value.

This one particular pattern is the exception in my approach. For every other simple type that I've tested which is defined in the SEC's EDGAR schemas by a regex pattern, including multiple patterns, and unions of simple regex types, my approach works. It is this one expression for which I am generating XML that I would say is valid, but that the SEC rejects.

So this is a sanity check. If I wrap the above pattern expression as ^(<expr>)$ and test it against a null or empty string, it matches in both C# and js because of the first clause. Correct? Am I missing something about XSD regex?


For a js sample, using regex101.com

Flavor: javascript

Regular Expression: ^([*]{0}|[0-9]{1,10}-[0-9]{1,2}-[0-9]{1,6})$

Modifiers: gm

Test String:

1-1-1

3

5
6-6-6

Matches: lines 1, 2, 4, 6

But the SEC essentially tells me that the expression should only match lines 1 and 6.
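The regex101 result can be reproduced outside the browser. This is a rough sketch, assuming that splitting the test string on newlines and testing each line as a whole is equivalent to the `gm` run above; the variable names are made up for illustration.

```javascript
// Same expression as in the regex101 sample, without flags:
// each line is tested as a complete string instead of using "gm".
const linePattern = /^([*]{0}|[0-9]{1,10}-[0-9]{1,2}-[0-9]{1,6})$/;

// The six-line test string from the sample (lines 2 and 4 are empty).
const sampleLines = "1-1-1\n\n3\n\n5\n6-6-6".split("\n");

// Collect the 1-based numbers of the lines that match.
const matchingLines = sampleLines
  .map((line, index) => ({ line, number: index + 1 }))
  .filter(({ line }) => linePattern.test(line))
  .map(({ number }) => number);

console.log(matchingLines.join(", ")); // 1, 2, 4, 6
```

The two empty lines are matched by the first clause, which is exactly the behavior in question.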


@kjhughes

No, a blank (single whitespace character) would not be allowed.

To alleviate confusion I cleaned up some verbiage and replaced "blank" with "empty". What I meant was a string that would be empty in C# ( == "" ) or js ( === "" ). I would expect that to be treated the same as a null value, and be matched by ^([*]{0}|...)$ ( js: /^([*]{0}|...)$/ ). The XML snippet being tested would ultimately be:

...
<ns:ACCESSION_NUMBER_TYPE></ns:ACCESSION_NUMBER_TYPE>
...

Regular expressions in XSD are implicitly anchored at start and end with ^ and $.

I believe I understand the section of the xsd spec on implicit anchoring, which is why I have been trying to translate that into C# or js regex validation, by explicitly wrapping the xsd pattern in the begin line, capture, end line ( ^(...)$ ) anchors in the example above. For js it would additionally be wrapped in /.../ .

Is this not a safe assumption? This works for every other pattern in the EDGAR schemas that have been used by many end users over the course of many months, and several different contexts. That is approximately 60 patterns that I have seen no issues with.

Which is why I am confident in my assessment of what the pattern actually means within the scope of XSD regex, and I agree with your answer regarding treatment of null values. Would you extend it to a C#/js empty string, which would result in an XML node like I have illustrated above? Perhaps I have crept beyond the scope of my own question :D

For the above simple type, I would think that blank or null value would be allowed.

Yes, a null value (zero-length string) would be allowed.

No, a blank (single whitespace character) would not be allowed.

If I wrap the above pattern expression, ^(<expr>)$, and test against a null or blank string, it matches in both C# and js, due to the first clause. Correct? Am I missing something about XSD regex?

Regular expressions in XSD are implicitly anchored at start and end with ^ and $.

Per the spec:

Note: Unlike some popular regular expression languages (including those defined by Perl and standard Unix utilities), the regular expression language defined here implicitly anchors all regular expressions at the head and tail, as the most common use of regular expressions in ·pattern· is to match entire literals.
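To mirror that implicit anchoring when checking values in JavaScript, one option is to wrap the XSD pattern before compiling it. The sketch below is an illustration under assumptions, not a general XSD-to-JavaScript translator: `xsdPatternToRegExp` is a hypothetical helper, and it only works for patterns that stay within the syntax the two regex dialects share (XSD-only constructs such as `\i`, `\c`, or character class subtraction are not handled).

```javascript
// Hypothetical helper: makes XSD's implicit head/tail anchoring explicit.
function xsdPatternToRegExp(xsdPattern) {
  // The non-capturing group keeps every "|" alternative inside the anchors.
  // No "m" flag: ^ and $ refer to the start and end of the whole string.
  return new RegExp("^(?:" + xsdPattern + ")$");
}

// The pattern from ACCESSION_NUMBER_TYPE (backslashes doubled for the JS string).
const accessionNumber = xsdPatternToRegExp(
  "[*]{0}|[0-9]{1,10}\\-[0-9]{1,2}\\-[0-9]{1,6}"
);

console.log(accessionNumber.test(""));        // true  (zero-length string)
console.log(accessionNumber.test(" "));       // false (single whitespace character)
console.log(accessionNumber.test("1-1-1"));   // true
console.log(accessionNumber.test("1-1-1\n")); // false (trailing newline is not ignored)
```

One difference to keep in mind when porting this to C#: in .NET, `$` also matches just before a final newline, so `\z` is the stricter end anchor there.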


Update per further OP question edits

Yes, to be very concrete, this XML:

<a></a>

Would be valid against this XSD:

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:simpleType name="ACCESSION_NUMBER_TYPE">
    <xs:restriction base="xs:string">
      <xs:pattern value="[*]{0}|[0-9]{1,10}\-[0-9]{1,2}\-[0-9]{1,6}"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:element name="a" type="ACCESSION_NUMBER_TYPE"/>

</xs:schema>

Would you extend it to a C#/js empty string, which would result in an XML node like I have illustrated above?

The string value of an empty element such as the <a></a> shown above would be an empty string in C#, JavaScript, Java, Python, or any other language.
