What is the difference between an XML attribute and an XML token?

I have been reading through BaseX's documentation and found that they offer a token index as well as an attribute index. However, the difference between the two is not clear to me.

Attributes seem to be the regular attributes as I know them:

<node attribute="value"/>

However, for tokens the documentation reads:

In many XML dialects, such as HTML or DITA, multiple tokens are stored in attribute values.

So it would almost seem as if tokens are values for attributes? So, like this:

<node attribute="token1 token2"/>

If that is the case, what is indexed in both these cases then? If the attribute index improves equality checks such as

//country[@car_code = 'J']

and a token index improves containment checks such as

//div[contains-token(@class, 'row')]

isn't a token index then simply an advanced attribute index, working with multiple values? Or am I missing something? When would one use the one or the other, and are they ever useful in combination?

Unfortunately, "token" means different things in XPath, XML, XML Schema, DTDs, and other related technologies, which can make the term unclear when it comes up.

Here they are referring to token in the sense of a string made up of XML name chars.

Of the many ways an attribute can be defined, one is as a list of tokens separated by whitespace, with no meaning assigned to the order of the tokens. To take one of the examples you quote:

//div[contains-token(@class, 'row')]

This would match each of:

<div class="row">
<div class="row important">
<div class="important row">
<div class="important        row               warning">

It would not match any of:

<div class="rows">
<div class="arrow">
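The matching behaviour above can be mimicked in a few lines of Python. This is only a sketch of the whitespace-tokenizing semantics for a single string: it ignores collations, and the name contains_token is my own helper, not a library function.

```python
import re

# XML whitespace: space, tab, carriage return, line feed.
_XML_WS = " \t\r\n"

def contains_token(value: str, token: str) -> bool:
    """Rough stand-in for fn:contains-token applied to one string."""
    token = token.strip(_XML_WS)
    if not token:
        # An empty (or all-whitespace) token never matches.
        return False
    # Split on runs of whitespace and drop empty pieces at the ends.
    tokens = [t for t in re.split(r"[ \t\r\n]+", value) if t]
    return token in tokens

for cls in ["row", "row important", "important row",
            "important        row               warning",
            "rows", "arrow"]:
    print(repr(cls), contains_token(cls, "row"))
```

The first four values print True and the last two print False, because the comparison is against whole tokens rather than substrings.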

isn't a token index then simply an advanced attribute index, working with multiple values?

Yes, and a very useful one. Hand-writing a test that matches all four of the cases above while rejecting the other two would be very fiddly, and the need comes up often (the example above corresponds to the CSS selector div.row, for example).

Also, note that while a very common use-case for this function is with attribute values, it operates on any string, so it could also be element text, the result of another string function, an entire imported document, etc.

When would one use the one or the other

Really it's a matter of what you care about. Is your query "I want to match all <div>s that have a class attribute of 'row'", or is it "I want to match all <div>s that have a class attribute containing the token 'row'"? In HTML or XHTML, considering how class is used, we'd probably be in the latter case most of the time.

and are they ever useful in combination?

In a way, they already are in combination; you are using the [] and @ to identify nodes that have a particular attribute, and then using the contains-token function to specify what you do in filtering the values of those attributes.

We generally wouldn't do both an = test and a contains-token test on the same attribute, as the = should suffice: if we have a requirement on what the entire contents of the attribute must be, then any requirement on which tokens are present is entailed by it. Of course, all sorts of surprising rare cases can happen in coding, especially when we are bringing two or more separate criteria together. It's more common to have the two kinds of test working on separate attributes:

//a[@href = 'http://example.net/'][contains-token(@class, 'cool')]

This uses = on one attribute and contains-token on another.
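To see the two kinds of test side by side, here is a small sketch using Python's standard xml.etree.ElementTree. Its XPath subset supports the [@attr='value'] equality predicate but has no contains-token, so the token test is done in plain Python. The sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
doc = ET.fromstring("""
<body>
  <a href="http://example.net/" class="cool link">one</a>
  <a href="http://example.net/" class="uncool">two</a>
  <a href="http://example.org/" class="cool">three</a>
</body>
""")

# Equality test on the whole href value.
candidates = doc.findall(".//a[@href='http://example.net/']")

# Token test on class: split on whitespace, compare whole tokens.
matches = [a.text for a in candidates
           if "cool" in a.get("class", "").split()]

print(matches)
```

Only the first link passes both tests: the second has the right href but "uncool" is not the token "cool", and the third has the token but a different href.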

(Again, contains-token isn't really a type of index; it's a string function that works on strings and is often useful within indices.)

The term "name token" originates from SGML attributes declared in a DTD like this

<!ATTLIST your-element an-attribute-name NMTOKEN #IMPLIED>

or an enumerated attribute value declaration like this

<!ATTLIST your-element an-attribute-name (value1|value2) #IMPLIED>

(or similarly with the attribute types ID , IDREF , NAME , NAMES , NMTOKENS etc., where a NMTOKEN attribute can begin with a digit, but NAME must not).
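The difference between a name and a name token can be sketched with two regular expressions. This is a simplification I am assuming for illustration: it covers only the ASCII subset of the XML 1.0 Name and Nmtoken productions (SGML's defaults differ in detail), and is_name and is_nmtoken are hypothetical helper names.

```python
import re

# ASCII-only approximation: a name must start with a letter, "_" or ":";
# a name token may start with any name character, including a digit.
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")
NMTOKEN_RE = re.compile(r"[A-Za-z0-9_:.\-]+")

def is_name(s: str) -> bool:
    return NAME_RE.fullmatch(s) is not None

def is_nmtoken(s: str) -> bool:
    return NMTOKEN_RE.fullmatch(s) is not None

for s in ["value1", "1st", "two words"]:
    print(s, is_name(s), is_nmtoken(s))
```

Here "value1" is both a name and a name token, "1st" is only a name token, and "two words" is neither, since whitespace separates tokens.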

In an XML instance, a name token or enumerated attribute can be used as follows

<your-element an-attribute-name="whatever">

(for the first example), or

<your-element an-attribute-name="value1">

(for the second).

In SGML, enumerated attributes can be written in a short form syntax like this

<your-element value1>

i.e. the attribute name can be omitted, provided that value1 is unique among all name tokens in the DTD.

HTML has so-called Boolean Attributes which are basically SGML enumerated attributes with the additional proviso that the attribute value must be identical to the attribute name. Examples:

<div hidden>
<option selected>

etc. HTML has the additional quirk that it may expose true / false as DOM attribute values.

That said, the documentation you linked says that

The XQuery functions fn:contains-token , fn:tokenize and fn:idref are rewritten for index access whenever possible.
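To make the question of what is indexed concrete, here is a conceptual sketch of the two kinds of index as Python dictionaries. This is not BaseX's actual data structure, only an illustration of the idea; the element ids and class values are invented.

```python
from collections import defaultdict

# Hypothetical (element id, class attribute value) pairs.
attrs = [(1, "row"), (2, "row important"), (3, "important row"), (4, "rows")]

attribute_index = defaultdict(set)  # whole attribute value -> element ids
token_index = defaultdict(set)      # single token -> element ids

for elem_id, value in attrs:
    attribute_index[value].add(elem_id)
    for tok in value.split():
        token_index[tok].add(elem_id)

# @class = 'row' is a lookup on the whole value.
print(sorted(attribute_index.get("row", set())))
# contains-token(@class, 'row') is a lookup on one token.
print(sorted(token_index.get("row", set())))
```

The attribute index finds only element 1 for "row", while the token index finds elements 1, 2 and 3. Element 4 ("rows") appears under neither lookup.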
