case insensitive search - xpath

I'm trying to do a case-insensitive search on my XML document using the XPath expression below. Apparently I've got it wrong, since the results are different. Hoping someone here can point out my mistake?

I'm trying to get a count of all Obj elements under <Sect> where the <Header> value is Primary Objectives. To get the count, I'm using the below expression which works great.

Expression - without case-insensitive matching: returns 31 nodes.

("count(//TaggedPDF-doc//Part//Sect//Sect//Sect[contains(Header,\"Primary objectives\")]//OBJ)");

But I want the match on "Primary Objectives" to be case-insensitive, so I was trying to use translate() for that. Expression - adding translate() to make "Primary Objectives" case-insensitive:

Returns 0 nodes.

$count = $dom->findvalue("count(//TaggedPDF-doc//Part//Sect//Sect//Sect[contains(H4,
         translate(\"Primary Objectives\", 
                   'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 
                   'abcdefghjiklmnopqrstuvwxyz')
         )
]//OBJ)");

Hoping someone here can point out where I got this wrong.

Thanks in advance, Simak

First off, you probably don't need all those // steps: a single // already allows any number of element levels between the nodes named on either side. Either enumerate the full path from the root using single / steps, or just use one // to search the whole tree.

Secondly, you need to downcase the Header value you're comparing, not the fixed string you're comparing against. Try something more like

count(//Sect[
          Header[
            contains(
              translate(
                .,
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                'abcdefghijklmnopqrstuvwxyz'),
              'primary objectives'
            )
          ]
        ]//Obj)

which would give you the count of Obj elements that occur anywhere inside a Sect that has any Header child containing "primary objectives" (case-insensitive). This is slightly different from

count(//Sect[contains(translate(Header, ....

in the case where the Sect contains more than one Header - the latter would only check the first Header in each Sect rather than looking for a match in any of them.
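
The difference between the two forms can be sketched outside XPath. The following is a minimal Python model, not the Perl code from the question: the sample document, the helper names (text_of, count_obj, any_header, first_header) and the use of str.lower() in place of translate() are all assumptions made for illustration. It relies on the XPath 1.0 rule that a node-set passed to a string function is converted using only its first node in document order.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the second Sect has two Header children,
# and only the second of them matches.
XML = """
<Doc>
  <Sect>
    <Header>PRIMARY OBJECTIVES</Header>
    <Obj/><Obj/>
  </Sect>
  <Sect>
    <Header>Overview</Header>
    <Header>Primary objectives, continued</Header>
    <Obj/>
  </Sect>
  <Sect>
    <Header>Secondary objectives</Header>
    <Obj/>
  </Sect>
</Doc>
"""

NEEDLE = "primary objectives"

def text_of(elem):
    """String value of an element: all descendant text concatenated."""
    return "".join(elem.itertext())

def count_obj(root, sect_matches):
    """Count distinct Obj descendants of every Sect accepted by sect_matches."""
    seen = set()
    for sect in root.iter("Sect"):
        if sect_matches(sect):
            seen.update(id(obj) for obj in sect.iter("Obj"))
    return len(seen)

def any_header(sect):
    # Models Header[contains(translate(., ...), 'primary objectives')]
    return any(NEEDLE in text_of(h).lower() for h in sect.findall("Header"))

def first_header(sect):
    # Models contains(translate(Header, ...), 'primary objectives'):
    # only the first Header child is converted to a string.
    first = sect.find("Header")
    return first is not None and NEEDLE in text_of(first).lower()

root = ET.fromstring(XML)
print(count_obj(root, any_header))    # any Header may match
print(count_obj(root, first_header))  # only the first Header is checked
```

With this sample the first count is 3 and the second is 2, because the Sect whose matching Header comes second is only picked up by the "any Header" form.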

If you've got access to an XPath 2.0 (or later) implementation, or to XQuery, which includes XPath 2.0, you could use lower-case():

count(
  //TaggedPDF-doc//Part//Sect//Sect//Sect[
    contains(lower-case(H4), 'exclusion criteria')
  ]//OBJ
)

Perl interfaces for XPath 2.0 processors (actually XML databases with XQuery support) exist for eXist DB, BaseX, Saxon and lots of others.

You need to fold both strings:

contains(translate(Header, '...', '...'), 'primary objectives')

Note that you can use

# Letters of "primary objectives"
'ABCEIJMOPRSTVY', 'abceijmoprstvy'

instead of the larger but still limited set

 # Some of the latin letters
'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 'abcdefghjiklmnopqrstuvwxyz'
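
To see why the reduced letter set is enough, and why folding only the search string finds nothing, here is a small Python model of XPath 1.0's translate() combined with a substring test. It is a sketch under assumptions: xpath_translate is a hypothetical helper written for this illustration (it is not part of XML::LibXML), and the header text is made up.

```python
def xpath_translate(s, src, dst):
    """Model of XPath 1.0 translate(): src[i] becomes dst[i];
    characters listed in src without a counterpart in dst are removed."""
    out = []
    for ch in s:
        i = src.find(ch)          # first occurrence in src wins
        if i == -1:
            out.append(ch)        # not listed: keep unchanged
        elif i < len(dst):
            out.append(dst[i])    # listed with a replacement
        # listed without a replacement: character is dropped
    return "".join(out)

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"
NEEDLE_UPPER = "ABCEIJMOPRSTVY"   # letters of "primary objectives"
NEEDLE_LOWER = "abceijmoprstvy"

header = "PRIMARY Objectives (Phase II)"   # hypothetical header text

# Folding only the search string leaves the header's case untouched.
only_needle_folded = xpath_translate("Primary Objectives", UPPER, LOWER) in header

# Folding the header and comparing against a lower-case literal.
header_folded = "primary objectives" in xpath_translate(header, UPPER, LOWER)

# Same test with the reduced alphabet: every letter that can take part
# in a match is still folded, so the result does not change.
header_folded_reduced = "primary objectives" in xpath_translate(
    header, NEEDLE_UPPER, NEEDLE_LOWER
)

print(only_needle_folded, header_folded, header_folded_reduced)
```

For this header the script prints False True True: the question's expression corresponds to the first test, the corrected one to the second and third.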

What you are actually doing is checking whether the content of H4 contains "Exclusion criteria" converted to lowercase:

$count = $dom->findvalue("count(//TaggedPDF-doc//Part//Sect//Sect//Sect[contains(H4,
         translate(\"Exclusion criteria\",
                   'ABCDEFGHJIKLMNOPQRSTUVWXYZ',
                   'abcdefghjiklmnopqrstuvwxyz')
         )
]//OBJ)");

It would be the same as doing:

$count = $dom->findvalue("count(//TaggedPDF-doc//Part//Sect//Sect//Sect[contains(
        H4, \"exclusion criteria\"
     )
]//OBJ)");

What you want is to translate the content of H4 to lowercase and compare it to the lowercase version of what you are searching for, in this case "exclusion criteria":

$count = $dom->findvalue("count(//TaggedPDF-doc//Part//Sect//Sect//Sect[contains(
     translate(H4, 
         'ABCDEFGHJIKLMNOPQRSTUVWXYZ', 
         'abcdefghjiklmnopqrstuvwxyz'), 
     \"exclusion criteria\"
     )
]//OBJ)");
