Retrieving Value from XML using grep and regular expressions

My build system returns the response below. The build generates multiple artifacts, and I want to extract the link to one particular artifact, say something.exe, from that response.

<Artifacts>
    <artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07" 
            endtime="2017-04-21 00:59:54.680601-07"
            status="succeeded"
            change="e850b01967222464ffca02bf94dc711236fa978a"
            released="no">
        <file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
    </artifact>
</Artifacts>

I would like to know a way to extract just the URL for something.exe. I have tried piping the curl output into grep -E with a regular expression, but that gives me the entire line instead.

curl -s --request GET 'http://build.system.org/path/to/artifact/folder/api/?build=13321123' | grep -E 'file url='
curl -s --request GET 'http://build.system.org/path/to/artifact/folder/api/?build=13321123' | grep -E 'file url="http\S+something\.exe"'

Is there a way to extract just the following?

http://build.system.org/path/to/artifact/folder/something.exe

The right way in this case would be to use an XML tool, such as xmlstarlet.

That, of course, requires well-formed XML input, such as:

<artifact name="artifact1" version="1.0" buildId="13321123" make_target="beta" branch="branchName" date="2017-04-21 00:31:38.74856-07" 
        endtime="2017-04-21 00:59:54.680601-07"
       status="succeeded"
       change="e850b01967222464ffca02bf94dc711236fa978a"
       released="no">
    <file url="http://build.system.org/path/to/artifact/folder/MD5SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA1SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/>
</artifact>

The command:

xmlstarlet sel -t -v "//artifact/file[contains(@url,'something.exe')]/@url" -n xmlfile

The output:

http://build.system.org/path/to/artifact/folder/something.exe

-v option (or --value-of): prints the value of an XPath expression.

The XPath contains() function returns true if the first argument string contains the second argument string, and false otherwise.
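
Because contains() matches anywhere in the string, it would also select a URL that merely includes the name, for example a hypothetical .../something.exe.sha1. XPath 1.0 has no ends-with() function, so a stricter match can be sketched with substring() and string-length(). The helper name extract_url below is an assumption made for illustration, and the sketch assumes the file name contains no single quotes.

```shell
# Hypothetical helper: print the url attribute of every <file> element
# whose url ends with the given file name.
# $1 = file name to look for; remaining arguments = XML files
# (xmlstarlet reads standard input when no file is given).
extract_url() {
    name=$1
    shift
    # substring(s, length(s) - length(name) + 1) is the tail of s that has
    # the same length as name, so comparing it to name emulates ends-with().
    xmlstarlet sel -t \
        -v "//file[substring(@url, string-length(@url) - string-length('$name') + 1) = '$name']/@url" \
        -n "$@"
}
```

It could be called as extract_url something.exe xmlfile, or at the end of a pipeline as curl -s ... | extract_url something.exe.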

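As RomanPerekhrest said, use an XML parser for this kind of task. For your example input you could use xmlstarlet like this (the binary is installed as xml on some systems and as xmlstarlet on others; with no input file given it reads from standard input, so the curl output can be piped into it):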

xml sel -t -m 'Artifacts/artifact/file [contains(@url, "something.exe")]' -v @url

Output:

http://build.system.org/path/to/artifact/folder/something.exe

This regex should work: ([\w\d\s]*\.exe)"\/> (it matches a string of the form somename.exe"/>, where somename consists of letters, digits, underscores, or whitespace; hyphens are not covered). Note that it matches only the file name and the closing "/>, not the full URL.

$ regex='([\w\d\s]*\.exe)"\/>'
$ echo "$input" | grep -oP "$regex"

Though, as someone mentioned above, you shouldn't use regex to parse XML; use an XML parser.
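
If no XML parser is installed, a narrower grep can still print just the URL rather than the whole line. The following is only a rough sketch: it assumes a grep that supports -o (GNU and BSD grep do), that URLs never contain a double quote, and it uses a hypothetical variable named response holding a shortened copy of the XML from the question.

```shell
# Shortened copy of the response from the question, held in a variable for illustration.
response='<Artifacts><artifact name="artifact1"><file url="http://build.system.org/path/to/artifact/folder/SHA256SUM.txt"/><file url="http://build.system.org/path/to/artifact/folder/something.exe"/><file url="http://build.system.org/path/to/artifact/folder/something_x64.msi"/></artifact></Artifacts>'

# -o prints only the matched text instead of the entire line.
# [^"]* cannot cross a closing quote, so the match stays inside one url attribute.
# The trailing " pins the match to the end of the attribute value; tr then strips it.
url=$(printf '%s\n' "$response" | grep -oE 'http[^"]*/something\.exe"' | tr -d '"')

printf '%s\n' "$url"
```

In a real pipeline the printf would be replaced by the curl command from the question. This stays a fallback for simple, predictable input; it does not understand XML and will break if the attribute quoting or layout changes.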
