Filter out single instance of string from a single line containing multiple similar matches with grep or sed?

I am writing a shell script to download a particular experimental branch of Blender from their website. When I curl the page, all versions come back as one very long line of HTML. I can grep (ripgrep, specifically) for only the Linux versions, but I cannot narrow it down further with grep or sed, because every file URL starts with "https://" and ends with ".tar.xz".

And they are all on the same line, so a pattern that starts at the beginning of the first URL also runs to the end of the very last one.

os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">asset-browser-poselib</span><small>May 22, 05:26:55 - asset-browser-poselib - fba8de2e8688 - tar.xz - 149.56MB</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz.sha256" title="Download linux 64bit sha256 file" class="js-ga" ga_label="linux 64bit sha256 file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">asset-browser-poselib</span><small>May 22, 05:26:55 - asset-browser-poselib - fba8de2e8688 - sha256 - 65.00B</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">cycles-x</span><small>May 22, 05:03:02 - cycles-x - a117a9c63c3a - tar.xz - 143.11MB</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz.sha256" title="Download linux 64bit sha256 file" class="js-ga" ga_label="linux 64bit sha256 file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var 
next">cycles-x</span><small>May 22, 05:03:02 - cycles-x - a117a9c63c3a - sha256 - 65.00B</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-debug.tar.xz.sha256" title="Download linux 64bit sha256 file" class="js-ga" ga_label="linux 64bit sha256 file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">override-recursive-resync</span><small>May 20, 12:38:57 - override-recursive-resync - 0d2c5bf06726 - sha256 - 65.00B</small></span><span class="build">x64</span><span class="size">debug</span></a></li><li class="os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-debug.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">override-recursive-resync</span><small>May 20, 12:38:56 - override-recursive-resync - 0d2c5bf06726 - tar.xz - 157.56MB</small></span><span class="build">x64</span><span class="size">debug</span></a></li><li class="os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-release.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">override-recursive-resync</span><small>May 20, 11:50:22 - override-recursive-resync - 0d2c5bf06726 - tar.xz - 149.73MB</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a 
href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-release.tar.xz.sha256" title="Download linux 64bit sha256 file" class="js-ga" ga_label="linux 64bit sha256 file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">override-recursive-resync</span><small>May 20, 11:50:22 - override-recursive-resync - 0d2c5bf06726 - sha256 - 65.00B</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" ><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+profiler-editor.ab200c6eddc6-linux.x86_64-release.tar.xz" title="Download linux 64bit tar.xz file" class="js-ga" ga_label="linux 64bit tar.xz file" ga_type="button" ga_cat="download"><span class="name">Blender 3.0.0 - <span class="build-var next">profiler-editor</span><small>May 20, 04:54:26 - profiler-editor - ab200c6eddc6 - tar.xz - 149.54MB</small></span><span class="build">x64</span><span class="size">release</span></a></li><li class="os linux" style="display:none;"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+profiler-editor.ab200c6eddc6-linux.x86_64-release.tar.xz

I tried using ripgrep (or grep): rg -o 'https.*tar\.xz' But that matches from the first filename all the way to the last. Maybe using AND logic in grep could help?

The URL from the string that I want is the following:

https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz

How could I filter out that specific URL string if they start and end the same?

With GNU grep, you could use non-greedy matching:

grep -oP 'https?:\/\/.*?tar\.xz' Input_file

Explanation: The -o option prints only the matched part of the line, and -P enables PCRE regular expressions. The pattern matches from http or https up to the nearest tar.xz, because .*? is non-greedy. This prints every matching URL in the file, one per line.

NOTE: The command above prints the results to the terminal. To save the output back into Input_file itself, append > temp && mv temp Input_file to the command.
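To make the greedy versus non-greedy difference concrete, here is a minimal sketch that assumes GNU grep built with -P support. The html variable is a shortened, two-link stand-in for the real page (the URLs are taken from the question, the surrounding markup is trimmed), and the final grep -F filter on cycles-x is an added step that is not part of the answer above.

```shell
# Two links on ONE line, as in the question (markup trimmed for illustration).
asset='<li class="os linux"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz">x64</a></li>'
cycles='<li class="os linux"><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz">x64</a></li>'
html="$asset$cycles"

# Greedy: .* runs to the LAST tar.xz on the line, giving one oversized match.
greedy=$(printf '%s\n' "$html" | grep -oP 'https?://.*tar\.xz')

# Non-greedy: .*? stops at the FIRST tar.xz after each https, one URL per line.
lazy=$(printf '%s\n' "$html" | grep -oP 'https?://.*?tar\.xz')

# With one URL per line, an ordinary fixed-string filter picks the branch.
wanted=$(printf '%s\n' "$lazy" | grep -F 'cycles-x')

printf 'greedy matches: %s\n' "$(printf '%s\n' "$greedy" | grep -c '')"
printf 'lazy matches:   %s\n' "$(printf '%s\n' "$lazy" | grep -c '')"
printf 'wanted: %s\n' "$wanted"
```

On the real page the .tar.xz.sha256 links also contain tar.xz, so the non-greedy pattern would print each tarball URL twice; piping through sort -u is one way to collapse the duplicates.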

Here's a way using the CLI HTML parser pup:

curl -s https://builder.blender.org/download/experimental/ \
    | pup 'li.linux > a[href*="cycles-x"] attr{href}' \
    | grep '\.tar\.xz$'

printing

https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz

The selector li.linux > a[href*="cycles-x"] selects <a> elements that contain cycles-x in their href attribute and are children of a list item with class linux. The display function attr{href} prints the value of the href attribute.

On its own, the pup command returns two lines: the URL we want, and the URL for the checksum. CSS supports multiple attribute selectors as in a[href*="cycles-x"][href$=".tar.xz"], but pup doesn't, hence the grep filter.

You can use

grep -o 'https[^[:space:]"'"'"']*tar\.xz'

See the online demo.

Details

  • https - the literal string https
  • [^[:space:]"']* - zero or more characters other than whitespace, " and '
  • tar\.xz - the literal string tar.xz
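The same negated-class idea can be pushed a little further to pick out the single URL the question asks for. The following is only a sketch under a few assumptions: the href values are double-quoted (as in the sample), cycles-x is the branch of interest, and the html variable is a trimmed stand-in for the real page. Requiring the closing quote right after tar.xz is an addition to the pattern above. It skips the .tar.xz.sha256 links.

```shell
# Three links on one line: another branch, the wanted tarball, and its checksum.
other='<li><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz">x64</a></li>'
tarball='<li><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz">x64</a></li>'
checksum='<li><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz.sha256">x64</a></li>'
html="$other$tarball$checksum"

# [^[:space:]"]* cannot cross the quote that ends an href value, so the match
# stays inside one URL. The trailing " after tar\.xz rejects the .sha256 link;
# tr then strips that quote from the output.
url=$(printf '%s\n' "$html" \
  | grep -o 'https[^[:space:]"]*cycles-x[^[:space:]"]*\.tar\.xz"' \
  | tr -d '"')

printf '%s\n' "$url"
```

Because this uses only basic regular expressions, it should also work with grep implementations that lack -P.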

You could put a newline after each instance of '.tar.xz' with (GNU sed):

sed -i 's/\.tar\.xz/.tar.xz\n/g' your_file

Then remove everything up to and including 'href="' with:

sed -i 's/.*href="//' your_file

to change the file to this:

https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-debug.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-debug.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+override-recursive-resync.0d2c5bf06726-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+profiler-editor.ab200c6eddc6-linux.x86_64-release.tar.xz
https://builder.blender.org/download/experimental/blender-3.0.0-alpha+profiler-editor.ab200c6eddc6-linux.x86_64-release.tar.xz
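The two sed steps can also run as a pipeline that leaves the file untouched. This is a rough sketch assuming GNU sed (for \n in the replacement); the html variable is a trimmed stand-in for the page, and the -n ... p form and sort -u are additions that drop the leftover fragments and the duplicates caused by the .sha256 links.

```shell
# Three links on one line: another branch, a tarball, and the tarball's checksum.
other='<li><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+asset-browser-poselib.fba8de2e8688-linux.x86_64-release.tar.xz">x64</a></li>'
tarball='<li><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz">x64</a></li>'
checksum='<li><a href="https://builder.blender.org/download/experimental/blender-3.0.0-alpha+cycles-x.a117a9c63c3a-linux.x86_64-release.tar.xz.sha256">x64</a></li>'
html="$other$tarball$checksum"

# Step 1: break the line after every .tar.xz (GNU sed turns \n into a newline).
# Step 2: on lines that end in a URL, keep only the URL; -n plus p drops the rest.
# Step 3: sort -u removes the copy that came from the .tar.xz.sha256 link.
urls=$(printf '%s\n' "$html" \
  | sed 's/\.tar\.xz/&\n/g' \
  | sed -n 's/.*href="\(https[^"]*\.tar\.xz\)$/\1/p' \
  | sort -u)

printf '%s\n' "$urls"
```

A final grep -F 'cycles-x' on the result would narrow the list to one branch.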

Edit: @Wiktor Stribiżew has a better answer
