
What does this specific sed command do exactly? Using sed to parse full HTML pages in a bash script

subreddit=$(curl -sL "https://www.reddit.com/search/?q=${query}&type=sr"|tr "<" "\n"|
    sed -nE 's@.*class="_2torGbn_fNOMbGw3UAasPl">r/([^<]*)@\1@p'|gum filter)

I've been learning bash and have been making pretty good progress. One thing that still seems far too daunting is these complex sed commands. It's unfortunate, because I really want to use them to do things like parse HTML, but it quickly becomes a mess. This is a little snippet of a script that queries Reddit, pipes the result through sed, and returns just the names of the subreddits matching the search, one per line.

My main question is: what is this actually cutting/replacing, and what does the beginning part 's@.' mean?

What I tried:

I used curl to search for a subreddit name so that I could see the raw output from that command, then tried piping it into sed using little snippets of the full command to see if I could reconstruct the logic behind it. All I really figured out is that my knowledge of sed doesn't go much beyond basic replacements.

I'm trying to rewrite this script (for learning purposes only; the script works just fine), which lets you search Reddit and view the image posts in your terminal using Kitty. Most of it is pretty readable, but the sed commands trip me up.

I'll attach the full script below in case anyone is interested and I welcome any advice or explanations that could help me fully understand and re-construct it.

I'm really curious about this. I'm also wondering whether it would just be better to call a Python script from bash that returns the images using Beautiful Soup... or maybe using "htmlq" would be a better idea?

Thanks!

#!/bin/sh

get_input() {
  [ -z "$*" ] && query=$(gum input --placeholder "Search for a subreddit") || query=$*
  query=$(printf "%s" "$query"|tr ' ' '+')
  subreddit=$(curl -sL "https://www.reddit.com/search/?q=${query}&type=sr"|tr "<" "\n"|
    sed -nE 's@.*class="_2torGbn_fNOMbGw3UAasPl">r/([^<]*)@\1@p'|gum filter)
  xml=$(curl -s "https://www.reddit.com/r/$subreddit.rss" -A "uwu"|tr "<|>" "\n")

  post_href=$(printf "%s" "$xml"|sed -nE '/media:thumbnail/,/title/{p;n;p;}'|
    sed -nE 's_.*href="([^"]+)".*_\1_p;s_.*media:thumbnail[^>]+url="([^"]+)".*_\1_p; /title/{n;p;}'|
    sed -e 'N;N;s/\n/\t/g' -e 's/&amp;/\&/g'|grep -vE '.*\.gif.*')
  [ -z "$post_href" ] && printf "No results found for \"%s\"\n" "$query" && exit 1
}

readc() {
  if [ -t 0 ]; then
    saved_tty_settings=$(stty -g)
    stty -echo -icanon min 1 time 0
  fi
  eval "$1="
  while
    c=$(dd bs=1 count=1 2> /dev/null; echo .)
    c=${c%.}
    [ -n "$c" ] &&
        eval "$1=\${$1}"'$c
    [ "$(($(printf %s "${'"$1"'}" | wc -m)))" -eq 0 ]'; do
    continue
  done
  [ -t 0 ] && stty "$saved_tty_settings"
}

download_image() {
  downloadable_link=$(curl -s -A "uwu" "$1"|sed -nE 's@.*class="_3Oa0THmZ3f5iZXAQ0hBJ0k".*<a href="([^"]+)".*@\1@p')
  curl -s -A "uwu" "$downloadable_link" -o "$(basename "$downloadable_link")"
  [ -z "$downloadable_link" ] && printf "No image found\n" && exit 1
  tput clear && gum style \
      --foreground 212 --border-foreground 212 --border double \
        --align center --width 50 --margin "1 2" --padding "2 4" \
          'Your image has been downloaded!' "Image saved to $(basename "$downloadable_link")"
  # shellcheck disable=SC2034
  printf "Press Enter to continue..." && read -r useless
}

cleanup() {
  tput cnorm && exit 0
}

trap cleanup EXIT INT HUP
get_input "$@"

i=1 && tput clear
while true; do
  tput civis
  [ "$i" -lt 1 ] && i=$(printf "%s" "$post_href"|wc -l)
  [ "$i" -gt "$(printf "%s" "$post_href"|wc -l)" ] && i=1
  link=$(printf "%s" "$post_href"|sed -n "$i"p|cut -f1)
  post_link=$(printf "%s" "$post_href"|sed -n "$i"p|cut -f2)
  gum style \
    --foreground 212 --border-foreground 212 --border double \
    --align left --width 50 --margin "20 1" --padding "2 4" \
    'Press (j) to go to next' 'Press (k) to go to previous' 'Press (d) to download' \
    'Press (o) to open in browser' 'Press (s) to search for another subreddit' 'Press (q) to quit'
  kitty +kitten icat --scale-up --place 60x40@69x3 --transfer-mode file "$link"
  readc key
  # shellcheck disable=SC2154
  case "$key" in
    j) i=$((i+1)) && tput clear ;;
    k) i=$((i-1)) && tput clear ;;
    d) download_image "$post_link" ;;
    o) xdg-open "$post_link" || open "$post_link" ;;
    s) get_input ;;
    q) exit 0 && tput clear ;;
    *) ;;
  esac
done

"gum filter" is essentially a fuzzy finder like fzf, and "gum style" draws pretty text and nice boxes that work kind of like CSS.

What does this specific sed command do exactly?
sed -nE 's@.*class="_2torGbn_fNOMbGw3UAasPl">r/([^<]*)@\1@p'

It does two things:

  1. Select all lines that contain the literal string class="_2torGbn_fNOMbGw3UAasPl">r/.
  2. For those lines, print only the part after ...>r/.

Basically, it translates to... (written inefficiently on purpose)

grep 'class="_2torGbn_fNOMbGw3UAasPl">r/' |
sed 's/.*>r\///'
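To see the substitution in isolation, you can feed it a single made-up line of the kind produced by tr "<" "\n" (the sample subreddit name here is just for illustration):

```shell
# A hypothetical line, as it would look after splitting the HTML on "<":
line='h6 class="_2torGbn_fNOMbGw3UAasPl">r/linux'
# .* eats everything up to and including ">r/"; \1 keeps the captured name
printf '%s\n' "$line" |
  sed -nE 's@.*class="_2torGbn_fNOMbGw3UAasPl">r/([^<]*)@\1@p'
```

Lines that do not contain the class string are simply never printed: -n suppresses sed's default output, and the p flag only fires when the substitution succeeded.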

what does the beginning part mean 's@.'?

You are looking at the (beginning of the) s command, i.e. the substitution command. Normally it is written as s/search/replace/, but the delimiter / can be chosen (mostly) freely: s/…/…/ and s@…@…@ are equivalent.
Here, @ has the benefit that the / in …>r/ does not have to be escaped.
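A quick way to convince yourself that the delimiter is arbitrary (the pattern here is just a toy example):

```shell
# Same substitution with three different delimiters; all print a-b-c
printf 'a/b/c\n' | sed 's/\//-/g'   # / as delimiter: every literal / must be escaped
printf 'a/b/c\n' | sed 's@/@-@g'    # @ as delimiter: no escaping needed
printf 'a/b/c\n' | sed 's,/,-,g'    # , works just as well
```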

The . belongs to the search pattern. The .* at the beginning matches everything from the start of the line, so that it gets deleted by the substitution. Here we delete the beginning of the line up to (and including) …>r/.
The \1 in the replacement is a placeholder for the string that was matched by the group ([^<]*) (the longest <-free substring after …>r/).
That part is unnecessarily complicated. Because the sed is preceded by tr "<" "\n", there is no point in dealing with < inside sed. It could be simplified to

sed -n 's@.*class="_2torGbn_fNOMbGw3UAasPl">r/@@p'
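On a sample line like the (made-up) one below, the original command and the simplified one print the same subreddit name:

```shell
line='h6 class="_2torGbn_fNOMbGw3UAasPl">r/bash'   # hypothetical input line
# Original: capture group + backreference
printf '%s\n' "$line" |
  sed -nE 's@.*class="_2torGbn_fNOMbGw3UAasPl">r/([^<]*)@\1@p'
# Simplified: just delete everything up to and including ">r/"
printf '%s\n' "$line" |
  sed -n 's@.*class="_2torGbn_fNOMbGw3UAasPl">r/@@p'
```

Note the simplified version no longer needs -E, since it uses no extended-regex features.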

Speaking about simplifications:

I really want to use them to do things like parse HTML

Advice: Don't, at least not in general. For one-off jobs where you know the exact formatting of your HTML files it is OK, I guess. But in general, regexes are not powerful enough to reliably parse HTML.

Either way, it is easier to work with tools designed for the job. E.g. install libxml2 and use an XPath expression plus a bit of post-processing:

curl -sL "https://www.reddit.com/search/?q=QUERY&type=sr" |
xmllint --html --xpath '//h6[@class="_2torGbn_fNOMbGw3UAasPl"]/text()' - 2>/dev/null |
sed 's@^r/@@'
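As for the htmlq idea from the question: yes, a CSS-selector tool works too. A sketch, untested, assuming htmlq is installed and Reddit still serves that class name:

```shell
# --text prints the inner text of the selected nodes
curl -sL "https://www.reddit.com/search/?q=QUERY&type=sr" |
  htmlq --text 'h6._2torGbn_fNOMbGw3UAasPl' |
  sed 's@^r/@@'   # strip the leading "r/", as above
```

The big caveat with either approach is that the obfuscated class name is generated by Reddit's build tooling and can change at any time, which is why people usually prefer the JSON/RSS endpoints over scraping the HTML.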
