
Workaround for 'cut' when it's piped a few lines with differing pattern from grep?

This is a work in progress and I'm looking for advice from someone with more knowledge (computers are a hobby of mine, not my major).

This script is meant to organize a directory of television shows (renaming each file to the convention S01E01.Title of Episode.ext and creating a symbolic link of the original name).

I enjoy writing this, and I don't expect others to dedicate too much of their time. I guess my biggest "stumpers" right now are:

  1. Fix the occurrence of "<a href=" in the grep+cut output.
  2. Grab the correct text block from the wiki page with awk (according to the season number).

    (also, if anything looks inefficient, please let me know-- I'm learning)

I've been on these forums left and right, making progress as I go, and I've exhausted most of the questions similar to mine. (These forums are 100% the reason I've gotten this far.)

## Find show name and season (directories nested: /show/season)
show1=$(cd .. ; pwd)
show="${show1##*/}"
season=("${PWD##*/}")

IFS=$'\n'

## Download list of episodes for given season
wget -q -O- --header\="Accept-Encoding: gzip" https://en.wikipedia.org/wiki/List_of_$show\_episodes | gunzip > tmp.html

## Working on first awk/sed command to grab textblock of only specific season
## grep command works great, except when episode is hyperlinked ('a href' tag gets cut)
if [ "$season" == 'Season 1' ]; then
        listing=( $(awk '/\(season_1\)/,/rellink/' tmp.html | grep "summary.*[\"]<" | cut -d'"' -f6) )
        unset IFS
elif [ "$season" == 'Season 2' ]; then
        listing=( $(awk '/\(season_2\)/,/rellink/' tmp.html | grep "summary.*[\"]<" | cut -d'"' -f6) )
        unset IFS
#..........................continued 20 times or so
fi

I've been making so many adjustments to the code above that this second half has to be completed afterwards; but it did work 90% before. The only problem was that it would name some files S01E05.ahref=.mkv if they were hyperlinked on the wikipedia page (because of cut).

## Parse filename for season/episode descriptor
## Rename file with season/episode and name from wikipedia database
for file in *
do
    name=$(ls "$file" | grep -o "S[0-9][0-9]E[0-9][0-9]")
    episode=$(ls "$file" | grep -o "E[0-9][0-9]")
        if [ "$episode" == 'E01' ]; then
                mv "$file" "$name.${listing[0]}.mkv"
                ln -s "$name.${listing[0]}.mkv" "$file"
                echo "Renamed '$file' and created a symbolic link."
        #..........................continued
        fi
done

Let me suggest a multi-platform web-scraping CLI that doesn't seem to get enough attention: xidel

It supports the XPath 2, CSS 3, XQuery 1, and JSONiq query languages.

Used with a simplified version of the scenario at hand, we get the following, which hopefully gives you a sense of how much easier scraping becomes:

#!/usr/bin/env bash

# Example values.
show='Archer'
season=2

# Synthesize the URL to scrape.
url="http://en.wikipedia.org/wiki/List_of_${show}_episodes"

# The XPath expression for extracting the specified season's episode titles
qryEpisodeTitles="//*[matches(@id, '^Season_$season')]
/../following-sibling::table[1]//td[@class='summary']"

# Scrape the page at the URL and read all episode titles 
# (including enclosing " chars.) into an array.
IFS=$'\n' read -d '' -ra episodeTitles <<<"$(xidel -e "$qryEpisodeTitles" "$url")"

# Enumerate all episode titles with an index.
# Note: Typically, episodes are enclosed in literal `"` chars.; additionally,
#       after the closing `"`, they may contain footnote references, such as
#       `[2]` or `†`, so some cleaning-up may be required.
i=0
for episodeTitle in "${episodeTitles[@]}"; do 
  echo "Episode $((++i)): $episodeTitle"
done
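
The clean-up mentioned in the note above could be sketched in plain bash as follows. This is only an illustration: clean_title is a hypothetical helper (not part of xidel), and the sample inputs are made up to match the shapes described in the note (a quoted title optionally followed by a footnote marker).

```shell
#!/usr/bin/env bash

# clean_title is a hypothetical helper, not part of xidel.
# It keeps only the text between the first pair of double quotes,
# which drops trailing footnote markers such as [2] or †.
clean_title() {
  local raw=$1
  if [[ $raw == *\"*\"* ]]; then
    raw=${raw#*\"}     # drop everything up to and including the first "
    raw=${raw%%\"*}    # drop the closing " and everything after it
  fi
  printf '%s\n' "$raw"
}

# Made-up sample input in the shape described in the note.
clean_title '"Swiss Miss"[2]'
clean_title '"Blood Test"†'
clean_title 'Untitled'
```

Input without a pair of quotes is passed through unchanged, so the helper should be safe to apply to every array element.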

I agree with the comments that bash isn't the way to go when parsing webpages or HTML. But if you've already started and want to do it in bash, it's not impossible. Looking at your code, I like your use of bash substitution and globbing, but I'm a bit confused about how it all fits together, so I wrote a simple version of my own that you can hopefully adapt or work off of.

#!/bin/bash

show="Archer"
url="http://en.wikipedia.org/wiki/List_of_${show}_episodes"

while read line; do
  [[ $line =~ "<h3><span class=\"mw-headline\" id=\"Season" ]] && episode= && ((season+=1))
  if [[ $line =~ "<td class=\"summary\" style=\"text-align: left;\">\""(.*)"\"" ]]; then

    title="${BASH_REMATCH[1]}"
    [[ "$title" =~ "title=\""(.*)"\"" ]] && title="${BASH_REMATCH[1]}"
    title="${title%%\"*}"
    title="$(echo ${title/($show)/})"

    echo "Season [$season] Episode [$((episode+=1))] Title [$title]"
  fi
done < <(wget -qO- "$url")

Example output (also tested with Scrubs and The Simpsons, with correct results):

Season [1] Episode [1] Title [Mole Hunt]
Season [1] Episode [2] Title [Training Day]
Season [1] Episode [3] Title [Diversity Hire]
Season [1] Episode [4] Title [Killing Utne]
Season [1] Episode [5] Title [Honeypot]
Season [1] Episode [6] Title [Skorpio]
Season [1] Episode [7] Title [Skytanic]
Season [1] Episode [8] Title [The Rock]
Season [1] Episode [9] Title [Job Offer]
Season [1] Episode [10] Title [Dial M for Mother]
Season [2] Episode [1] Title [Swiss Miss]
Season [2] Episode [2] Title [A Going Concern]
Season [2] Episode [3] Title [Blood Test]
Season [2] Episode [4] Title [Pipeline Fever]
Season [2] Episode [5] Title [The Double Deuce]
Season [2] Episode [6] Title [Tragical History]
Season [2] Episode [7] Title [Movie Star]
Season [2] Episode [8] Title [Stage Two]
Season [2] Episode [9] Title [Placebo Effect]
Season [2] Episode [10] Title [El Secuestro]
Season [2] Episode [11] Title [Jeu Monégasque]
Season [2] Episode [12] Title [White Nights]
Season [2] Episode [13] Title [Double Trouble]
Season [3] Episode [1] Title [Heart of Archness: Part I]
Season [3] Episode [2] Title [Heart of Archness: Part II]
Season [3] Episode [3] Title [Heart of Archness: Part III]
Season [3] Episode [4] Title [The Man from Jupiter]
Season [3] Episode [5] Title [El Contador]
Season [3] Episode [6] Title [The Limited]
Season [3] Episode [7] Title [Drift Problem]
Season [3] Episode [8] Title [Lo Scandalo]
Season [3] Episode [9] Title [Bloody Ferlin]
Season [3] Episode [10] Title [Crossing Over]
Season [3] Episode [11] Title [Skin Game]
Season [3] Episode [12] Title [Space Race]
Season [3] Episode [13] Title [Space Race]
Season [4] Episode [1] Title [Fugue and Riffs]
Season [4] Episode [2] Title [The Wind Cries Mary]
Season [4] Episode [3] Title [Legs]
Season [4] Episode [4] Title [Midnight Ron]
Season [4] Episode [5] Title [Viscous Coupling]
Season [4] Episode [6] Title [Once Bitten]
Season [4] Episode [7] Title [Live and Let Dine]
Season [4] Episode [8] Title [Coyote Lovely]
Season [4] Episode [9] Title [The Honeymooners]
Season [4] Episode [10] Title [Un Chien Tangerine]
Season [4] Episode [11] Title [The Papal Chase]
Season [4] Episode [12] Title [Sea Tunt: Part I]
Season [4] Episode [13] Title [Sea Tunt: Part II]
Season [5] Episode [1] Title [White Elephant]
Season [5] Episode [2] Title [Archer Vice: A Kiss While Dying]
Season [5] Episode [3] Title [Archer Vice: A Debt of Honor]
Season [5] Episode [4] Title [Archer Vice: House Call]
Season [5] Episode [5] Title [Archer Vice: Southbound and Down]

Explanation:

I find BASH_REMATCH useful in a lot of cases like this where you have to match substrings and don't want to figure out some crazy regex.

BASH_REMATCH
  An array variable whose members are assigned by the =~ binary operator to the [[ conditional  command.   The  element
  with  index  0  is the portion of the string matching the entire regular expression.  The element with index n is the
  portion of the string matching the nth parenthesized subexpression.  This variable is read-only.
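
As a minimal illustration of the indices described in that excerpt, here is a sketch that pulls the season and episode numbers out of a filename. The filename and the variable names are made up for the example.

```shell
#!/usr/bin/env bash

# Made-up sample filename, used only for illustration.
file='Archer.S02E05.720p.mkv'

# The pattern is kept in a variable so the parentheses act as regex groups.
re='S([0-9]{2})E([0-9]{2})'

if [[ $file =~ $re ]]; then
  whole=${BASH_REMATCH[0]}                # the entire match
  season_no=$((10#${BASH_REMATCH[1]}))    # first group; 10# avoids octal parsing of "08"
  episode_no=$((10#${BASH_REMATCH[2]}))   # second group
  echo "match=$whole season=$season_no episode=$episode_no"
fi
```

Storing the regex in a variable and expanding it unquoted on the right-hand side of =~ avoids the quoting gymnastics needed when the pattern is written inline.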

Otherwise, the main issue was, as you noted, that the title format can differ. So I ran a second BASH_REMATCH for the case where the title is hyperlinked (the a href tag then carries a title attribute), and stripped the trailing text that shows up in odd cases, such as when the episode hasn't come out yet. There may be other cases, but this worked on all 3 shows I tested.
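
The two-step match can be tried offline against sample lines. The sketch below is a variation on the loop body, not the exact code: extract_title is a hypothetical helper, the regexes are kept in variables, and the two HTML lines are made up to resemble a plain cell and a hyperlinked cell. It does not decode HTML entities such as &amp;.

```shell
#!/usr/bin/env bash

# extract_title is a hypothetical helper; arguments are one HTML line
# and the show name (used to strip a " (Show)" disambiguation suffix).
extract_title() {
  local line=$1 show=$2 title
  local cell_re='<td class="summary"[^>]*>"(.*)"'
  local attr_re='title="([^"]*)"'
  local suffix=" ($show)"

  # Step 1: grab everything between the outer quotes of the summary cell.
  [[ $line =~ $cell_re ]] || return 1
  title=${BASH_REMATCH[1]}

  # Step 2: for a hyperlinked cell, prefer the link's title attribute
  # over the raw <a href=...> markup captured in step 1.
  if [[ $title =~ $attr_re ]]; then
    title=${BASH_REMATCH[1]}
  fi

  title=${title%%\"*}        # drop anything after a stray closing quote
  title=${title%"$suffix"}   # drop a trailing " (Show)" if present
  printf '%s\n' "$title"
}

# Made-up sample lines in the two shapes discussed above.
plain='<td class="summary" style="text-align: left;">"Mole Hunt"</td>'
linked='<td class="summary" style="text-align: left;">"<a href="/wiki/Training_Day_(Archer)" title="Training Day (Archer)">Training Day</a>"</td>'

extract_title "$plain" 'Archer'
extract_title "$linked" 'Archer'
```

This is also why cut -d'"' -f6 breaks on hyperlinked episodes: the extra quotes from the href attribute shift the field positions, whereas the regex approach does not depend on counting quotes.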

Thanks to BroSlow's help (and code!) the script is complete.

#!/bin/bash

show1=$(cd .. ; pwd)
show="${show1##*/}"
seas1="${PWD##*/}"
seas=$(echo $seas1 | grep -o "[0-9][0-9]*")
url=http://en.wikipedia.org/wiki/List_of_$show\_episodes
IFS=$'\n'

while read line; do
[[ $line =~ "<h3><span class=\"mw-headline\" id=\"Season" ]] && episode= && ((season+=1))
  if [[ $line =~ "<td class=\"summary\" style=\"text-align: left;\">\""(.*)"\"" ]]; then

  title="${BASH_REMATCH[1]}"
  [[ "$title" =~ "title=\""(.*)"\"" ]] && title="${BASH_REMATCH[1]}"
  title="${title%%\"*}"
  title="$(echo ${title/($show)/})"

  arrTitle+=( "${season}.${title}" )
  fi
done < <(wget -qO- "$url")

## Make new array of only specific season ($seas=current dirname).
## Strip the leading "<season>." prefix; unlike 'cut -d. -f2', this keeps titles that contain dots.
for i in "${arrTitle[@]}"; do
  if [[ $i == "$seas".* ]]; then
    arrNewTitle+=( "${i#*.}" )
  fi
done

n=-1

for file in *; do 
  n=$((n+1))    # a bare $((n+=1)) would try to run the result as a command
  name=$(grep -o "S[0-9][0-9]E[0-9][0-9]" <<< "$file") 
  mv "$file" "$name.${arrNewTitle[n]}.mkv"
  ln -s "$name.${arrNewTitle[n]}.mkv" "$file"
  echo "Renamed '$file' and created a symbolic link."
done

## Remove script when done, and its symbolic link (z to be at bottom of filelist)
rm 'z_rename.sh' '..mkv'
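
One possible refinement, sketched under assumptions: rather than relying on the glob order matching the episode order, the title can be looked up by the episode number found in the filename. new_name is a hypothetical helper, and the filename and titles below are sample values; files without an SxxEyy tag (such as the script itself) are skipped instead of being renamed.

```shell
#!/usr/bin/env bash

# new_name is a hypothetical helper. It prints the target filename for one
# file, or fails if the name has no SxxEyy tag or no title is known.
# Titles are passed as arguments after the filename; title 1 is episode 1.
new_name() {
  local file=$1; shift
  local titles=("$@")
  local re='(S[0-9]{2}E([0-9]{2}))'

  [[ $file =~ $re ]] || return 1
  local tag=${BASH_REMATCH[1]}
  local idx=$(( 10#${BASH_REMATCH[2]} - 1 ))   # 10# avoids octal parsing of "08"

  (( idx >= 0 && idx < ${#titles[@]} )) || return 1
  local ext=${file##*.}                        # keep the original extension
  printf '%s.%s.%s\n' "$tag" "${titles[idx]}" "$ext"
}

# Sample values only; in the script the titles would come from arrNewTitle.
new_name 'archer.S02E03.hdtv.mkv' 'Swiss Miss' 'A Going Concern' 'Blood Test'
```

In the rename loop this could be used as: target=$(new_name "$file" "${arrNewTitle[@]}") || continue, followed by the existing mv and ln -s lines with "$target" as the new name.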


 