Bash sed command issue

Question

I am trying to further parse an output file that I generated with an additional grep command. The code I am currently using is:

#!/bin/bash

# fetches the links of the movie's imdb pages for a given actor

# fullname="USER INPUT"
read -p "Enter fullname: " fullname

if [ "$fullname" = "Charlie Chaplin" ]; then
  code="nm0000122"
else
  code="nm0000050"
fi


curl "https://www.imdb.com/name/$code/#actor" |
  grep -Eo 'href="/title/[^"]*' |
  sed 's#^.*href=\"/#https://www.imdb.com/#g' |
  sort -u > imdb_links.txt

# Parses each link in the links text file and gets the details for
# each movie. This is followed by the cleaning process.
for i in $(cat imdb_links.txt) 
do 
   curl $i | 
   html2text | 
   sed -n '/Sign_In/,$p'|  
   sed -n '/YOUR RATING/q;p' | 
   head -n-1 | 
   tail -n+2 
done > imdb_all.txt

A sample of the generated output is:

EN
⁰
    * Fully supported
    * English (United States)
    * Partially_supported
    * FranÃ§ais (Canada)
    * FranÃ§ais (France)
    * Deutsch (Deutschland)
    * à¤¹à¤¿à¤‚à¤¦à¥€ (à¤à¤¾à¤°à¤¤)
    * Italiano (Italia)
    * PortuguÃªs (Brasil)
    * EspaÃ±ol (EspaÃ±a)
    * EspaÃ±ol (MÃ©xico)
****** Duck Soup ******
    * 19331933
    * Not_RatedNot Rated
    * 1h 9m
IMDb RATING
7.8/10

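How can I change the code to parse the output further, so that I only get the data from the movie title through the IMDb rating (in this case, from the line containing the title "Duck Soup" to the end)?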

Answer 1

Here is the code:

#!/bin/bash

# fullname="USER INPUT"
read -p "Enter fullname: " fullname

if [ "$fullname" = "Charlie Chaplin" ]; then
  code="nm0000122"
else
  code="nm0000050"
fi

rm -f imdb_links.txt

curl "https://www.imdb.com/name/$code/#actor" |
  grep -Eo 'href="/title/[^"]*' |
  sed 's#^href="#https://www.imdb.com#g' |
  sort -u |
while IFS= read -r link; do
   # uncomment the next line to save links into file:
   #echo "$link" >>imdb_links.txt

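   # First sed: keep the lines from "Sign_In" through "YOUR RATING".
   # Second sed: drop the last line, then print from the
   # "****** Title ******" line to the end.
   curl "$link" |
     html2text -utf8 |
     sed -n '/Sign_In/,/YOUR RATING/ p' |
     sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'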
done >imdb_all.txt
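
To see what the last sed stage does without hitting the network, the filter can be tried on a few lines shaped like the sample output from the question. This is only a sketch: the function name trim_to_title and the shortened sample text are made up for illustration, and it assumes the title line is wrapped in six asterisks on each side, as in the sample.

```shell
#!/bin/bash

# Hypothetical helper wrapping the second sed stage from the answer:
#   $d                        delete the last input line (the "YOUR RATING" marker)
#   /^\*\{6\}.*\*\{6\}$/,$ p  print from the "****** Title ******" line to the end
trim_to_title() {
  sed -n '$d; /^\*\{6\}.*\*\{6\}$/,$ p'
}

# Shortened stand-in for what the first sed stage would pass along.
sample='EN
    * Fully supported
    * English (United States)
****** Duck Soup ******
    * 19331933
    * 1h 9m
IMDb RATING
7.8/10
YOUR RATING'

printf '%s\n' "$sample" | trim_to_title
```

On this input the language menu lines and the trailing "YOUR RATING" line are dropped, leaving the block that starts at "****** Duck Soup ******" and ends at "7.8/10".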

Answer 2

Please (!) read up on why parsing HTML with sed is a very bad idea.

What you are trying to do can be done with the HTML/XML/JSON parser xidel, in just 1 call!
In this example I will use Charlie Chaplin's IMDb page as the source.

Extracting all 94 "Actor" IMDb movie URLs:

$ xidel -s "https://www.imdb.com/name/nm0000122" -e '
  //div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href
'
/title/tt0061523/?ref_=nm_flmg_act_1
/title/tt0050598/?ref_=nm_flmg_act_2
/title/tt0044837/?ref_=nm_flmg_act_3
[...]
/title/tt0004288/?ref_=nm_flmg_act_94

There is no need to save these to a text file. Just use -f (--follow) instead of -e, and xidel will open all of them.

For an individual movie URL you could parse the HTML to get the text nodes you want...

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  //h1,
  //div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[1]/span,
  //div[@class="sc-94726ce4-3 eSKKHi"]/ul/li[3],
  (//div[@class="sc-7ab21ed2-2 kYEdvH"])[1]
'
A Countess from Hong Kong
1967
2h
6.0/10

...but with those class names I would say that is a rather fragile endeavor. Instead, I suggest parsing the JSON at the top of the HTML source, inside the <script> node:

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  parse-json(//script[@type="application/ld+json"])/(
    name,
    datePublished,
    duration,
    aggregateRating/ratingValue
  )
'
A Countess from Hong Kong
1967-03-15
PT2H
6

...or, to get output similar to the above:

$ xidel -s "https://www.imdb.com/title/tt0061523/?ref_=nm_flmg_act_1" -e '
  parse-json(//script[@type="application/ld+json"])/(
    name,
    year-from-date(date(datePublished)),
    substring(lower-case(duration),3),
    format-number(aggregateRating/ratingValue,"#.0")||"/10"
  )
'
A Countess from Hong Kong
1967
2h
6.0/10
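
For readers less familiar with XPath string functions, the two simplest conversions above can be mirrored in plain Bash. This is only a rough sketch of what year-from-date(date(...)) and substring(lower-case(...),3) do to the values shown; the function names year_of and format_duration are made up here, and the sketch assumes dates of the form YYYY-MM-DD and durations of the form PT..., as in the JSON output above.

```shell
#!/bin/bash

# Hypothetical helper: keep only the year of a YYYY-MM-DD date
# by stripping everything from the first "-" onward.
year_of() {
  printf '%s\n' "${1%%-*}"
}

# Hypothetical helper: lower-case a duration such as "PT1H50M"
# and drop its first two characters ("pt").
format_duration() {
  printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | cut -c3-
}

year_of "1967-03-15"
format_duration "PT2H"
format_duration "PT1H50M"
```

The three calls print 1967, 2h and 1h50m respectively, matching the corresponding lines of the xidel output.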

All combined:

$ xidel -s "https://www.imdb.com/name/nm0000122" \
  -f '//div[@id="filmo-head-actor"]/following-sibling::div[1]//a/@href' \
  -e '
    parse-json(//script[@type="application/ld+json"])/(
      name,
      year-from-date(date(datePublished)),
      substring(lower-case(duration),3),
      format-number(aggregateRating/ratingValue,"#.0")||"/10"
    )
  '
A Countess from Hong Kong
1967
2h
6.0/10
A King in New York
1957
1h50m
7.0/10
Limelight
1952
2h17m
8.0/10
[...]
Making a Living
1914
11m
5.5/10
