wget parsing of HTML title in bash failing on 404 errors

I'm building a GSA keyword list. I have a list of keywords and the URLs they are supposed to link to. I need to come up with a list of titles for the links. The best source I can think of is the title tag in each page's HTML.

Given a list formatted like this:

bash,PhraseMatch,http://stackoverflow.com/questions/tagged/bash,
html,PhraseMatch,http://stackoverflow.com/questions/tagged/html,
carreers,PhraseMatch,http://careers.stackoverflow.com/faq,

I want a list like this:

bash,PhraseMatch,http://stackoverflow.com/questions/tagged/bash,Newest 'bash' Questions
html,PhraseMatch,http://stackoverflow.com/questions/tagged/html,Newest 'html' Questions
carreers,PhraseMatch,http://careers.stackoverflow.com/faq,Stack Overflow Carreers 2.0

All it needs to do is look up the URL, get the title tag, and append it to the end of the line. Here is what I have so far:

{
for line in $( cut -d ',' -f 3 input.csv );
{
    wget --no-check-certificate --quiet -O - $line \
    | paste -sd ' ' - \
    | grep -o -e '<head[^>]*>.*</head>' \
    | grep -o -e '<title>.*</title>' \
    | cut -d '>' -f 2 \
    | cut -d '<' -f 1 \
    | cut -d '-' -f 1 \
    | tr -d '\t' \
    | sed 's| *\(.*\)|\1|g' \
    | paste -s -d '\n' - \
    ;
}
} | paste -d ' ' input.csv - > output.csv

The problem I am having is that some of the pages return various errors. In that case I get no data back, so no line is generated. When I use paste to merge the two streams, they aren't the same length.
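The misalignment is easy to reproduce offline. A minimal sketch, using made-up keywords and titles in place of real pages (the temporary file and the `merged` variable are only for illustration):

```shell
# Three input rows, but only two titles come back (the second page failed).
tmp=$(mktemp)
printf 'bash,\nhtml,\ncareers,\n' > "$tmp"

# paste pairs lines purely by position, so every title after the failed
# page shifts up by one row and the last row ends up with no title.
merged=$(printf 'Title for bash\nTitle for careers\n' | paste -d ' ' "$tmp" -)
rm -f "$tmp"

printf '%s\n' "$merged"
# bash, Title for bash
# html, Title for careers
# careers, 
```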

I'm looking for a way to check for empty data and return an empty line. Help?

Ignoring the issue of parsing HTML with a collection of command-line tools, you can substitute a fixed error string for the output of commands that don't complete. (I don't think I'm inserting the check at the right place in the pipeline, but hopefully you can make that correction):

set -o pipefail   # the pipeline fails if any stage fails, not only the last one
while IFS=, read -r first second url rest; do
    wget --no-check-certificate --quiet -O - "$url" |
      paste -sd ' ' - |
      grep -o -e '<head[^>]*>.*</head>' |
      grep -o -e '<title>.*</title>' |
      cut -d '>' -f 2 |
      cut -d '<' -f 1 |
      cut -d '-' -f 1 |
      tr -d '\t' |
      sed 's| *\(.*\)|\1|g' |
      paste -s -d '\n' - \
  || echo "<no output found>"   # If any part of the pipeline fails
done < input.csv | paste -d ' ' input.csv - > output.csv
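One way to sidestep the placement question is to capture each title in a variable and print exactly one output line per input line, so the two streams never have to be merged with paste. This is a minimal sketch, not a tested replacement. It assumes bash and that no field contains a comma. The helper names `extract_title`, `fetch` and `append_titles` are hypothetical and not from the original post.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; the names are not from the original post.

# Read HTML on stdin and print the text of the first <title> element,
# cut at the first '-' and trimmed. Prints nothing if no title is found.
extract_title() {
    tr '\n\t' '  ' |
      grep -o -i '<title[^>]*>[^<]*</title>' |
      head -n 1 |
      sed -e 's/<[^>]*>//g' -e 's/-.*//' -e 's/^ *//' -e 's/ *$//'
}

# Fetch a URL quietly; on failure (404, DNS error, ...) there is no output.
fetch() {
    wget --no-check-certificate --quiet -O - "$1" 2>/dev/null </dev/null
}

# Print every line of the CSV file "$1" with its title appended.
# A failed lookup leaves the title field empty instead of dropping the line.
append_titles() {
    local keyword match url rest title
    while IFS=, read -r keyword match url rest || [ -n "$keyword" ]; do
        title=$(fetch "$url" | extract_title) || title=""
        printf '%s,%s,%s,%s\n' "$keyword" "$match" "$url" "$title"
    done < "$1"
}
```

With these definitions loaded, `append_titles input.csv > output.csv` should write one line per input row, leaving the last field empty for any URL that could not be fetched.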
