wget parsing of HTML title in bash failing on 404 errors
I'm building a GSA keyword list. I have a list of keywords and the URLs they are supposed to link to, and I need to come up with a list of titles for the links. The best source I can think of is the title tag in each page's HTML.
Given a list formatted like this:
bash,PhraseMatch,http://stackoverflow.com/questions/tagged/bash,
html,PhraseMatch,http://stackoverflow.com/questions/tagged/html,
carreers,PhraseMatch,http://careers.stackoverflow.com/faq,
I want a list like this:
bash,PhraseMatch,http://stackoverflow.com/questions/tagged/bash,Newest 'bash' Questions
html,PhraseMatch,http://stackoverflow.com/questions/tagged/html,Newest 'html' Questions
carreers,PhraseMatch,http://careers.stackoverflow.com/faq,Stack Overflow Carreers 2.0
All it needs to do is fetch each URL, extract the title tag, and append it to the end of the line. Here is what I have so far:
{
    for line in $( cut -d ',' -f 3 input.csv );
    {
        wget --no-check-certificate --quiet -O - $line \
        | paste -sd ' ' - \
        | grep -o -e '<head[^>]*>.*</head>' \
        | grep -o -e '<title>.*</title>' \
        | cut -d '>' -f 2 \
        | cut -d '<' -f 1 \
        | cut -d '-' -f 1 \
        | tr -d ' ' \
        | sed 's| *\(.*\)|\1|g' \
        | paste -s -d '\n' - \
        ;
    }
} | paste -d ' ' input.csv - > output.csv
The problem I am having is that some of the pages return errors (404 and others). In that case I get no data back, so no line is generated for that URL. When I then use paste to merge the two streams, they don't have the same number of lines.
I'm looking for a way to check for empty data and return an empty line. Help?
Setting aside the problems with parsing HTML using a collection of command-line tools, you can substitute a fixed error string for the output of any command that doesn't complete. (I don't think I'm inserting the check at the right place in the pipeline, but hopefully you can make that correction):
set -o pipefail   # a pipeline now fails if any command in it fails, not only the last
while IFS=, read -r first second line rest; do   # line holds the third field, the URL
    wget --no-check-certificate --quiet -O - "$line" |
        paste -sd ' ' - |
        grep -o -e '<head[^>]*>.*</head>' |
        grep -o -e '<title>.*</title>' |
        cut -d '>' -f 2 |
        cut -d '<' -f 1 |
        cut -d '-' -f 1 |
        tr -d ' ' |
        sed 's| *\(.*\)|\1|g' |
        paste -s -d '\n' - \
        || echo "<no output found>"   # If any part of the pipeline fails
done < input.csv | paste -d ' ' input.csv - > output.csv
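If placing the check inside the pipeline turns out to be awkward, another option is to capture the pipeline's output in a variable and always print exactly one line per URL, falling back to the placeholder when the variable is empty. The following is only a minimal sketch of that idea, not the method from the answer above: the helper names `extract_title` and `title_line` are hypothetical, the title extraction is a simplified stand-in for the original grep/cut chain, and the demo uses `printf` and `false` in place of `wget` so it runs without network access.

```shell
#!/usr/bin/env bash
# Sketch: emit exactly one output line per input, even when fetching or parsing yields nothing.

# extract_title (hypothetical helper): reads HTML on stdin and prints the text
# of the first <title> element, with surrounding spaces trimmed.
extract_title() {
    tr '\n' ' ' |
    grep -o -i '<title[^>]*>[^<]*</title>' |
    head -n 1 |
    sed -e 's/<[^>]*>//g' -e 's/^ *//' -e 's/ *$//'
}

# title_line (hypothetical helper): runs the fetch command given as arguments,
# then prints the extracted title, or a placeholder if nothing was extracted.
title_line() {
    local title
    title=$("$@" 2>/dev/null | extract_title)
    printf '%s\n' "${title:-<no output found>}"
}

# Demo with local stand-ins for wget (assumption: no network available).
title_line printf '<html><head>\n<title> Newest bash Questions </title></head></html>\n'
title_line false   # simulates a failed download: no output, non-zero exit
```

In the real loop, the call would presumably look like `title_line wget --no-check-certificate --quiet -O - "$line"` inside the `while ... read` loop, with the loop's output piped into the same `paste` step as before. Because `title_line` prints one line whether or not a title was found, both inputs to `paste` keep the same number of lines.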