Extract the title of an HTML file using grep
cat 1.html | grep "<title>" > title.txt
This grep statement is not working.
Please tell me the best way to grab the title of a page using grep or sed.
Thanks.
sed -n 's/<title>\(.*\)<\/title>/\1/Ip' 1.html
The combination of -n and p prints only the lines that match.
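To make the effect of -n and p concrete, here is a small sketch run against a throwaway file. The file name sample.html and its contents are made up for illustration, and the I flag (case-insensitive matching) is a GNU sed extension, so this assumes GNU sed.

```shell
# Create a hypothetical sample page (name and contents are illustrative only).
tmpdir=$(mktemp -d)
printf '%s\n' '<html>' '<head>' '<TITLE>My Page</TITLE>' '</head>' '</html>' > "$tmpdir/sample.html"

# -n suppresses sed's default printing of every line; the p flag prints only
# the lines where the substitution succeeded. I (GNU sed) ignores case.
title=$(sed -n 's/<title>\(.*\)<\/title>/\1/Ip' "$tmpdir/sample.html")
echo "$title"

rm -rf "$tmpdir"
```

Without -n, sed would also print every non-matching line unchanged, so the title would be buried in the rest of the page.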
You can use awk. This works even for a title that spans multiple lines:
$ cat file
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Extract Title of a html file
using grep - Stack Overflow</title>
<link rel="stylesheet" type="text/css" href="http://sstatic.net/stackoverflow/all.css?v=9ea1a272f146">
$ awk -vRS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}' file
Extract Title of a html file using grep - Stack Overflow
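A multi-character RS is treated as a regular expression by gawk and mawk, but POSIX leaves its behavior unspecified. As a rough alternative for other awk implementations, the sketch below flattens the file with tr and then extracts the title with sed. The sample file is recreated in a temporary directory for illustration, and the approach assumes the page contains a single title element.

```shell
# Hypothetical sample with a title split over two lines.
tmpdir=$(mktemp -d)
printf '%s\n' '<html>' '<head>' \
    '<title>Extract Title of a html file' \
    'using grep - Stack Overflow</title>' \
    '</head>' > "$tmpdir/file"

# Join all lines with spaces, then capture the text between the tags.
# The extra echo restores a final newline so sed sees a complete line.
title=$( { tr '\n' ' ' < "$tmpdir/file"; echo; } \
    | sed -n 's/.*<title>\(.*\)<\/title>.*/\1/p' )
echo "$title"

rm -rf "$tmpdir"
```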
You can use xml_grep from the XML::Twig Perl package:
xml_grep --text_only title 1.html
grep "<title>" /path/to/html.html
Works fine for me. Are you sure 1.html is in your current working directory? Run pwd to check.
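The check suggested here can also be scripted. The sketch below is one way to do it; check_html is a hypothetical helper name, not part of grep or any other tool mentioned in this thread.

```shell
# check_html is a hypothetical helper: it reports whether the given file
# exists and is readable, and shows the current directory if it is not.
check_html() {
    if [ -r "$1" ]; then
        echo "found: $1"
    else
        echo "not found or unreadable: $1 (current directory: $(pwd))" >&2
        return 1
    fi
}

# Run grep only when the file is actually there.
if check_html 1.html; then
    grep "<title>" 1.html > title.txt
fi
```

If the message shows an unexpected directory, either cd to the right place or pass grep the full path to the file.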
Alex Hovansky's answer is good enough, although there is a chance that the HTML is not well formed, in which case xml_grep would crash.
I recommend using tidy to convert the HTML to XML first, then running xml_grep on the result:
tidy -asxml -utf8 html_file.html > out.xml
xml_grep 'xpath_expression' out.xml
cat 1.html | grep -oE "<title>.*</title>" | sed 's/<title>//' | sed 's/<\/title>//'
grep with -oE extracts only the title element, and sed then removes the HTML tags.
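As a small illustration of why -o helps, the sketch below runs the same idea on a line that contains other markup besides the title. The sample file name and contents are invented for the example, the two sed calls are merged into one with -e, and -o is assumed to be available (it is in GNU and BSD grep, though not required by POSIX).

```shell
# Hypothetical one-line sample where the title shares a line with other tags.
tmpdir=$(mktemp -d)
printf '%s\n' '<head><title>My Page</title><meta charset="utf-8"></head>' > "$tmpdir/sample.html"

# grep -o prints only the matched part, so <head> and <meta> are dropped;
# sed then strips the opening and closing title tags.
title=$(grep -oE '<title>.*</title>' "$tmpdir/sample.html" \
    | sed -e 's/<title>//' -e 's/<\/title>//')
echo "$title"

rm -rf "$tmpdir"
```

Like the original pipeline, this only works when the opening and closing tags are on the same line.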