How to parse an HTML text file in the terminal?
I have a text file that, even after removing all the HTML tags, still contains HTML character references for apostrophes and other punctuation. For example:
It&#39;s // It's
My question is: how do I convert all of them?
I'm using a bash script under Linux to fetch the HTML file.
Alternatively, if you have lynx installed, use it as:
lynx -stdin -dump < file.html
The above will remove the HTML tags too. For example, given this file.html:
<i>It&#39;s</i>
&lt;<b>&amp;</b>&gt;
it prints:
It's <&>
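If lynx is not available, roughly the same result can be sketched with Python 3's standard library. This is only a minimal sketch, not a replacement for lynx: the names TextExtractor and html_to_text are made up for illustration, no layout is performed, and the text inside <script> or <style> elements is kept rather than dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text content of a document, discarding the tags."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &#39; or &amp; before the text reaches handle_data().
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup: str) -> str:
    """Return the markup with tags removed and character references decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)
```

Calling html_to_text on the contents of file.html returns the text with the tags stripped and the references decoded. Unlike lynx, it leaves line breaks and other whitespace exactly as they appear in the input.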
$ echo 'It&#39;s' | python -c 'import xmllib,sys; print(xmllib.XMLParser().translate_references(sys.stdin.read()))'
It's
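Note that the xmllib module exists only in Python 2 and was removed in Python 3. Under Python 3, a comparable filter can be sketched with html.unescape from the standard library. The function name decode_entities is just an illustrative choice.

```python
import html
import sys


def decode_entities(text: str) -> str:
    """Replace HTML character references (e.g. &#39; or &amp;) with the
    characters they stand for. Tags, if any, are left untouched."""
    return html.unescape(text)


if __name__ == "__main__":
    # Filter mode: read everything from stdin, write the decoded text to stdout.
    sys.stdout.write(decode_entities(sys.stdin.read()))
```

Saved under a name of your choosing, the script can be used in a bash pipeline the same way as the one-liner above, reading the text on stdin and writing the decoded text to stdout.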
$ echo 'It&#39;s' | perl -MHTML::Entities -pe 'decode_entities($_);'
It's