How do I parse an HTML text file in the terminal?

Question

I have a text file that, even after removing all the HTML tags, still contains HTML character codes for apostrophes and other punctuation. For example:

  It&#039;s  // It's

My question is: how do I convert all of them back to the plain characters?

I'm using a bash script under Linux to fetch the HTML file.

Answer 1

Alternatively, if you have lynx installed, use it like this:

lynx -stdin -dump < file.html

The above also removes the HTML tags. For example, given this file.html:

<i>It&#039;s</i>
&lt;<b>&amp;</b>&#62;

it prints:

   It's <&>
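Where lynx is not installed, roughly the same effect can be sketched with Python 3's standard library. This is only an approximation, not what lynx does internally: the class name TextExtractor and the function html_to_text are made up for illustration, and the sketch keeps the original line breaks instead of laying the text out the way lynx does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags; tags themselves are discarded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode &#039;, &amp;, etc.
        # before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(markup):
    """Return markup with tags removed and character references decoded."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


# Same input as file.html above.
print(html_to_text("<i>It&#039;s</i>\n&lt;<b>&amp;</b>&#62;"))
```

Unlike lynx, this does not skip the contents of script or style elements, so it suits simple fragments better than full pages.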

Answer 2

Using Python 2 (the xmllib module was removed in Python 3):

$ echo 'It&#039;s' | python -c 'import xmllib,sys; print(xmllib.XMLParser().translate_references(sys.stdin.read()))'
It's
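On Python 3 the same decoding can be done with html.unescape from the standard library. The following is a minimal sketch rather than a drop-in replacement for the one-liner above; the helper names decode_entities and decode_stream are made up for illustration. It only decodes character references and leaves any remaining tags untouched.

```python
import html
import io
import sys


def decode_entities(text):
    """Replace references such as &#039; or &amp; with the characters they stand for."""
    return html.unescape(text)


def decode_stream(src, dst):
    """Copy src to dst line by line, decoding character references on the way."""
    for line in src:
        dst.write(decode_entities(line))


# Demonstration on an in-memory file.
sample = io.StringIO("It&#039;s\n&lt;&amp;&#62;\n")
decode_stream(sample, sys.stdout)
```

To use it as a filter from a bash script, call decode_stream(sys.stdin, sys.stdout) instead of the demonstration lines and pipe the file into the script.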

Using Perl:

$ echo 'It&#039;s' | perl -MHTML::Entities -pe 'decode_entities($_);'
It's

Answer 1
2 2017-12-02 23:55:54

Answer 2
1 Accepted 2017-12-02 21:31:33
