How to remove unused HTML codes from a file using Unix
We have an HTML source that is processed by an Informatica workflow. Between the two, a Unix script transforms the file.
For the past week Informatica has been reporting an "invalid format" error, because the file contains unused HTML character references (0-8, 14-31, etc.).
Example:
&#00; - &#08; Unused
&#11; - &#12; Unused
&#14; - &#31; Unused
&#127; - &#159; Unused
We need to handle this in Unix and remove the characters mentioned above from the HTML file before it is processed.
I have tried a sed command like this:
sed -e 's/\&\([^\amp;|^\apos;|^\quot;|^\lt;|^\gt;]\)/\&\1/g'
but it does not serve the purpose. Also, since there are so many unused references, they cannot all be hardcoded.
Could you please let me know how to proceed with this?
Here is a working (bash) solution that treats the encoded characters as strings. It is unclear whether your source is encoded or not, but this works if it is:
# join the references with \| (GNU sed), then drop the trailing \|
pat=$(printf '&#%s;\\|' {00..08} {11..12} {14..31} {127..159})
sed "s/${pat%??}//g"
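As an alternative to enumerating every reference, the same ranges (0-8, 11-12, 14-31, 127-159) can be written as one extended regular expression, which also catches references without the leading zero. The sketch below is a minimal illustration, not a tested production filter. The function names strip_unused_refs and strip_raw_controls are hypothetical and not from the original thread. It assumes a sed that supports -E. The second function covers the case where the control characters are present as raw bytes instead of encoded references. Hexadecimal references such as &#x1; are not handled.

```shell
# Sketch only: both function names are hypothetical helpers.

# Remove decimal character references for 0-8, 11-12, 14-31 and 127-159,
# with or without leading zeros (&#1; and &#01; are both removed).
# References such as &#9;, &#10;, &#13; and &#160; are left untouched.
strip_unused_refs() {
  sed -E 's/&#0*([0-8]|1[12]|1[4-9]|2[0-9]|3[01]|12[7-9]|1[34][0-9]|15[0-9]);//g' "$@"
}

# Remove the same control characters when they appear as raw bytes.
# Bytes 128-159 are deliberately left alone, because in UTF-8 input
# they can be part of valid multibyte sequences.
strip_raw_controls() {
  tr -d '\000-\010\013\014\016-\037\177'
}
```

A possible use inside the transformation script, with placeholder file names: strip_unused_refs input.html | strip_raw_controls > cleaned.html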