

Understanding sed/awk ^ , and [ ]

First, is there a better command-line way on Linux to process text from an HTML page downloaded with wget than sed and awk? If so, please tell me or link to the documentation.

Second, I'm confused about the following expression. Since ^ searches from the beginning of a new line, why does gsub(/[^a-z]]*/, " ") replace non-letter characters with a blank, and what does the comma , do here? And why is there an unmatched ] in this expression?

As for processing HTML, you need to describe what you want the processing to do.

The ^ character means 'beginning of line' when it is outside a character class (e.g. the first ^ in the regex /^[^a-z]/ ). When it is the first character inside a character class (enclosed in square brackets, [ ] ), it is a metacharacter meaning 'any character except the following ones'.
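To make the two roles of ^ concrete, here is a minimal sketch. The function name starts_with_non_lower and the sample input are made up for illustration, and LC_ALL=C is set on the assumption that a-z should cover only the ASCII lowercase letters.

```shell
# Print only the lines whose first character is NOT a lowercase letter.
# The first ^ (outside the brackets) anchors the match to the start of the line;
# the second ^ (first character inside the brackets) negates the class a-z.
starts_with_non_lower() {
  LC_ALL=C awk '/^[^a-z]/'
}

printf 'alpha\nBeta\n9lives\n' | starts_with_non_lower
# prints:
# Beta
# 9lives
```

The line alpha is dropped because its first character is in a-z; an empty line would be dropped too, since the class must match one character.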

The gsub function is a global search and replace operation:

gsub(/[^a-z]]*/, " ")

means 'replace any character that is not in a-z, together with zero or more close square brackets that follow it, with a blank' (the string in the double quotes, " " ). The comma is an argument separator, separating the regex argument from the replacement string argument. The second close square bracket in the regex is surprising; it could easily be a mistake.
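A small experiment shows what the stray ] does in practice. This is only a sketch: the function names blank_original and blank_runs and the sample strings are invented here, and the /[^a-z]+/ variant is just one guess at what might have been intended, not something the question confirms.

```shell
# The regex exactly as written in the question:
# one non-lowercase character, then zero or more literal ] characters.
blank_original() {
  LC_ALL=C awk '{ gsub(/[^a-z]]*/, " "); print }'
}

# A hypothetical alternative: collapse each run of non-lowercase characters.
blank_runs() {
  LC_ALL=C awk '{ gsub(/[^a-z]+/, " "); print }'
}

printf 'ab1]]cd\n' | blank_original   # "1]]" is a single match  -> ab cd
printf 'ab12cd\n'  | blank_original   # "1" and "2" match apart  -> ab  cd
printf 'ab12cd\n'  | blank_runs       # "12" is a single match   -> ab cd
```

With the original regex, each non-letter is replaced separately unless it happens to be followed by ] characters, which is why the extra bracket looks accidental.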

Because there is no third argument to the gsub function, it operates on $0 , the current input line.
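For contrast, the optional third argument restricts the substitution to a single variable or field. The sketch below uses made-up function names and input, and replaces with an underscore rather than a blank only so that the result is easy to see.

```shell
# No third argument: gsub works on the whole record, $0.
mark_whole_line() {
  LC_ALL=C awk '{ gsub(/[^a-z]/, "_"); print }'
}

# Third argument given: only the second field is modified.
mark_field2() {
  LC_ALL=C awk '{ gsub(/[^a-z]/, "_", $2); print }'
}

printf 'A1 b2 C3\n' | mark_whole_line   # -> ___b____
printf 'A1 b2 C3\n' | mark_field2       # -> A1 b_ C3
```

In the first call the spaces between fields are replaced as well, because they are part of $0.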


 