Understanding sed/awk ^ , and [ ]

First, is there a better Linux command-line way to process text from an HTML page downloaded with wget than sed and awk? If so, please tell me or link to the documentation.

Second, I'm confused about the following expression. Since ^ matches at the beginning of a line, why does gsub(/[^a-z]]*/, " ") replace non-letter characters with a blank, and what does the comma , do here? And why is there an unmatched ] in this expression?

As for processing the HTML, you would need to describe what you want the processing to do.

The ^ character means 'beginning of line' when it is not inside a character class (e.g. the leading ^ in the regex /^[^a-z]/ anchors the match to the start of the line). When it is the first character inside a character class (enclosed in square brackets, [ ]), it negates the class and means 'any character except the ones listed'.
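To see the two roles of ^ side by side, here is a minimal sketch. The function names starts_with_lower and only_lower are made up for illustration, and LC_ALL=C is set only on the assumption that a-z should mean plain ASCII lowercase letters regardless of locale.

```shell
# Hypothetical helpers for illustration; not part of the original question.

# Outside brackets: ^ anchors the match to the start of the line.
starts_with_lower() {
  printf '%s\n' "$1" | LC_ALL=C awk '/^[a-z]/ { print "yes"; next } { print "no" }'
}

# First inside brackets: ^ negates the class, so [^a-z] matches any single
# character that is NOT a lowercase letter. Here each such character is deleted.
only_lower() {
  printf '%s\n' "$1" | LC_ALL=C awk '{ gsub(/[^a-z]/, ""); print }'
}

starts_with_lower 'abc'    # prints: yes
starts_with_lower '1abc'   # prints: no
only_lower 'a1b2-c'        # prints: abc
```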

The gsub function is a global search and replace operation:

gsub(/[^a-z]]*/, " ")

means 'replace each character that is not in a-z, together with any close square brackets that immediately follow it, with a blank' (the string in the double quotes, " "). The comma is an argument separator, separating the regex argument from the replacement string argument. The first ] closes the character class, so the second ] is a literal close square bracket and ]* matches zero or more of them. That second bracket is surprising; it could easily be a mistake.
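The effect of the trailing ]* is easiest to see on input that actually contains close brackets. The sketch below compares the pattern from the question with the bare negated class; the function names blank_with_brackets and blank_each are hypothetical, and LC_ALL=C is an assumption made so that a-z means ASCII lowercase only.

```shell
# Hypothetical helpers for illustration.

# Pattern from the question: one non-letter, then any run of "]".
blank_with_brackets() {
  printf '%s\n' "$1" | LC_ALL=C awk '{ gsub(/[^a-z]]*/, " "); print }'
}

# For comparison: the negated class alone, without the trailing "]*".
blank_each() {
  printf '%s\n' "$1" | LC_ALL=C awk '{ gsub(/[^a-z]/, " "); print }'
}

blank_with_brackets 'ab1]]cd'   # "1]]" is a single match, so one blank
blank_each 'ab1]]cd'            # "1", "]" and "]" each become a blank
```

On input without any ] characters the two functions should produce the same output, which is consistent with the extra bracket being a typo.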

Because there is no third argument to the gsub function, it operates on $0, the current input line.
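As a rough illustration of the difference the third argument makes, the sketch below runs the same substitution once on the default target $0 and once on the second field only. The function names blank_line and blank_field2 are invented for this example, and it uses the simpler class /[^a-z]/ rather than the pattern from the question.

```shell
# Hypothetical helpers for illustration.

# No third argument: gsub edits $0, the whole input line.
blank_line() {
  printf '%s\n' "$1" | LC_ALL=C awk '{ gsub(/[^a-z]/, " "); print }'
}

# Third argument given: only that target (here field 2) is edited,
# and awk rebuilds $0 from the fields afterwards.
blank_field2() {
  printf '%s\n' "$1" | LC_ALL=C awk '{ gsub(/[^a-z]/, " ", $2); print }'
}

blank_line 'x1 y2z'     # the digit and the space in "x1 " are blanked too
blank_field2 'x1 y2z'   # "x1" is left alone; only "y2z" changes
```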
