Regular expression in bash or sed

I have a regular expression (PHP) to clean a string read from a file:

return  preg_replace('/[^A-Za-z0-9  \n \)\(\,\%\\@\!?\#\&\;\'\"\-\+.\/"]/','', $string);

I'm using Ubuntu and want to clean the file's content using bash or sed. How can I do this? Thanks!

Remove non-ASCII characters

You appear to simply want to strip out non-ASCII characters (though you're missing each of $*:<=>[]^_`{|}~ and I don't know if that's intentional). There are several ways to do this, including a command written for this express purpose.

  • strings FILENAME
  • tr -cd '[\t\r\n -~]' < FILENAME
  • sed 's/[^\t\r\n -~]//g' FILENAME

The strings utility does this automatically and is great for quickly checking the contents of a binary file with safe output for the terminal. You may dislike the way it separates blocks of text with line breaks.

The other two commands take a list of characters (including ranges by character code) and remove everything that is not in it. In tr (short for "translate"), the -c option takes the complement of the list and -d deletes matches rather than translating them. In sed (short for "stream editor"), I'm running an s/// substitution on an inverted character set like the one you used in your PHP code and replacing each match (the /g flag matches globally) with an empty string.
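As a quick check of the tr and sed commands above, here is a minimal sketch run against a made-up sample string. The variable names and sample text are mine, and I'm assuming GNU tr and GNU sed (as shipped with Ubuntu). LC_ALL=C is my addition so that both tools treat the input as raw bytes regardless of the current locale.

```shell
# Hypothetical sample: "café naïve", a tab, then "tab", encoded as UTF-8.
sample=$(printf 'caf\303\251 na\303\257ve\ttab')

# tr: -c complements the list, -d deletes whatever matches the complement.
clean_tr=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd '[\t\r\n -~]')

# sed: replace every character outside the set with nothing, globally.
clean_sed=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[^\t\r\n -~]//g')

printf '%s\n' "$clean_tr" "$clean_sed"
```

Both variables should end up holding the same text, with the accented letters gone and the tab preserved.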

The character set (okay, technically that's not the right term for tr usage, e.g. you can't negate it like [^…], which is why we use tr -c) calls out a few whitespace characters (tab, carriage return, line feed) and then specifies the range of characters from space ( ) to tilde (~), covered by the codes U+0020 to U+007E.

You may run across [!-~] as well. That range covers all visible ASCII characters. Whitespace characters fall outside it, which is why I had to name them explicitly, though at least the space character (U+0020) immediately precedes the exclamation mark (!, U+0021), so I could just lump it into our range.
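To see the difference between the two ranges, here is a small sketch with a made-up input (the variable names are mine). The only change between the two commands is whether the range starts at ! or at space.

```shell
# Hypothetical input containing one space among visible characters.
sample='a b!c~d'

# '!-~' keeps only visible ASCII, so the space is deleted.
visible_only=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd '!-~')

# ' -~' starts one code point earlier, so the space survives.
with_space=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd ' -~')

printf '%s\n' "$visible_only" "$with_space"
```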

Remove just your listed characters

This requires preserving the list, though I can collapse it by taking advantage of any contiguous character codes:

sed 's/[^\t\r\n -#%-)+-9;?-Z\\a-z]//g' FILENAME
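One way to sanity-check the collapsed ranges is to feed the command the characters the PHP pattern lists and the punctuation it leaves out. This is only a sketch: the helper name keep_listed and both test strings are mine, and it assumes GNU sed with LC_ALL=C added so the ranges are compared byte by byte.

```shell
# Characters the PHP pattern keeps (letters and digits sampled at range ends).
listed='AZaz09 )(,%@!?#&;'"'"'"-+./'
# ASCII punctuation the PHP pattern leaves out, interleaved with letters.
omitted='a$b*c:d<e=f>g[h]i^j_k`l{m|n}o~p'

# Hypothetical helper wrapping the sed command from the answer.
keep_listed() { LC_ALL=C sed 's/[^\t\r\n -#%-)+-9;?-Z\\a-z]//g'; }

kept=$(printf '%s\n' "$listed" | keep_listed)
dropped=$(printf '%s\n' "$omitted" | keep_listed)

printf '%s\n' "$kept" "$dropped"
```

If the ranges are right, the first string comes back unchanged and the second is reduced to its letters.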

Explanation of above regex. Compare it to your regex or to the more comprehensive non-ASCII regex from the previous section (I added Latin-1 Supplement to that last link's test set so you can see that it actually matches something).

In place

If you want to save to the same file, you can run sed -i COMMAND FILENAME using either of the s/// commands listed above.
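Since an in-place edit overwrites the file, it may be worth keeping a backup. The sketch below assumes GNU sed, whose -i option accepts an optional suffix (here .bak) for the backup copy. The scratch file and variable names are mine, and LC_ALL=C is again my addition.

```shell
# Hypothetical scratch file so the example never touches real data.
tmpfile=$(mktemp)
printf 'caf\303\251\n' > "$tmpfile"

# -i.bak edits the file in place and keeps the original as "$tmpfile.bak".
LC_ALL=C sed -i.bak 's/[^\t\r\n -~]//g' "$tmpfile"

cleaned=$(cat "$tmpfile")
original_size=$(wc -c < "$tmpfile.bak")

# Remove the scratch file and its backup.
rm -f "$tmpfile" "$tmpfile.bak"

printf '%s\n' "$cleaned"
```

The backup keeps the original bytes, so you can compare or restore it if the pattern removed more than you intended.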
