I have a regular expression (PHP) that cleans a string read from a file:
return preg_replace('/[^A-Za-z0-9 \n \)\(\,\%\\@\!?\#\&\;\'\"\-\+.\/"]/','', $string);
I'm using Ubuntu and want to clean the file content using bash or sed. How can I do this? Thanks!
You appear to simply want to strip out non-ASCII characters (though you're missing each of $*:<=>[]^_`{|}~ and I don't know if that's intentional). There are several ways to do this, including a command written for this express purpose.
strings FILENAME
tr -cd '[\t\r\n -~]' < FILENAME
sed 's/[^\t\r\n -~]//g' FILENAME
The strings utility does this automatically and is great for quickly checking the contents of a binary file with safe output for the terminal. You may dislike the way it separates blocks of text with line breaks.
The other two commands take a list of characters (including ranges by character code) and remove everything that is not in it. In tr (short for "translate"), the -c option takes the complement of the list and the -d option deletes matches rather than translating them. In sed (short for "stream editor"), I'm running an s/// substitution on an inverted character set like the one you used in your PHP code and replacing each match (the /g flag matches globally) with an empty string.
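To see both filters side by side, here is a minimal sketch run on a made-up sample string rather than a file. The sample text and variable names are mine, not part of the question. I'm also assuming GNU sed (the default on Ubuntu), which understands \t, \r and \n inside brackets, and I prefix the commands with LC_ALL=C so the ranges are evaluated by byte value.

```shell
# Sample input: "café", a tab, then "ok" (the é is the two bytes \303\251).
sample=$(printf 'caf\303\251\tok')

# tr: -c complements the set, -d deletes everything in the complement.
tr_out=$(printf '%s\n' "$sample" | LC_ALL=C tr -cd '[\t\r\n -~]')

# sed: replace every character outside the set with an empty string.
sed_out=$(printf '%s\n' "$sample" | LC_ALL=C sed 's/[^\t\r\n -~]//g')

# Both lines should read "caf", a tab, then "ok": the é is gone, the tab stays.
printf '%s\n' "$tr_out" "$sed_out"
```

Note that the square brackets in the tr set are just literal characters; they happen to fall inside the space-to-tilde range anyway, so they do no harm.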
The character set (okay, technically that's not the right term for tr usage, e.g. you can't negate it like [^…], but that's why we use tr -c) calls out a few whitespace characters (tab, carriage return, line feed) and then specifies the range of characters from space ( ) to tilde (~), covered by the codes U+0020 to U+007E.
You may run across [!-~] as well. That's shorthand for all visible ASCII characters. Whitespace falls outside that range, which is why I had to name tab, carriage return, and line feed explicitly, though at least the space character (U+0020) immediately precedes the exclamation mark (!, U+0021), so I could just lump it into our range.
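The difference between the two ranges is easy to check on a short string. This is only an illustration; the sample text and variable names are made up.

```shell
line='a b!c'

# '!-~' starts at U+0021, so the space is outside the set and gets deleted.
without_space=$(printf '%s' "$line" | LC_ALL=C tr -cd '!-~')

# ' -~' starts at U+0020, so the space is kept.
with_space=$(printf '%s' "$line" | LC_ALL=C tr -cd ' -~')

printf '[%s]\n[%s]\n' "$without_space" "$with_space"
# prints:
# [ab!c]
# [a b!c]
```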
Matching your regex exactly requires preserving its list, though I can collapse it by taking advantage of any contiguous character codes:
sed 's/[^\t\r\n -#%-)+-9;?-Z\\a-z]//g' FILENAME
Explanation of the above regex. Compare it to your regex or to the more comprehensive non-ASCII regex from the previous section (I added Latin-1 Supplemental to that last link's test set so you can see that it actually matches something).
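As a quick sanity check of the collapsed list, here is a sketch that feeds it a made-up line containing a few of the ASCII characters your list leaves out (colon, dollar sign, angle brackets). The input string and variable names are my own, and I'm again assuming GNU sed with LC_ALL=C so the ranges follow byte order.

```shell
input='Total: $5 (50% off) <b>'

# Same expression as above, reading from a pipe instead of FILENAME.
cleaned=$(printf '%s\n' "$input" |
  LC_ALL=C sed 's/[^\t\r\n -#%-)+-9;?-Z\\a-z]//g')

printf '%s\n' "$cleaned"
# prints: Total 5 (50% off) b
```

The colon, dollar sign and angle brackets are dropped because they sit in the gaps between the ranges, while the parentheses and percent sign survive because %-) covers them.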
If you want to save to the same file, you can run sed -i COMMAND FILENAME using either of the s/// commands listed above.
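Since in-place editing overwrites the file, it may be worth trying it on a throwaway copy first. The sketch below uses a temporary file created with mktemp, and the .bak suffix on -i is my own addition: it asks sed to keep the original alongside the cleaned file. The file contents are made up for the demonstration.

```shell
# Create a scratch file containing "naïve text" (ï is the bytes \303\257).
tmpfile=$(mktemp)
printf 'na\303\257ve text\n' > "$tmpfile"

# Edit in place, keeping the untouched original as "$tmpfile.bak".
LC_ALL=C sed -i.bak 's/[^\t\r\n -~]//g' "$tmpfile"

cleaned=$(cat "$tmpfile")        # contents after cleaning
original=$(cat "$tmpfile.bak")   # backup still holds the non-ASCII bytes

printf '%s\n' "$cleaned"

# Remove the scratch files.
rm -f "$tmpfile" "$tmpfile.bak"
```

Once you're happy with the result, drop the suffix (plain -i) if you don't want the backup.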