
Meaning of Regular Expression with JavaScript and PHP

Can anybody explain what this regular expression does?

I want to remove characters whose ASCII code is less than 32, except Horizontal Tab, Line Feed, and Carriage Return.

Will the code below do that, or do I need to change it?

JavaScript Code:

var text = text.replace(/[\x00-\x09\x0A\x0D-\x2F]+/, "");

PHP Code:

$val = preg_replace('/[\x00-\x09\x0A\x0D-\x2F]/', '',$val);

Edit

I want to preserve LF, HT, and CR wherever they appear in the string. All other characters below ASCII 32 should be removed.

Well, given:

  • 0x09 = tab
  • 0x0A = line feed
  • 0x0D = carriage return

Then anything other than the above (and still < 32) would look something like:

/[\x00-\x08\x0B\x0C\x0E-\x1F]/

I assume you meant an exclusive bound (up to but not including 32); otherwise the last hex code should be \x20.
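
As a minimal JavaScript sketch of the same idea, assuming the standard ASCII values HT = 0x09, LF = 0x0A, and CR = 0x0D: the character class below matches every code below 0x20 except those three. The function name `stripControlChars` is hypothetical, and the `g` flag is there so that every match is replaced rather than only the first one.

```javascript
// Hypothetical helper: removes ASCII control characters below 0x20,
// keeping HT (0x09), LF (0x0A), and CR (0x0D).
function stripControlChars(text) {
  return text.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, "");
}

const sample = "a\x06b\tc\r\nd\x1Be";
console.log(JSON.stringify(stripControlChars(sample))); // prints "ab\tc\r\nde"
```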


$orig   = "This is a sample document. It contains:\r\n"
        . "\t* horizontal tabs,\r\n"
        . "\t* line feeds, and\r\n"
        . "\t* carriage returns\r\n"
        . "\r\n"
        . "These characters are not to be removed. However, other characters, such as:\r\n"
        . "\r\n"
        . "\t'\x06' (ACK),\r\n"
        . "\t'\x07' (BEL),\r\n"
        . "\t'\x1B' (ESC)\r\n"
        . "\t(others)\r\n"
        . "\r\n"
        . "And other characters < ordinal 32 should be removed.";

// Remove everything below 0x20 except HT (0x09), LF (0x0A), and CR (0x0D).
$modif  = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $orig);

echo str_repeat('=', 50) . PHP_EOL;
echo (strlen($orig) == strlen($modif) ? "Failed" : "Success") . PHP_EOL;
echo str_repeat('=', 50) . PHP_EOL;
echo PHP_EOL;
echo $modif;

Since $modif is shorter than $orig (by 3 characters: \x06, \x07, \x1B) while the whitespace characters (\x09, \x0A, \x0D) are preserved, I would say this is what you're after.

Your first question (explanation of regex)

Since all of your hex codes are below decimal 128, you can use an ASCII table to check what will be matched. Your regex replaces these characters:

Dec Oct Hex Binary      Symbol    Description
0   000 00  00000000    NUL       Null char
1   001 01  00000001    SOH       Start of Heading
2   002 02  00000010    STX       Start of Text
3   003 03  00000011    ETX       End of Text
4   004 04  00000100    EOT       End of Transmission
5   005 05  00000101    ENQ       Enquiry
6   006 06  00000110    ACK       Acknowledgment
7   007 07  00000111    BEL       Bell
8   010 08  00001000    BS        Back Space
9   011 09  00001001    HT        Horizontal Tab
10  012 0A  00001010    LF        Line Feed

and these:

13  015 0D  00001101    CR        Carriage Return
14  016 0E  00001110    SO        Shift Out / X-On
15  017 0F  00001111    SI        Shift In / X-Off
16  020 10  00010000    DLE       Data Line Escape
17  021 11  00010001    DC1       Device Control 1 (oft. XON)
18  022 12  00010010    DC2       Device Control 2
19  023 13  00010011    DC3       Device Control 3 (oft. XOFF)
20  024 14  00010100    DC4       Device Control 4
21  025 15  00010101    NAK       Negative Acknowledgement
22  026 16  00010110    SYN       Synchronous Idle
23  027 17  00010111    ETB       End of Transmit Block
24  030 18  00011000    CAN       Cancel
25  031 19  00011001    EM        End of Medium
26  032 1A  00011010    SUB       Substitute
27  033 1B  00011011    ESC       Escape
28  034 1C  00011100    FS        File Separator
29  035 1D  00011101    GS        Group Separator
30  036 1E  00011110    RS        Record Separator
31  037 1F  00011111    US        Unit Separator
32  040 20  00100000    SP        Space
33  041 21  00100001    !         Exclamation mark
34  042 22  00100010    "         Double quotes (or speech marks)
35  043 23  00100011    #         Number sign
36  044 24  00100100    $         Dollar sign
37  045 25  00100101    %         Percent sign
38  046 26  00100110    &         Ampersand
39  047 27  00100111    '         Single quote
40  050 28  00101000    (         Open parenthesis (or open bracket)
41  051 29  00101001    )         Close parenthesis (or close bracket)
42  052 2A  00101010    *         Asterisk
43  053 2B  00101011    +         Plus
44  054 2C  00101100    ,         Comma
45  055 2D  00101101    -         Hyphen
46  056 2E  00101110    .         Period, dot or full stop
47  057 2F  00101111    /         Slash or divide

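with an empty string. Note that this removes HT, LF, and CR (which you want to keep) as well as the printable characters from space through "/", while leaving 0x0B and 0x0C untouched.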
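
To see the effect concretely, here is a small JavaScript sketch that lists which character codes the character class from the question matches. The helper `matchedCodes` is a hypothetical name introduced for illustration; it simply tests each code from 0 to 127 against the pattern.

```javascript
// Hypothetical helper: returns the ASCII codes (0-127) matched by a regex.
function matchedCodes(regex) {
  const codes = [];
  for (let code = 0; code < 128; code++) {
    if (regex.test(String.fromCharCode(code))) {
      codes.push(code);
    }
  }
  return codes;
}

// Character class from the question (without the quantifier).
const questionCodes = matchedCodes(/[\x00-\x09\x0A\x0D-\x2F]/);

console.log(questionCodes.includes(0x09)); // true: HT is removed
console.log(questionCodes.includes(0x0A)); // true: LF is removed
console.log(questionCodes.includes(0x0D)); // true: CR is removed
console.log(questionCodes.includes(0x0B)); // false: VT is left in place
console.log(questionCodes.includes(0x20)); // true: space is removed too
```

Separately, the JavaScript call in the question has no `g` flag, so `replace` would only remove the first matching run of characters.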

Your second question (replace non-printables, i.e. 0-31, i.e. 0x00-0x1F)

If you want to remove all characters below decimal 32 (the non-printable ones, it seems), then:

$val = preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]/', '', $val); // skips \x09 (HT), \x0A (LF), \x0D (CR)

(updated, preserving HT, LF, CR)
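
Because hex ranges are easy to get wrong, an alternative sketch in JavaScript filters by character code with an explicit keep-list instead of a regex. This is a different technique from the `preg_replace` call, shown only for comparison; `KEEP` and `removeLowControls` are hypothetical names.

```javascript
// Codes below 32 that should survive: HT (9), LF (10), CR (13).
const KEEP = new Set([9, 10, 13]);

// Hypothetical helper: drops every character below code 32 unless it is in KEEP.
function removeLowControls(text) {
  return Array.from(text)
    .filter((ch) => {
      const code = ch.codePointAt(0);
      return code >= 32 || KEEP.has(code);
    })
    .join("");
}

console.log(JSON.stringify(removeLowControls("x\x00y\tz\x1F\n"))); // prints "xy\tz\n"
```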


 