Remove non-ASCII characters from a string in Python

I want to delete special characters from a string, but I have not been successful. Can you help me?

The value is shown with two backslashes ("\\") each time, but when I print it there is only one ("\"). Why could that be?

Data Update:

data = [{
            "data": "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com",
            "name": "subjectAltName"
        }]

re.sub("[^\x20-\x7E]", "", data[0]["data"])

Try this.

clean_text = ' '.join(re.findall(r"[^\W]+", text))
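
To see what this pattern keeps, here is a small check against the old sample text quoted in this answer (the variable names are only illustrative). `[^\W]` is a double negative equivalent to `\w`, so only runs of letters, digits, and underscores survive. Punctuation such as `:` is dropped as well, and the pieces are rejoined with single spaces.

```python
import re

# Old sample text from the question: \x82 and \x16 are real control characters here.
text = '0:\x82 test test test\x82\x16testtesttest'

# [^\W]+ behaves like \w+: it matches runs of word characters.
words = re.findall(r"[^\W]+", text)
clean_text = ' '.join(words)

print(words)
print(clean_text)
```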

EDIT: or this.

custom_translation = {130: None, 22: None}
print(text.translate(custom_translation))

The question has since been edited (the text changed), so this solution no longer works. The old text was:

text = '0:\x82 test test test\x82\x16testtesttest'

Newer Solution:

custom_translation = {22: None, 49: None, 50: None, 54: None, 56: None, 92: None, 120: None}
print(text.translate(custom_translation))

import re

txt = "0:\\x82 test test test\\x82\\x16testtesttest"
# Raw string: \\ in the pattern matches one literal backslash.
x = re.sub(r"\\(?:x16|x82)", "", txt)

To generalize to any such escape sequence:

x = re.sub(r"\\x\w\w", "", txt)

Output:

0: test test testtesttesttest
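
As a rough comparison on the data string from the question: a translation table deletes individual characters wherever they occur, while a regex can delete only complete `\xNN` sequences. The hex-digit pattern below is a slightly stricter variant of `\w\w`, chosen here as an assumption and not taken from the answers above.

```python
import re

# Data string from the question: the backslashes are literal characters.
txt = "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com"

# Character-level deletion: removes every backslash, 'x', '1', '2', '6' and '8'.
table = {22: None, 49: None, 50: None, 54: None, 56: None, 92: None, 120: None}
by_translate = txt.translate(table)

# Sequence-level deletion: removes a backslash, 'x', and two hex digits together.
by_regex = re.sub(r"\\x[0-9a-fA-F]{2}", "", txt)

print(by_translate)
print(by_regex)
```

With this input the translation table also strips the `x` from `extractdomain` and leaves stray characters such as the `e` from `\x1e`, so the regex is probably the safer choice when the escapes are literal text.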

Good to know:

In short, to match a literal backslash, one has to write '\\\\' as the RE string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. In REs that feature backslashes repeatedly, this leads to lots of repeated backslashes and makes the resulting strings difficult to understand.

Another way is to use Python's raw string notation for regular expressions; backslashes are not handled in any special way in a string literal prefixed with 'r', so r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation.
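
The string lengths make the difference concrete. A minimal check using only the standard library (the `sample` name is just for illustration):

```python
import re

# A raw literal keeps the backslash; a normal literal turns \n into a newline.
raw_len = len(r"\n")
normal_len = len("\n")
print(raw_len, normal_len)

# Both spellings produce the same two-character pattern, which matches one backslash.
same_pattern = ("\\\\" == r"\\")
print(same_pattern)

sample = "a\\b"  # three characters: a, backslash, b
replaced = re.sub(r"\\", "/", sample)
print(replaced)
```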

For more examples, see "The Backslash Plague" in the Python Regular Expression HOWTO.

The error is in the declaration of text: you double-escape the backslash, so you are writing a plain \ followed by "x82" instead of a hexadecimal character escape.

text = '0:\x82 test test test\x82\x16testtesttest'

print(re.sub("[^\x20-\x7E]", "", text))

prints: 0: test test testtesttesttest
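
One way to confirm which kind of string you have is to compare lengths and `repr` output. This also explains the two backslashes in the question: `repr` (which is what you see when a value is displayed inside a list or dict) shows a literal backslash as `\\`, while `print` shows it as `\`. The variable names below are only illustrative.

```python
escaped = "\x82"   # one character: code point 0x82
literal = "\\x82"  # four characters: backslash, x, 8, 2

print(len(escaped), len(literal))
print(repr(literal))  # the backslash is displayed doubled
print(literal)        # the backslash is displayed once
```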

Try this approach

import re


def delete_punc(s):
    # Keep only ASCII letters in each whitespace-separated word,
    # so the function works for any number of words, not just two.
    words = (re.sub(r'[^a-zA-Z]', '', word) for word in s.split())
    return ' '.join(words)


# Raw string so that \d is kept as a literal backslash followed by 'd'.
print(delete_punc(r"He3l?/l!o W{o'r[l9\d)"))

output

Hello World

It looks as if the string contains \x escapes which have themselves been escaped, leading to the doubled backslashes. Perhaps you received the data like this, or perhaps some earlier processing has corrupted it. The doubled backslashes can be removed by encoding the string as bytes and then decoding with the unicode-escape codec. After this, your regex will work.

>>> import re
>>> s = "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com"
>>> fixed = s.encode('latin-1').decode('unicode-escape')
>>> fixed
'0\x1e\x82*.extractdomain.com\x82\x0ctest.extractdomain.com'
>>> re.sub("[^\x20-\x7E]", "", fixed)
'0*.extractdomain.comtest.extractdomain.com'
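
Wrapped as a function and applied to the list of dicts from the question, this might look like the sketch below. The name `strip_escaped_nonprintable` is hypothetical, and the sketch assumes every character in the input fits in Latin-1; otherwise `encode("latin-1")` raises `UnicodeEncodeError`.

```python
import re

# Anything outside printable ASCII (space through tilde).
NON_PRINTABLE = re.compile(r"[^\x20-\x7E]")


def strip_escaped_nonprintable(s):
    """Decode literal \\xNN sequences, then drop non-printable characters."""
    # Assumption: s contains only Latin-1 characters.
    decoded = s.encode("latin-1").decode("unicode-escape")
    return NON_PRINTABLE.sub("", decoded)


data = [{
    "data": "0\\x1e\\x82*.extractdomain.com\\x82\\x0ctest.extractdomain.com",
    "name": "subjectAltName",
}]

cleaned = [strip_escaped_nonprintable(item["data"]) for item in data]
print(cleaned)
```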

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.