
Extract alphanumeric text plus certain special characters using Python regex

I have the following text, which I want to convert to a desired format using a Python regex:

text = "' PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'"

I used the following code:

import re
reg = re.compile(r"[^\w']")
text = reg.sub(' ', text)

However, this produces text = "'PowerPoint PresentationOctober 11th 2011 Visit to Lap Chec1Edit or delete me in â viewâ then â slide masterâ'", which is not the desired output.

My desired output is text = "'PowerPoint PresentationOctober 11th, 2011(Visit) to Lap Chec1Edit or delete me in view then slide master.'". I want to remove all special characters except the following: []()-,.

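The stray â characters in your output suggest that the text is UTF-8 that was mis-decoded as Windows-1252. Rather than removing the characters, you can repair them by round-tripping through the right encoding: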

text = text.encode('windows-1252').decode('utf-8')
# => ' PowerPoint PresentationOctober 11th, 2011Visit to Lap Chec1Edit or delete me in ‘view’ then ’slide master’.'
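To see why the round trip works, here is a minimal sketch that first reproduces the damage and then undoes it. The sample string and the variable names are illustrative assumptions, not taken from the question; the sketch also assumes every garbled byte has a Windows-1252 mapping, which holds for ‘ and ’ but not for every character.

```python
# Start from correctly decoded text containing curly quotes.
original = "Edit or delete me in ‘view’ then ’slide master’."

# Simulate the damage: UTF-8 bytes wrongly decoded as Windows-1252.
garbled = original.encode("utf-8").decode("windows-1252")

# Repair: turn the characters back into the original bytes, then decode as UTF-8.
repaired = garbled.encode("windows-1252").decode("utf-8")

print(garbled)
print(repaired)
```

If the text was never mis-decoded in the first place, the encode/decode pair will usually raise a UnicodeError instead of silently corrupting it, so it may be worth wrapping the call in a try/except.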


If you still want to remove the quotes afterwards, that becomes much easier, e.g. text.replace('‘', '').replace('’', ''), or re.sub(r'[‘’]+', '', text).
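As a quick illustration of the two removal options mentioned above, the following sketch applies both to a short sample and checks that they agree. The sample string and variable names are made up for the example.

```python
import re

text = "delete me in ‘view’ then ’slide master’."

# Option 1: chained str.replace calls, one per quote character.
via_replace = text.replace("‘", "").replace("’", "")

# Option 2: a single regex with a character class covering both quotes.
via_regex = re.sub(r"[‘’]+", "", text)

print(via_replace)
print(via_regex)
```

Both calls only work as intended once the text is correctly decoded; on the mojibake form each curly quote is spread over three separate characters.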

I found the answer; it turned out to be simple, as follows. Thanks for the replies.

reg = re.compile(r"[^\w',.()\[\]]")
text = reg.sub(' ', text)
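For comparison, here is a hedged variant of the same whitelist idea that gets closer to the output requested in the question. It differs from the snippet above in several ways that are my own assumptions: it also keeps the hyphen (which the question lists), deletes disallowed characters instead of replacing them with spaces, drops the apostrophe, uses re.ASCII so that leftover non-ASCII debris such as â is removed too, and collapses repeated whitespace. The names DISALLOWED and clean are hypothetical.

```python
import re

# Keep ASCII word characters, whitespace, and [ ] ( ) - , .
# re.ASCII restricts \w and \s to ASCII, so characters like 'â' are dropped.
DISALLOWED = re.compile(r"[^\w\s\[\]()\-,.]", re.ASCII)


def clean(text: str) -> str:
    """Delete disallowed characters, then normalize whitespace."""
    stripped = DISALLOWED.sub("", text)
    return re.sub(r"\s+", " ", stripped).strip()


sample = (
    "' PowerPoint PresentationOctober 11th, 2011(Visit) to Lap "
    "Chec1Edit or delete me in ‘view’ then ’slide master’.'"
)
print(clean(sample))
```

Deleting rather than substituting a space avoids the doubled spaces that the substitution approach leaves around removed quotes, at the cost of gluing together words that were separated only by a disallowed character.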
