
How to remove slash, letters and numbers from a string?

I was trying to get a nice and clean representation of a string. My desired version would be ['Course Number: CLASSIC 10A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4']

However, the current output is ['Course Number: CLASSIC\xa010A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4'].

Something (\xa0) is getting in the way in the first element. I have attached the relevant part of my code below. Thanks in advance for helping me out.

all_tds = [get_tds(scrollable) for scrollable in scrollables]
def num_name_unit(list, index):
    all_rows = []
    num = list[index][0].get_text(strip=True)
    name = str.isalnum, list[index][1].get_text(strip=True)
    unit = list[index][2].get_text(strip=True)
    all_rows += ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return all_rows
c = num_name_unit(all_tds[0], all_tds.index(all_tds[0]))
print(c)

As @melpomene commented, the string '\xa0' is a single character: a non-breaking space. One way to deal with it is to replace every character you do not want to keep with an ordinary space, using a regex:

import re
re.sub('[^A-Za-z0-9-|:]+', ' ', text)  # text is the string you want to clean

This is generally my preferred way of removing special characters and formatting, but how does it work? Look at the pattern inside the first set of quotation marks, '[^A-Za-z0-9-|:]+'. The square brackets define a character class, and the leading ^ negates it, so the class matches any character that is not listed. A-Z means the capital letters from A to Z, and a-z the same in lower case. After that, 0-9 covers the digits from 0 to 9, the - that follows is a literal hyphen, and finally |: means pipes and colons. The trailing + matches one or more such characters in a row, so each run of unlisted characters is replaced by a single space. Let's test this with a simple script:

import re
vals = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|:'
print(vals == re.sub('[^A-Za-z0-9-|:]+', ' ', vals))

I would recommend running this code yourself to try it out, but the answer you get back is True: every character in vals is in the allowed set, so nothing is replaced.
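The same pattern can also be tried on the string from the question. This is only a sketch: it assumes the scraped text contains a real non-breaking space (U+00A0) between CLASSIC and 10A, and the variable names raw and cleaned are made up for illustration.

```python
import re

# Assumed input: the question's string with a real non-breaking space (U+00A0)
raw = ('Course Number: CLASSIC\xa010A | Course Name: '
       'Introduction to Greek Civilization1 | Course Unit: 4')

# Replace each run of characters outside the allowed set with one space
cleaned = re.sub('[^A-Za-z0-9-|:]+', ' ', raw)
print(cleaned)
```

Ordinary spaces are not in the allowed set either, so they are matched and replaced by a space as well, which leaves them unchanged. Punctuation such as commas or periods would also be turned into spaces unless it is added to the class.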

Incorporating this into your script would be as simple as:

import re
all_tds = [get_tds(scrollable) for scrollable in scrollables]
def num_name_unit(rows, index):
    all_rows = []
    num = rows[index][0].get_text(strip=True)
    name = rows[index][1].get_text(strip=True)
    unit = rows[index][2].get_text(strip=True)
    all_rows += ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return all_rows
c = num_name_unit(all_tds[0], all_tds.index(all_tds[0]))
# c is a list of strings, so clean each element rather than the list itself
print([re.sub('[^A-Za-z0-9-|:]+', ' ', row) for row in c])
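Since get_tds and scrollables are not shown in the question, the script cannot be run on its own. The sketch below swaps in a hypothetical FakeCell class for the scraped cells, assuming only that each cell has a get_text(strip=True) method and that every row holds three cells (number, name, unit). Both the class and the nesting of all_tds are assumptions, not the asker's actual data.

```python
import re

class FakeCell:
    """Hypothetical stand-in for a scraped table cell."""
    def __init__(self, text):
        self.text = text

    def get_text(self, strip=False):
        return self.text.strip() if strip else self.text

# Assumed shape: all_tds[table][row] is a list of three cells
all_tds = [[[FakeCell('CLASSIC\xa010A'),
             FakeCell('Introduction to Greek Civilization1'),
             FakeCell('4')]]]

def num_name_unit(rows, index):
    # Pull the text out of the first three cells of the chosen row
    num, name, unit = (cell.get_text(strip=True) for cell in rows[index][:3])
    row = 'Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)
    # Clean the formatted string before returning it in a list
    return [re.sub('[^A-Za-z0-9-|:]+', ' ', row)]

c = num_name_unit(all_tds[0], 0)
print(c)
```

Here the cleaning is done inside the function so that the caller always receives a list of already cleaned strings.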

If you encounter any other characters you wish to keep in your string, simply add them to the end of the character class [^A-Za-z0-9-|:]. For example, if you wished to keep underscores as well, you would use '[^A-Za-z0-9-|:_]+'. Hope this helped. To read more, see the Regular Expression HOWTO in the Python 3 docs.
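If the non-breaking space is the only thing that needs to go, a narrower alternative to the regex is to replace just that character and leave all other punctuation alone. The sample string below is hypothetical, and this is a different technique from the allow-list pattern above, offered only as a sketch.

```python
import unicodedata

# Hypothetical sample text containing a non-breaking space (U+00A0)
raw = 'CLASSIC\xa010A: Greek, Roman & Byzantine'

# Option 1: replace only the non-breaking space with an ordinary space
replaced = raw.replace('\xa0', ' ')

# Option 2: NFKC normalization maps U+00A0 to an ordinary space
# (it also rewrites other compatibility characters, which may or may not be wanted)
normalized = unicodedata.normalize('NFKC', raw)

print(replaced)
print(normalized)
```

Both options keep the comma and the ampersand, which the allow-list regex would have turned into spaces.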
