I am trying to get a clean representation of a string. The output I want is ['Course Number: CLASSIC 10A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4']
However, the current output is ['Course Number: CLASSIC\xa010A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4'].
Something (\xa0) is getting in the way in the first element. The relevant part of my code is below. Thanks in advance for helping me out.
all_tds = [get_tds(scrollable) for scrollable in scrollables]

def num_name_unit(list, index):
    all_rows = []
    # Each cell is a BeautifulSoup tag; get_text(strip=True) returns its text without surrounding whitespace
    num = list[index][0].get_text(strip=True)
    name = list[index][1].get_text(strip=True)
    unit = list[index][2].get_text(strip=True)
    all_rows += ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return all_rows

c = num_name_unit(all_tds[0], all_tds.index(all_tds[0]))
print(c)
As @melpomene commented, '\xa0' is a single character: a non-breaking space. One way to get rid of it is to replace every character you do not explicitly want to keep with an ordinary space, using a regex:
import re
# text is a placeholder for the string you want to clean
re.sub('[^A-Za-z0-9-|:]+', ' ', text)
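This is generally my preferred way of removing special characters and formatting, but how does it work? Look at the pattern inside the first set of quotation marks, '[^A-Za-z0-9-|:]+'. The square brackets define a set of characters, and the leading ^ negates it, so the pattern matches anything that is not in the set. Inside the set, A-Z means every capital letter from A to Z, and a-z means the same in lower case. After that, 0-9 covers the digits 0 to 9, the - that follows is a literal hyphen, and finally |: means pipes and colons. The trailing + matches one or more such characters in a row, so each run of unwanted characters (including ordinary spaces, which are not in the set) is replaced by a single space. Let's test this with a simple script: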
import re
vals = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789|:'
print(vals == re.sub('[^A-Za-z0-9-|:]+', ' ', vals))
I would recommend running this code yourself to try it out, but the answer you get back is True: every character in vals is in the allowed set, so re.sub leaves the string unchanged.
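To see the effect on the string from the question itself, here is a minimal sketch. The variable names raw and cleaned are only illustrative, and the input is typed in by hand rather than scraped:

```python
import re

# The string from the question; '\xa0' is one character (U+00A0, a non-breaking space).
raw = 'Course Number: CLASSIC\xa010A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4'

# Replace every run of characters outside the allowed set with a single ordinary space.
cleaned = re.sub('[^A-Za-z0-9-|:]+', ' ', raw)

print(repr(raw))      # the repr still contains the \xa0 escape
print(repr(cleaned))  # the repr contains a plain space in its place
```

Printing with repr makes the difference visible, since a non-breaking space and an ordinary space look identical when printed normally.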
Incorporating this into your script would be as simple as:
import re
all_tds = [get_tds(scrollable) for scrollable in scrollables]

def num_name_unit(list, index):
    all_rows = []
    num = list[index][0].get_text(strip=True)
    name = list[index][1].get_text(strip=True)
    unit = list[index][2].get_text(strip=True)
    all_rows += ['Course Number: {0} | Course Name: {1} | Course Unit: {2}'.format(num, name, unit)]
    return all_rows

c = num_name_unit(all_tds[0], all_tds.index(all_tds[0]))
# c is a list of strings, so clean each element rather than passing the whole list to re.sub
print([re.sub('[^A-Za-z0-9-|:]+', ' ', row) for row in c])
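Because num_name_unit returns a list, the substitution has to be applied to each string in it. A small sketch of that step on its own, assuming a hypothetical helper named clean_rows and a hand-written list standing in for the scraped rows:

```python
import re

# Same pattern as in the answer, compiled once so it can be reused for every row.
NOT_ALLOWED = re.compile('[^A-Za-z0-9-|:]+')

def clean_rows(rows):
    """Return a new list in which each string has unwanted characters replaced by spaces.

    clean_rows is a hypothetical helper, not part of the original script.
    """
    return [NOT_ALLOWED.sub(' ', row).strip() for row in rows]

# Stand-in for the list returned by num_name_unit
rows = ['Course Number: CLASSIC\xa010A | Course Name: Introduction to Greek Civilization1 | Course Unit: 4']
print(clean_rows(rows))
```

The extra strip() only trims a leading or trailing space left behind when a row starts or ends with a replaced character.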
If you encounter any other characters you wish to keep in your string, simply add them to the end of the set A-Za-z0-9-|: inside the brackets. For example, if you wished to keep underscores as well, you would use '[^A-Za-z0-9-|:_]+'
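A quick comparison of the two patterns, as a sketch. The sample string is made up for illustration and does not come from the scraped page:

```python
import re

# Made-up sample containing both an underscore and a non-breaking space
sample = 'course_id: CLASSIC\xa010A'

# Original set: the underscore is not allowed, so it becomes a space.
without_underscore = re.sub('[^A-Za-z0-9-|:]+', ' ', sample)

# Extended set: the underscore is kept, the non-breaking space is still replaced.
with_underscore = re.sub('[^A-Za-z0-9-|:_]+', ' ', sample)

print(without_underscore)
print(with_underscore)
```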
Hope this helped. To read more, see the Regular Expression HOWTO in the Python 3 documentation.