I know similar questions exist for this topic but I've gone through them and still couldn't get it.
My Python program retrieves a subsection of HTML from a page using a regular expression. I just realised that I hadn't accounted for HTML special characters getting in the way.
Say I have:
regex_title = ['I went to the store', 'It&#39;s a nice day today', 'I went home for a rest']
I obviously want to change &#39; to a single quote '.
I've tried variations of:
for each in regex_title:
    if '&#39;' in regex_title:
        str.replace("&#39;", "'")
but had no success. What am I missing?
NOTE: The purpose is to do this without importing any more modules.
str.replace does not modify the string in place; it returns a new string. You need to assign the return value back.
>>> regex_title = ['I went to the store', 'It&#39;s a nice day today',
...                'I went home for a rest']
>>> regex_title = [s.replace("&#39;", "'") for s in regex_title]
>>> regex_title
['I went to the store', "It's a nice day today", 'I went home for a rest']
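To illustrate the point about return values, here is a minimal sketch. It assumes the escape in question is the `&#39;` entity, and the `alias` name is only there for illustration. Calling replace leaves the original string untouched, and slice assignment is one way to update the existing list object instead of rebinding the name.

```python
# Strings are immutable: replace() returns a new string and leaves the original alone.
title = "It&#39;s a nice day today"
result = title.replace("&#39;", "'")
print(title)   # the original still contains the entity
print(result)

# Slice assignment updates the existing list object in place,
# so any other reference to the same list sees the change too.
regex_title = ["I went to the store", "It&#39;s a nice day today", "I went home for a rest"]
alias = regex_title  # hypothetical second reference to the same list
regex_title[:] = [s.replace("&#39;", "'") for s in regex_title]
print(alias)
```

Plain reassignment, as in the answer above, is fine when nothing else holds a reference to the list.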
If your task is to unescape HTML, it is better to use the unescape function:
>>> ll = ['I went to the store', 'It&#39;s a nice day today', 'I went home for a rest']
>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> print map(h.unescape, ll)
['I went to the store', u"It's a nice day today", 'I went home for a rest']
You need to change your code to this:
for each in regex_title:
    if '&#39;' in each:
        each.replace("&#39;", "'")
But that doesn't change your list, because replace returns a new string. You need to assign the replaced string back to its index in the list:
>>> for each in regex_title:
...     if '&#39;' in each:
...         regex_title[regex_title.index(each)] = each.replace("&#39;", "'")
...
>>> regex_title
['I went to the store', "It's a nice day today", 'I went home for a rest']
>>>
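As a variation on the loop above, here is a sketch that uses enumerate to get each position directly, which avoids searching the list again with index on every match. It is written for Python 3 and assumes the escape is the `&#39;` entity.

```python
regex_title = ["I went to the store", "It&#39;s a nice day today", "I went home for a rest"]

# enumerate yields (position, value) pairs, so the slot to overwrite is known directly.
for i, title in enumerate(regex_title):
    if "&#39;" in title:
        regex_title[i] = title.replace("&#39;", "'")

print(regex_title)
```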
You don't explain why you want to avoid importing standard library modules. There are very few good reasons to deny yourself the use of Python's included batteries; unless you have such a reason (and if you do, you should state it), you should use the functionality provided to you.
In this case, it's the unescape() function from the html module [1]:
from html import unescape

titles = [
    'I went to the store',
    'It&#39;s a nice day today',
    'I went home for a rest'
]

fixed = [unescape(s) for s in titles]
>>> fixed
['I went to the store', "It's a nice day today", 'I went home for a rest']
Reimplementing html.unescape()
yourself is
[1] Since Python 3.4, anyway. For previous versions, use HTMLParser.HTMLParser.unescape() as per @stalk's answer.
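If the same script has to run on both sides of that version boundary, one possible arrangement is a guarded import. This is only a sketch: the fallback branch assumes the Python 2 HTMLParser module mentioned in the footnote, and the entity in the sample string is assumed to be `&#39;`.

```python
try:
    # Python 3.4 and later
    from html import unescape
except ImportError:
    # Python 2 fallback: unescape is a method on an HTMLParser instance
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

print(unescape("It&#39;s a nice day today"))
```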
Instead of doing this yourself, you'd be better off using the HTMLParser library, as described in https://stackoverflow.com/a/2087433/2314532. Read that question and answer for all the details, but the summary is:
import HTMLParser
parser = HTMLParser.HTMLParser()
print parser.unescape('&#39;')
# Will print a single ' character
So in your case, you'd want to do something like:
import HTMLParser
parser = HTMLParser.HTMLParser()
new_titles = [parser.unescape(s) for s in regex_title]
That will unescape any HTML escape, not just the &#39; escape that you asked about, and process the entire list all at once.
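To show what "any HTML escape" means in practice, here is a small sketch using html.unescape, the Python 3 function mentioned elsewhere in this thread, rather than the Python 2 HTMLParser instance used above. The sample strings are made up for illustration.

```python
from html import unescape  # Python 3.4+

# Hypothetical sample strings covering numeric and named entities.
samples = ["It&#39;s here", "Fish &amp; chips", "1 &lt; 2", "&quot;quoted&quot;"]
decoded = [unescape(s) for s in samples]
print(decoded)
```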
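Try it like this:
regex_title = ['I went to the store', 'It&#39;s a nice day today', 'I went home for a rest']
# Join on a delimiter, replace once, then split on the same delimiter.
# Note: this breaks if any title itself contains a comma.
joined = ','.join(regex_title)
replaced = joined.replace("&#39;", "'")
print replaced.split(',')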
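If the question's no-import constraint is firm and more than one entity needs handling, a hand-written table is one option. This is only a sketch: the `ENTITIES` table and the `unescape_basic` helper are hypothetical names, and it covers just the handful of entities listed, not the full HTML set.

```python
# Hypothetical lookup table; "&amp;" comes last so that text such as
# "&amp;lt;" decodes to "&lt;" and is not decoded a second time to "<".
ENTITIES = [
    ("&#39;", "'"),
    ("&quot;", '"'),
    ("&lt;", "<"),
    ("&gt;", ">"),
    ("&amp;", "&"),
]

def unescape_basic(text):
    """Replace the few entities listed in ENTITIES; anything else is left as-is."""
    for entity, char in ENTITIES:
        text = text.replace(entity, char)
    return text

regex_title = ["I went to the store", "It&#39;s a nice day today", "I went home for a rest"]
print([unescape_basic(s) for s in regex_title])
```

Unlike the join-and-split approach, this works on each string separately, so delimiters inside a title are not a concern.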