
Removing <u> character from text using Scrapy

I am using Python.org version 2.7 64 bit on Vista 64 bit to run Scrapy. I am trying out scraping some text from a webpage and have managed to get most of the text cleaned up, removing line breaks and HTML tags. However, 'u' tags still seem to be included in the text output to the Command Shell:

u' British Grand Prix practice results ', u'

This is from the following webpage:

http://www.bbc.co.uk/sport/0/formula1/28166984

The above string represents a hyperlink to another page. I have tried using the following regular expression to remove the 'u' tags, but it has not worked:

body = response.xpath("//p").extract()
body2 = str(body)
body3 = re.sub(r'(\\[u]|\s){2,}', ' ', body2)

Can anyone suggest a way of removing these tags? Also, if possible, can you use regular expressions to remove everything between two tags as well?

Thanks

The u is only Python's way of indicating that the text is a Unicode string. It is not part of the text itself.

You have to print the text the right way to get it without this information.

a = [ u'hello', u'world' ]

print a

[u'hello', u'world']

for x in a:
    print x

hello
world

In your situation, body is probably a list of strings. You can check with:

print type(body)

so do this

body2 = ''

for x in body:
    body2 += x

print body2

or even better:

body2 = "".join(body)

print body2
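To illustrate why str(body) in the question produced the stray markers, and to touch on the follow-up about regular expressions, here is a minimal sketch. The sample list is an assumption standing in for what response.xpath("//p").extract() returns, and strip_tags and strip_element are hypothetical helpers, not part of Scrapy. The regular expressions are only suitable for simple fragments; they are not a full HTML parser. Under Python 3 the u prefix no longer appears, but str() on a list still adds brackets and quotes.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical sample standing in for response.xpath("//p").extract()
body = [u'<p>British Grand Prix <a href="#">practice results</a></p>',
        u'<p>Second paragraph</p>']

# str() on a list embeds the repr() of each item: brackets, quotes and
# (on Python 2) the u prefix all become part of the resulting text.
as_repr = str(body)

# Joining the items keeps only their actual content.
joined = u" ".join(body)


def strip_tags(text):
    """Remove anything that looks like an HTML tag (hypothetical helper)."""
    return re.sub(r'<[^>]+>', u'', text)


def strip_element(text, tag):
    """Remove a whole element: its tags and everything between them."""
    pattern = r'<{0}\b[^>]*>.*?</{0}>'.format(re.escape(tag))
    return re.sub(pattern, u'', text, flags=re.DOTALL)


clean = strip_tags(joined)
no_links = strip_tags(strip_element(joined, 'a'))
print(clean)
print(no_links)
```

In a real spider it is usually simpler to select the text nodes directly in the XPath expression instead of stripping tags afterwards.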

As furas mentioned, the u only marks the string as Unicode. In Python 2.7 a plain string literal is a byte string (the default encoding is ASCII), so a Unicode string is displayed with a u prefix. You can go back and forth using unicode() and encode('utf-8'):

>>> a = 's'
>>> a
's'
>>> a = unicode('s')
>>> a
u's'
>>> a = a.encode('utf-8')
>>> a
's'

Here's how to do it with a list

>>> ul = []
>>> ul.append(unicode('British Grand Prix practice results'))
>>> ul.append(unicode('some other string'))
>>> ul
[u'British Grand Prix practice results', u'some other string']
>>> l = []
>>> for s in ul:
...    l.append(s.encode('utf-8'))
...
>>> l
['British Grand Prix practice results', 'some other string']
>>>
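The loop above can also be written as a list comprehension. This is a small sketch using the same sample strings. Note that under Python 3, encode returns bytes objects (displayed with a b prefix), so there you would normally keep the text as str and skip this step.

```python
# -*- coding: utf-8 -*-
ul = [u'British Grand Prix practice results', u'some other string']

# Encode each Unicode string to a UTF-8 byte string in one pass.
encoded = [s.encode('utf-8') for s in ul]

# Decoding reverses the conversion and restores the original text.
decoded = [b.decode('utf-8') for b in encoded]
print(decoded == ul)
```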

