
Strings with special characters in Python do not appear correctly

I have parsed some text (names of cities) from a website into a list using BeautifulSoup, but I have run into a problem that I cannot overcome. The text elements on the website contain special characters, and when I print the list the city names are shown as [u'London'], with numbers and letters appearing in place of the special characters. How can I get rid of the 'u' at the beginning and convert the text to the form in which it originally appears on the website?

Here is the code:

import urllib2
from bs4 import BeautifulSoup

address = 'https://clinicaltrials.gov/ct2/show/NCT02226120?resultsxml=true'

page = urllib2.urlopen(address)
soup = BeautifulSoup(page)
locations = soup.findAll('country', text="Hungary")
for city_tag in locations:
    site=city_tag.parent.name
    if site=="address":
        desired_city=str(city_tag.findPreviousSibling('city').contents)
        print desired_city

and here is what I get as output:

[u'Pecs']
[u'Baja']
[u'Balatonfured']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Cegled']
[u'Debrecen']
[u'Eger']
[u'Hodmezovasarhely']
[u'Miskolc']
[u'Nagykanizsa']
[u'Nyiregyh\xe1za']
[u'Pecs']
[u'Sopron']
[u'Szeged']
[u'Szekesfehervar']
[u'Szekszard']
[u'Zalaegerszeg']

The 7th element from the bottom, [u'Nyiregyh\xe1za'], for example, does not appear correctly.

You used str() to convert the object you have so it can be printed:

    desired_city=str(city_tag.findPreviousSibling('city').contents)
    print desired_city

Not only do you see the 'u' prefix that you asked about, but you also see [] and ''. That punctuation is part of how str() converts those types of objects to text: the [] indicates that you have a list object, and the u'' indicates that the object in the list is a unicode string, that is, text. When a list is converted to text, each element is shown in its repr() form, which is also why the á (U+00E1) appears as the escape \xe1 instead of the character itself. Note: Python 2 is quite sloppy in its handling of bytes versus characters. This sloppiness confuses many people, especially because code sometimes appears to work even when it is wrong, and then fails with other data or in other environments.
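To make the difference concrete, here is a minimal sketch. It assumes a hand-built list standing in for the .contents of a city tag (no network access or BeautifulSoup needed), and the variable names are only for illustration. It is written so that it should run on both Python 2.7 and Python 3, although the text form of the list differs between the two.

```python
# -*- coding: utf-8 -*-
# Assumption: this list stands in for city_tag.findPreviousSibling('city').contents.
contents = [u'Nyiregyh\xe1za']

# Text form of the whole list. On Python 2 this includes the brackets,
# the u'' quoting and the \xe1 escape; on Python 3 the u prefix is not shown.
as_list_text = str(contents)

# The element itself is the unicode string you actually want.
city = contents[0]

# \xe1 in the source denotes one character, U+00E1 (a with acute accent),
# not four separate characters.
accented = city[8]
length = len(city)
```

Here city holds 11 characters, and the one at index 8 is the accented letter. The escape only shows up when the list (or the repr() of the string) is turned into text.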

Since you have a list containing a unicode object, you want to print the object itself rather than the list that holds it:

    list_of_cities = city_tag.findPreviousSibling('city').contents
    desired_city = list_of_cities[0]
    print desired_city

Note that I assume the list of cities will have at least one element. The sample output you show is that way, but it would be good to check for error conditions too.
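One way to handle the empty case is sketched below. The helper name first_content and its default parameter are my own, not part of BeautifulSoup, and plain lists are used in place of a real tag's .contents so that the snippet runs on its own.

```python
def first_content(contents, default=None):
    """Return the first item of a contents list, or `default` if it is empty."""
    if not contents:
        return default
    return contents[0]


# Assumption: these lists stand in for city_tag.findPreviousSibling('city').contents.
print(first_content([u'Budapest']))           # Budapest
print(first_content([], default=u'unknown'))  # unknown
```

Keep in mind that findPreviousSibling() itself returns None when there is no matching sibling, so in real code you would probably want to check for that before reading .contents.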
