
Strings with special characters in Python do not appear correctly

I have parsed some text (names of cities) from a website into a list using BeautifulSoup, but I have run into a problem that I cannot overcome. The text elements on the website contain special characters, and when I print the list the city names are shown as [u'London'], with numbers and letters in place of the special characters. How can I get rid of the 'u' at the beginning and convert the text to the form in which it originally appears on the website?

Here is the code:

import urllib2
from bs4 import BeautifulSoup

address = 'https://clinicaltrials.gov/ct2/show/NCT02226120?resultsxml=true'

page = urllib2.urlopen(address)
soup = BeautifulSoup(page)
locations = soup.findAll('country', text="Hungary")
for city_tag in locations:
    site=city_tag.parent.name
    if site=="address":
        desired_city=str(city_tag.findPreviousSibling('city').contents)
        print desired_city

And here is what I get as output:

[u'Pecs']
[u'Baja']
[u'Balatonfured']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Budapest']
[u'Cegled']
[u'Debrecen']
[u'Eger']
[u'Hodmezovasarhely']
[u'Miskolc']
[u'Nagykanizsa']
[u'Nyiregyh\xe1za']
[u'Pecs']
[u'Sopron']
[u'Szeged']
[u'Szekesfehervar']
[u'Szekszard']
[u'Zalaegerszeg']

The 7th element from the bottom, [u'Nyiregyh\xe1za'], for example, does not appear correctly.

You used str() to convert the object you have so that it can be printed:

    desired_city=str(city_tag.findPreviousSibling('city').contents)
    print desired_city

Not only do you see the 'u' prefix that you asked about, but you also see [] and ''. That punctuation is part of how str() converts those types of objects to text: the [] indicates that you have a list object, and the u'' indicates that the object in the list is "text" (a unicode object). The \xe1 comes from the same conversion: str() of a list shows each element in its escaped repr() form, so a non-ASCII character appears as an escape code. Note: Python 2 is quite sloppy in its handling of bytes versus characters. This sloppiness confuses many people, especially because code that is wrong sometimes appears to work and then fails with other data or in other environments.
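To make the difference concrete, here is a minimal sketch that does not need BeautifulSoup or a network connection. The list `contents` is illustrative data built by hand from one line of the output in the question; it stands in for what `.contents` returned. The code is written so that it should run on both Python 2 and Python 3.

```python
# Illustrative data: one city name copied from the output in the question.
contents = [u'Nyiregyh\xe1za']      # a list holding one text (unicode) object

as_list_text = str(contents)        # text form of the whole list
city = contents[0]                  # the text object itself

# str() of a list is built from repr() of each element, wrapped in [].
# On Python 2 that repr is u'Nyiregyh\xe1za' (prefix and escape code);
# on Python 3 the prefix is gone and the character is shown directly.
same = as_list_text == '[' + repr(city) + ']'

# \xe1 is one character (U+00E1), not four: the name is 11 characters long.
length_in_characters = len(city)

# Bytes are a different thing from characters: in UTF-8 that one
# character takes two bytes, so the encoded name is 12 bytes long.
utf8_bytes = city.encode('utf-8')
length_in_bytes = len(utf8_bytes)
```

The value you want to show to a reader is `city`, not `as_list_text`.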

Since you have a list containing a unicode object, what you want to print is that object's value:

    list_of_cities = city_tag.findPreviousSibling('city').contents
    desired_city = list_of_cities[0]
    print desired_city

Note that I assume the list of cities will have at least one element. The sample output you show is that way, but it would be good to check for error conditions too.
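One possible way to do that check is sketched below. The helper name `city_name` and the `StandInTag` class are my own inventions for illustration and are not part of BeautifulSoup; the stand-in only mimics the `.contents` attribute so the example runs without parsing a page. The sketch covers two cases: no 'city' sibling was found (the lookup returned None), and the element was found but has no children.

```python
def city_name(city_element, default=None):
    """Return the first child of city_element, or default when there is none.

    city_element is whatever findPreviousSibling('city') returned:
    an object with a .contents list, or None if nothing was found.
    """
    if city_element is None:
        return default
    contents = city_element.contents
    if not contents:                 # empty list: the tag has no children
        return default
    return contents[0]


class StandInTag(object):
    """Hypothetical stand-in for a parsed tag, used only in this demo."""
    def __init__(self, contents):
        self.contents = contents


found = city_name(StandInTag([u'Nyiregyh\xe1za']))
missing = city_name(None, default=u'(unknown)')
empty = city_name(StandInTag([]), default=u'(unknown)')
```

In the loop from the question, this would be called as `city_name(city_tag.findPreviousSibling('city'))`, and the result printed only when it is not None.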


 