
Python: remove Unicode characters

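import sys
import nltk
import unicodedata
import pymongo

# pymongo.Connection is the legacy client class; newer PyMongo releases use MongoClient.
conn = pymongo.Connection('mongodb://localhost:27017')
# The original post does not show how `collection` is obtained;
# the database and collection names below are placeholders.
collection = conn['mydb']['jobs']

def jd_extract():
    # Return the 'jd' field of the first document in the collection.
    for item in collection.find({}, limit=1):
        return item['jd']

res = jd_extract()
print res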

This prints:

[u'Software Engineer II', , u' ', , u' ', , u' ', Skills: C#,WPF,SQL , u' ', , u' ', Experience: 3-4.5 Yrs , u' ', , u' ', Job Location:- Gurgaon/Noida , u' ', , u' ', Job Summary: , u' ', The Software Engineer II's role is to develop and manage the application code for a system or part of a project. The Software Engineer II role typically has skills to work with multiple platforms and/or services. , u' ',   , u' ',   , u' \xa0',  , u' ', , u' ', ][u' ', Salary: , u'\n', Not Disclosed by Recruiter , u'\n', , u'\n'][u' ', Industry: , u'\n', IT-Software / Software Services , u'\n', , u'\n'][u' ', Functional Area: , u'\n', IT Software - Application Programming, Maintenance , u'\n', , u'\n'][u' ', Role Category: , u'\n', Programming & Design , u'\n', , u'\n'][u' ', Role: , u'\n', Software Developer , u'\n', , u'\n'][u' ', Keyskills: , u'\n', wpf C# Sql Programming , u'\n', , u'\n'][u' ', Education: , u'\n', 
    UG - Any Graduate - Any Specialization, Graduation Not Required    
     PG - Any Postgraduate - Any Specialization, Post Graduation Not Required     
     Doctorate - Any Doctorate - Any Specialization, Doctorate Not Required      , u'\n', , u'\n']

I want to remove the Unicode characters from res. I tried str(res), but it does not work.

Try encoding the Unicode strings as UTF-8:

res = [s.encode('utf-8') for s in res]

Or, if you prefer a for loop:

encoded_strings = []
for s in res:
    encoded_strings.append(s.encode('utf-8'))
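Note that encoding to UTF-8 keeps characters such as the u'\xa0' (non-breaking space) seen in the output; it only changes how they are stored. If the goal is to actually drop non-ASCII characters, one possible approach is sketched below. It is written for Python 3 (where every str is Unicode) and uses only the standard library; the helper name to_ascii and the sample list are hypothetical, not part of the original answer.

```python
import unicodedata

def to_ascii(text):
    """Return text with non-ASCII characters removed.

    NFKD normalization first splits compatibility characters into simpler
    ones (for example, a non-breaking space becomes a plain space), then
    anything that still has no ASCII form is dropped.
    """
    normalized = unicodedata.normalize('NFKD', text)
    return normalized.encode('ascii', 'ignore').decode('ascii')

# Hypothetical sample resembling the items in res.
sample = ['Software Engineer II', ' \xa0', 'caf\xe9', 'Skills: C#,WPF,SQL']
cleaned = [to_ascii(s) for s in sample]
print(cleaned)
```

Whether silently dropping characters is acceptable depends on the data; for text that is mostly non-Latin, this would discard most of the content.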

As I understand it, you want to remove the u'' prefix when you print res (a list of Unicode strings). You could print each string individually:

for unicode_string in res:
    print unicode_string

The reason you see u'' is that print some_list calls repr(item) on each item in the list, and u'..' is the Unicode string literal syntax in Python 2:

>>> print [u'a']
[u'a']
>>> print repr(u'a')
u'a'
>>> print u'a'
a
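The same distinction can be checked without print: converting a list to text uses repr() on each element, while joining the elements uses the string contents themselves. A small sketch, assuming Python 3 (where the u prefix is no longer shown, but quotes and escapes such as \xa0 still are); the sample list is made up for illustration.

```python
items = ['Salary:', '\xa0', 'Not Disclosed by Recruiter']

# str(list) calls repr() on every element, so quotes and escapes appear.
as_list = str(items)

# Joining uses the string contents directly, so no quotes or escapes appear.
as_text = ' '.join(items)

print(as_list)
print(as_text)
```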

A list containing str, unicode and int values:

>>> item_list = [ 'a', 3, u'b', 5, u'c', 8, 'd', 13, 'e' ]
>>> print item_list
['a', 3, u'b', 5, u'c', 8, 'd', 13, 'e']

Convert the unicode items to str (in Python 2, str() uses the ASCII codec, so this raises UnicodeEncodeError if an item contains non-ASCII characters):

>>> item_list = [ str(item) if isinstance(item, unicode) else item for item in item_list  ]
>>> print item_list
['a', 3, 'b', 5, 'c', 8, 'd', 13, 'e']

Convert the str items to unicode:

>>> item_list = [ unicode(item) if isinstance(item, str) else item for item in item_list  ]
>>> print item_list
[u'a', 3, u'b', 5, u'c', 8, u'd', 13, u'e']

In Python 2, str and unicode are both subclasses of basestring.
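In Python 3 the types are named differently: str holds Unicode text, bytes holds encoded data, and basestring no longer exists. A rough sketch of the same mixed-list conversion under that assumption follows; the helper name decode_items and the choice of UTF-8 are illustrative, not taken from the original answer.

```python
def decode_items(items, encoding='utf-8'):
    """Decode bytes elements to str and leave every other element unchanged."""
    return [
        item.decode(encoding) if isinstance(item, bytes) else item
        for item in items
    ]

# Mixed list of bytes, str and int values.
item_list = [b'a', 3, 'b', 5, b'caf\xc3\xa9', 8]
print(decode_items(item_list))
```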
