How to read in special characters to Python
I am parsing an XML file that contains special characters from foreign languages in some of the author names (í = &iacute;, ï = &iuml;, ò = &ograve;, etc.). My code fails with the error "ExpatError: undefined entity:" when it tries to process these characters. I have seen the BeautifulSoup library online, but I am unsure how to fit it into my code without rewriting everything to use the lxml library (if my understanding is correct). What is the best way to solve this? Cheers!
XML data to load
<pub>
<ID>75</ID>
<title>Use of Lexicon Density in Evaluating Word Recognizers</title>
<year>2000</year>
<booktitle>Multiple Classifier Systems</booktitle>
<pages>310-319</pages>
<authors>
<author>Petr Slav&iacute;k</author>
<author>Venu Govindaraju</author>
</authors>
</pub>
Python code
import sqlite3
con = sqlite3.connect("publications.db")
cur = con.cursor()
from xml.dom import minidom
xmldoc = minidom.parse("test.xml")
#loop through <pub> tags to find number of pubs to grab
root = xmldoc.getElementsByTagName("root")[0]
pubs = [a.firstChild.data for a in root.getElementsByTagName("pub")]
num_pubs = len(pubs)
count = 0
while(count < num_pubs):
    #get data from each <pub> tag
    temp_pub = root.getElementsByTagName("pub")[count]
    temp_ID = temp_pub.getElementsByTagName("ID")[0].firstChild.data
    temp_title = temp_pub.getElementsByTagName("title")[0].firstChild.data
    temp_year = temp_pub.getElementsByTagName("year")[0].firstChild.data
    temp_booktitle = temp_pub.getElementsByTagName("booktitle")[0].firstChild.data
    temp_pages = temp_pub.getElementsByTagName("pages")[0].firstChild.data
    temp_authors = temp_pub.getElementsByTagName("authors")[0]
    temp_author_array = [a.firstChild.data for a in temp_authors.getElementsByTagName("author")]
    num_authors = len(temp_author_array)
    count = count + 1
    #process results into sqlite
    pub_params = (temp_ID, temp_title)
    cur.execute("INSERT INTO publication (id, ptitle) VALUES (?, ?)", pub_params)
    journal_params = (temp_booktitle, temp_pages, temp_year)
    cur.execute("INSERT INTO journal (jtitle, pages, year) VALUES (?, ?, ?)", journal_params)
    x = 0
    while(x < num_authors):
        cur.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (temp_author_array[x],))
        x = x + 1
    #display results
    print("\nEntry processed: ", count)
    print("------------------\nPublication ID: ", temp_ID)
    print("Publication Title: ", temp_title)
    print("Year: ", temp_year)
    print("Journal title: ", temp_booktitle)
    print("Pages: ", temp_pages)
    i = 0
    print("Authors: ")
    while(i < num_authors):
        print("-",temp_author_array[i])
        i = i + 1
con.commit()
con.close()
print("\nNumber of entries processed: ", count)
If you are using Python 3.x, you can decode the data you have extracted by importing the html module and calling html.unescape, which the documentation describes as follows:
Convert all named and numeric character references (e.g. &gt;, &#62;, &#x3e;) in the string s to the corresponding Unicode characters.
>>> import html
>>> print(html.unescape("Petr Slav&iacute;k"))
Petr Slavík
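To illustrate the three reference forms mentioned above, here is a minimal sketch. The sample strings are made up for illustration; the only fact relied on is that í is U+00ED (decimal 237).

```python
import html

# Named, decimal, and hexadecimal references for the same character (í, U+00ED).
samples = ["Petr Slav&iacute;k", "Petr Slav&#237;k", "Petr Slav&#xed;k"]

# html.unescape handles all three forms in one call per string.
decoded = [html.unescape(s) for s in samples]
print(decoded)
```

All three inputs should decode to the same name, so this works on strings regardless of which form the file happens to use.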
It seems minidom cannot parse these HTML-safe characters and return a Document object. You have to read the file and decode it yourself, then pass the result as a string to the module with minidom.parseString, as in the following code. The documentation describes parseString as:
Return a Document that represents the string.
file_text = html.unescape(open('test.xml', 'r').read())
xmldoc = minidom.parseString(file_text)
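One caveat with unescaping the whole file: html.unescape also decodes &amp;amp; and &amp;lt;, which can make the XML invalid if a title contains an escaped ampersand or angle bracket. A possible workaround, sketched below under that assumption, is to rewrite only the HTML-specific named entities as numeric references and leave the five entities XML predefines untouched. The helper name replace_html_entities and the sample string are hypothetical, not part of the original code.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# The five entities XML itself predefines; these must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

def replace_html_entities(text):
    """Rewrite HTML-only named entities as numeric character references."""
    def substitute(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return "&#{};".format(name2codepoint[name])
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", substitute, text)

# Hypothetical sample: one HTML entity and one escaped ampersand.
sample = "<pub><title>Q&amp;A</title><author>Petr Slav&iacute;k</author></pub>"
doc = minidom.parseString(replace_html_entities(sample))
author = doc.getElementsByTagName("author")[0].firstChild.data
title = doc.getElementsByTagName("title")[0].firstChild.data
print(author)
print(title)
```

For a real file, the same helper could be applied to the text read from test.xml before calling minidom.parseString. Numeric references are understood by any XML parser, so the rest of the minidom code should not need to change.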
UTF-8 supports most of these characters, so encoding the output as UTF-8 should work. Add the following at the end of the example:
xmldoc = minidom.parse("test.xml")
NewXML = xmldoc.toxml(encoding="utf-8")  # a Document has no .encode(); toxml() serializes it to UTF-8 bytes
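On the encoding side, here is a minimal sketch, assuming the XML already contains the literal UTF-8 character rather than an entity reference such as &amp;iacute;. Encoding by itself does not define that entity for the parser, so this only covers the case where the file stores í directly. The sample data is made up for illustration.

```python
from xml.dom import minidom

# Hypothetical input: UTF-8 bytes containing the literal character, no entity.
xml_bytes = "<author>Petr Slavík</author>".encode("utf-8")

doc = minidom.parseString(xml_bytes)
name = doc.documentElement.firstChild.data

# toxml(encoding=...) returns bytes in the requested encoding.
out = doc.toxml(encoding="utf-8")
print(name)
print(out)
```

The parsed text is an ordinary Python str, so it can be passed to sqlite3 as a query parameter without further encoding.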