I am trying to parse the DBLP data set using lxml in Python. However, it fails with this error:
lxml.etree.XMLSyntaxError: Entity 'uuml' not defined, line 54, column 43
DBLP does provide a DTD file that defines these entities. How can I use that file to parse the DBLP XML document?
Here is my current code:
import sqlite3
import sys

from lxml import etree as ET

filename = sys.argv[1]
dtd_name = sys.argv[2]
db_name = sys.argv[3]
conn = sqlite3.connect(db_name)
# note: the DBLP tag is 'mastersthesis' (with an 's')
dblp_record_types_for_publications = ('article', 'inproceedings', 'proceedings', 'book', 'incollection',
                                      'phdthesis', 'mastersthesis', 'www')
# read dtd (this object is never passed to the parser below)
dtd = ET.DTD(dtd_name) #pylint: disable=E1101
# get an iterable
context = ET.iterparse(filename, events=('start', 'end'), load_dtd=True, #pylint: disable=E1101
                       resolve_entities=True)
# turn it into an iterator
context = iter(context)
# get the root element
event, root = next(context)
n_records_parsed = 0
for event, elem in context:
    if event == 'end' and elem.tag in dblp_record_types_for_publications:
        pub_year = None
        for year in elem.findall('year'):
            pub_year = year.text
        if pub_year is None:
            continue
        pub_title = None
        for title in elem.findall('title'):
            pub_title = title.text
        if pub_title is None:
            continue
        pub_authors = []
        for author in elem.findall('author'):
            if author.text is not None:
                pub_authors.append(author.text)
        # print(pub_year)
        # print(pub_title)
        # print(pub_authors)
        # insert the publication, authors in sql tables;
        # "?" placeholders let sqlite3 handle quoting instead of manual escaping
        conn.execute("INSERT OR IGNORE INTO publications VALUES (?, ?)",
                     (pub_title, pub_year))
        for author in pub_authors:
            conn.execute("INSERT OR IGNORE INTO authors VALUES (?)", (author,))
            conn.execute("INSERT INTO authored VALUES (?, ?)", (author, pub_title))
        # free memory held by already-processed elements
        elem.clear()
        root.clear()
        n_records_parsed += 1
print("No. of records parsed: {}".format(n_records_parsed))
conn.commit()
conn.close()
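You can add a custom URI resolver (see https://lxml.de/resolvers.html):
import os
from lxml import etree

class DTDResolver(etree.Resolver):
    def resolve(self, system_url, public_id, context):
        # the path is joined with system_url, so it should be the directory that holds the DTD
        return self.resolve_filename(os.path.join("/path/to/dtd/file", system_url), context)

# here `context` is the iterparse object from the question
context.resolvers.add(DTDResolver())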
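To make the resolver idea concrete, here is a minimal self-contained sketch. It is not the code from the answer above: it attaches the resolver to an `etree.XMLParser` rather than to the `iterparse` object, parses a tiny in-memory document, and uses a one-line stand-in DTD written to a temporary directory. The `dtd_dir` constructor argument, the sample author name, and the use of `os.path.basename` (to cope with the case where the parser hands over an already-absolute URL) are assumptions made for illustration.

```python
import io
import os
import tempfile

from lxml import etree


class DTDResolver(etree.Resolver):
    """Redirect external DTD lookups to a fixed local directory."""

    def __init__(self, dtd_dir):
        super().__init__()
        self.dtd_dir = dtd_dir  # hypothetical parameter: where the DTD lives

    def resolve(self, system_url, public_id, context):
        # Keep only the file name so an absolute system_url does not
        # override the directory we want to look in.
        local_path = os.path.join(self.dtd_dir, os.path.basename(system_url))
        return self.resolve_filename(local_path, context)


# Stand-in for the real dblp.dtd: it only declares the one entity we need.
DTD_TEXT = '<!ENTITY uuml "&#252;">\n'

# Made-up record that references the entity, like the DBLP data does.
XML_BYTES = (
    b'<!DOCTYPE dblp SYSTEM "dblp.dtd">\n'
    b"<dblp><article><author>J&uuml;rgen Example</author></article></dblp>"
)

with tempfile.TemporaryDirectory() as dtd_dir:
    with open(os.path.join(dtd_dir, "dblp.dtd"), "w", encoding="ascii") as f:
        f.write(DTD_TEXT)

    # load_dtd makes the parser fetch the external DTD (via our resolver);
    # resolve_entities replaces &uuml; with the character it stands for.
    parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
    parser.resolvers.add(DTDResolver(dtd_dir))
    root = etree.parse(io.BytesIO(XML_BYTES), parser).getroot()
    author_text = root.findtext("article/author")

print(author_text)
```

With the real data you would point `dtd_dir` at the directory containing the downloaded `dblp.dtd` and keep the rest of the parsing loop unchanged.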
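After saving the DTD file in the same directory as the XML file and making sure its file name matches the one in the XML document's doctype declaration (<!DOCTYPE dblp SYSTEM "dblp.dtd">), as mzjn suggested in a comment, the parser no longer raises the syntax error.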
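The same-directory fix can be reproduced in miniature. The sketch below is only an illustration under stated assumptions: it writes a stand-in `dblp.dtd` (declaring just `uuml`, not the real DBLP DTD) and a one-record `dblp.xml` with made-up content into a temporary directory, then parses with the same `load_dtd=True, resolve_entities=True` options used in the question. The helper name `parse_authors` is hypothetical.

```python
import os
import tempfile

from lxml import etree

# Stand-in DTD: the real dblp.dtd declares many more elements and entities.
DTD_TEXT = '<!ENTITY uuml "&#252;">\n'

# Made-up record; the doctype names the DTD file relative to the XML file.
XML_TEXT = """<?xml version="1.0"?>
<!DOCTYPE dblp SYSTEM "dblp.dtd">
<dblp>
<article>
<author>J&uuml;rgen Example</author>
<title>A Hypothetical Title.</title>
<year>2001</year>
</article>
</dblp>
"""


def parse_authors(xml_path):
    """Collect author names, letting lxml load the DTD next to the file."""
    authors = []
    for _event, elem in etree.iterparse(xml_path, events=("end",), tag="author",
                                        load_dtd=True, resolve_entities=True):
        authors.append(elem.text)
        elem.clear()  # release the element once its text has been read
    return authors


with tempfile.TemporaryDirectory() as workdir:
    # Both files share one directory, and the DTD file name matches the doctype.
    with open(os.path.join(workdir, "dblp.dtd"), "w", encoding="utf-8") as f:
        f.write(DTD_TEXT)
    xml_path = os.path.join(workdir, "dblp.xml")
    with open(xml_path, "w", encoding="utf-8") as f:
        f.write(XML_TEXT)

    authors = parse_authors(xml_path)

print(authors)
```

If the DTD file is renamed or moved elsewhere, the relative system identifier no longer resolves and the entity lookup should fail again, which matches the error in the question.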