Problems parsing XML with lxml
I've been trying to parse an XML feed into a Pandas DataFrame, but I can't work out where I'm going wrong.
import pandas as pd
import requests
import lxml.objectify
path = "http://www2.cineworld.co.uk/syndication/listings.xml"
xml = lxml.objectify.parse(path)
root = xml.getroot()
The next part of the code parses out the pieces I want and builds a list of show dictionaries.
shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = rec
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = rec
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            #print show
            shows_list.append(rec)
df = pd.DataFrame(shows_list)
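When I run the code, the film and time fields appear to be duplicated many times across the rows. However, if I put a print statement into the code (it is commented out above), the dictionaries look exactly as I would expect.
What am I doing wrong? Please also feel free to let me know if there is a more Pythonic way to do the parsing.
EDIT: To clarify:
If I use the print statement to check what happens while looping, these are the last few records printed.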
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729365&seats=STANDARD', 'time': '2016-02-07T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729366&seats=STANDARD', 'time': '2016-02-08T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729367&seats=STANDARD', 'time': '2016-02-09T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729368&seats=STANDARD', 'time': '2016-02-10T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729369&seats=STANDARD', 'time': '2016-02-11T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'PG', 'name': 'Cineworld Stoke-on-Trent', 'title': 'Autism Friendly Screening - Goosebumps', 'url': '/booking?performance=4782937&seats=STANDARD', 'time': '2016-02-07T11:00:00'}
And this is the end of the list: ...
{'info': 'http://cineworld.co.uk/cinemas/107/information',
'name': 'Cineworld Stoke-on-Trent',
'rating': 'PG',
'time': '2016-02-07T11:00:00',
'title': 'Autism Friendly Screening - Goosebumps',
'url': '/booking?performance=4782937&seats=STANDARD'},
{'info': 'http://cineworld.co.uk/cinemas/107/information',
'name': 'Cineworld Stoke-on-Trent',
'rating': 'PG',
'time': '2016-02-07T11:00:00',
'title': 'Autism Friendly Screening - Goosebumps',
'url': '/booking?performance=4782937&seats=STANDARD'},
{'info': 'http://cineworld.co.uk/cinemas/107/information',
'name': 'Cineworld Stoke-on-Trent',
'rating': 'PG',
'time': '2016-02-07T11:00:00',
'title': 'Autism Friendly Screening - Goosebumps',
'url': '/booking?performance=4782937&seats=STANDARD'},
{'info': 'http://cineworld.co.uk/cinemas/107/information',
'name': 'Cineworld Stoke-on-Trent',
'rating': 'PG',
'time': '2016-02-07T11:00:00',
'title': 'Autism Friendly Screening - Goosebumps',
'url': '/booking?performance=4782937&seats=STANDARD'}]
Your code only ever has one object, rec, which keeps being updated. Try this:
from copy import copy
shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = copy(rec)  # New object
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = copy(film)  # New object, changed reference
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            # print(show)
            shows_list.append(show)  # Changed reference
df = pd.DataFrame(shows_list)
With this structure, the data in rec is copied into each film, and the data in each film is copied into each show. Finally, show is appended to shows_list.
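As for the request for a more Pythonic approach, one possible sketch is to build a fresh dictionary for every show by merging the cinema, film and show fields. The example below makes several assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml.objectify so that it runs without extra packages, the sample XML is invented (only the element and attribute names are taken from the question), and the helper name flatten is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed; only the element and attribute names
# (cinema, listing, film, shows, show, ...) come from the question.
SAMPLE = """
<cinemas>
  <cinema name="Cinema A" root="http://example.com" url="/cinemas/1">
    <listing>
      <film title="Film One" rating="PG">
        <shows>
          <show time="2016-02-07T11:00:00" url="/booking?performance=1"/>
          <show time="2016-02-08T11:00:00" url="/booking?performance=2"/>
        </shows>
      </film>
      <film title="Film Two" rating="TBC">
        <shows>
          <show time="2016-02-07T20:45:00" url="/booking?performance=3"/>
        </shows>
      </film>
    </listing>
  </cinema>
</cinemas>
"""


def flatten(root):
    """Return one new dict per <show>, merged with its film and cinema fields."""
    rows = []
    for cinema in root.iter('cinema'):
        cinema_fields = {
            'name': cinema.get('name'),
            'info': cinema.get('root') + cinema.get('url'),
        }
        for film in cinema.iter('film'):
            film_fields = {
                'title': film.get('title'),
                'rating': film.get('rating'),
            }
            for show in film.iter('show'):
                # The dict display creates a brand-new dict on every pass,
                # so no two rows share the same underlying object.
                rows.append({**cinema_fields, **film_fields,
                             'time': show.get('time'),
                             'url': show.get('url')})
    return rows


rows = flatten(ET.fromstring(SAMPLE))
print(len(rows))
```

The resulting rows list can be passed to pd.DataFrame in the same way as shows_list. Because each row is created inside the innermost loop, no explicit copy call is needed.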
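You may want to read this article to learn more about what happens in the line film = rec: it gives the original dictionary another name rather than creating a new dictionary.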
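A minimal sketch of the difference, using made-up dictionary values rather than the real feed (the names alias and clone are only for illustration):

```python
from copy import copy

rec = {'name': 'Cineworld Stoke-on-Trent'}

# Plain assignment binds a second name to the same dict object.
alias = rec
alias['title'] = "Dad's Army"
print(alias is rec)    # True
print('title' in rec)  # True: the write through alias changed rec too

# copy() builds a new (shallow) dict, so later writes stay separate.
clone = copy(rec)
clone['title'] = 'Goosebumps'
print(clone is rec)    # False
print(rec['title'])    # Dad's Army
```

A shallow copy is enough here because every value in these dictionaries is a string. Nested mutable values would still be shared between the copies.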