Problem parsing XML with lxml

Question

I've been trying to parse an XML feed into a Pandas DataFrame and can't work out where I'm going wrong.

import pandas as pd
import requests
import lxml.objectify
path = "http://www2.cineworld.co.uk/syndication/listings.xml"

xml = lxml.objectify.parse(path)
root = xml.getroot()

The next bit of code walks through the parts I want and builds a list of show dictionaries.

shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = rec
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = rec
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            #print show
            shows_list.append(rec)

df = pd.DataFrame(shows_list)

When I run the code, the film and time fields are repeated across many rows. However, if I put a print statement into the loop (it's commented out above), each dictionary looks as I would expect.

What am I doing wrong? Please also let me know if there's a more Pythonic way of doing the parsing.

EDIT: To clarify:

These are the last few rows of the data when I use a print statement to check what's happening as I loop through.

{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729365&seats=STANDARD', 'time': '2016-02-07T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729366&seats=STANDARD', 'time': '2016-02-08T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729367&seats=STANDARD', 'time': '2016-02-09T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729368&seats=STANDARD', 'time': '2016-02-10T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'TBC', 'name': 'Cineworld Stoke-on-Trent', 'title': "Dad's Army", 'url': '/booking?performance=4729369&seats=STANDARD', 'time': '2016-02-11T20:45:00'}
{'info': 'http://cineworld.co.uk/cinemas/107/information', 'rating': 'PG', 'name': 'Cineworld Stoke-on-Trent', 'title': 'Autism Friendly Screening - Goosebumps', 'url': '/booking?performance=4782937&seats=STANDARD', 'time': '2016-02-07T11:00:00'}

This is the end of the list: ...

{'info': 'http://cineworld.co.uk/cinemas/107/information',
  'name': 'Cineworld Stoke-on-Trent',
  'rating': 'PG',
  'time': '2016-02-07T11:00:00',
  'title': 'Autism Friendly Screening - Goosebumps',
  'url': '/booking?performance=4782937&seats=STANDARD'},
 {'info': 'http://cineworld.co.uk/cinemas/107/information',
  'name': 'Cineworld Stoke-on-Trent',
  'rating': 'PG',
  'time': '2016-02-07T11:00:00',
  'title': 'Autism Friendly Screening - Goosebumps',
  'url': '/booking?performance=4782937&seats=STANDARD'},
 {'info': 'http://cineworld.co.uk/cinemas/107/information',
  'name': 'Cineworld Stoke-on-Trent',
  'rating': 'PG',
  'time': '2016-02-07T11:00:00',
  'title': 'Autism Friendly Screening - Goosebumps',
  'url': '/booking?performance=4782937&seats=STANDARD'},
 {'info': 'http://cineworld.co.uk/cinemas/107/information',
  'name': 'Cineworld Stoke-on-Trent',
  'rating': 'PG',
  'time': '2016-02-07T11:00:00',
  'title': 'Autism Friendly Screening - Goosebumps',
  'url': '/booking?performance=4782937&seats=STANDARD'}]

Answer 1

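Your code only has one object that keeps getting updated: rec. Every entry appended to shows_list is that same dictionary, so all rows for a cinema end up showing the last values written to it. Try this: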

from copy import copy
shows_list = []
for r in root.cinema:
    rec = {}
    rec['name'] = r.attrib['name']
    rec['info'] = r.attrib["root"] + r.attrib['url']
    listing = r.find("listing")
    for f in listing.film:
        film = copy(rec) # New object
        film['title'] = f.attrib['title']
        film['rating'] = f.attrib['rating']
        shows = f.find("shows")
        for s in shows['show']:
            show = copy(film) # New object, changed reference
            show['time'] = s.attrib['time']
            show['url'] = s.attrib['url']
            #print show
            shows_list.append(show) # Changed reference

df = pd.DataFrame(shows_list)  # fixed name: the list is shows_list, not show_list

With this structure, the data in rec is copied into each film, and the data in each film is copied into each show. Then, at the end, show is appended to shows_list.
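As for a more Pythonic way of doing the parsing, one option is to build a fresh dictionary per show instead of copying and mutating. The sketch below is only an illustration: it uses the standard library's xml.etree.ElementTree rather than lxml.objectify so that it runs without extra dependencies, the SAMPLE document and its cinemas root element are made up, and flatten_shows is a hypothetical helper name. The element and attribute names mirror those used in the question's code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The nesting (cinema > listing > film > shows > show)
# and attribute names are inferred from the question's code, not from the real feed.
SAMPLE = """
<cinemas>
  <cinema name="Cineworld Stoke-on-Trent" root="http://cineworld.co.uk" url="/cinemas/107/information">
    <listing>
      <film title="Dad's Army" rating="TBC">
        <shows>
          <show time="2016-02-10T20:45:00" url="/booking?performance=4729368&amp;seats=STANDARD"/>
          <show time="2016-02-11T20:45:00" url="/booking?performance=4729369&amp;seats=STANDARD"/>
        </shows>
      </film>
      <film title="Autism Friendly Screening - Goosebumps" rating="PG">
        <shows>
          <show time="2016-02-07T11:00:00" url="/booking?performance=4782937&amp;seats=STANDARD"/>
        </shows>
      </film>
    </listing>
  </cinema>
</cinemas>
"""


def flatten_shows(root):
    """Return one new dict per show, carrying its cinema and film fields."""
    rows = []
    for cinema in root.iter("cinema"):
        cinema_fields = {
            "name": cinema.get("name"),
            "info": cinema.get("root") + cinema.get("url"),
        }
        for film in cinema.iter("film"):
            film_fields = {"title": film.get("title"), "rating": film.get("rating")}
            for show in film.iter("show"):
                # Merging into a new dict means no two rows share the same object.
                rows.append({
                    **cinema_fields,
                    **film_fields,
                    "time": show.get("time"),
                    "url": show.get("url"),
                })
    return rows


rows = flatten_shows(ET.fromstring(SAMPLE))
for row in rows:
    print(row["title"], row["time"])
```

The resulting list can be passed to pd.DataFrame(rows) in the same way as shows_list. Because each row is a new dictionary, no copy() calls are needed.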

You might want to read this article to learn more about what's happening in your line film = rec, i.e. you are giving another name to the original dictionary rather than creating a new dictionary.
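The difference between naming and copying can be seen in a few lines. This is a minimal sketch with made-up values; alias is a hypothetical variable name used only for the demonstration.

```python
from copy import copy

rec = {"name": "Cineworld Stoke-on-Trent"}

# Plain assignment binds a second name to the same dict object.
alias = rec
alias["title"] = "Dad's Army"
print(alias is rec)    # True: one object, two names
print("title" in rec)  # True: the change is visible through rec as well

# copy() builds a new (shallow) dict, so later changes stay separate.
film = copy(rec)
film["title"] = "Goosebumps"
print(film is rec)     # False
print(rec["title"])    # Dad's Army
```

A shallow copy is enough here because all the values are strings. If the dictionary held nested lists or dictionaries, copy.deepcopy would be needed to separate those as well.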

Answer 1: accepted, 2016-01-19 16:07:12