
Python 3 writing large (300+ MB) XML with lxml

I've been googling for the past few days, but I simply couldn't find any remotely similar problem :(

My script in Python 3 has a simple objective:

  1. connect to MySQL database and fetch data
  2. create XML with lxml
  3. save that XML to file

Usually I have no problems with XML files containing 5000+ elements, but in this case I'm hitting the maximum memory usage of my VPS (Amazon EC2 micro). My code (core part):

# imports (the Trips / TripsImages models, config() and target_xml are defined elsewhere in the project)
from datetime import datetime

from lxml import etree
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(config('DB_URI'))
Session = sessionmaker(bind=engine)
session = Session()

query = session.query(Trips.Country,
                      Trips.Region,
                      Trips.Name,
                      Trips.Rebate,
                      Trips.Stars,
                      Trips.PromotionName,
                      Trips.ProductURL,
                      Trips.SubProductURL,
                      Trips.Date,
                      Trips.City,
                      Trips.Type,
                      Trips.Price,
                      TripsImages.ImageURL) \
    .join(TripsImages) \
    .all()

# define namespace xmlns:g
XMLNS = "{http://base.google.com/ns/1.0}"
NSMAP = {"g": "http://base.google.com/ns/1.0"}

# create root rss and channel
rss = etree.Element("rss", nsmap=NSMAP, attrib={"version": "2.0"})
channel = etree.SubElement(rss, "channel", attrib={"generated": str(datetime.now())})

# add <channel> title and description
channel_title = etree.SubElement(channel, "title")
channel_link = etree.SubElement(channel, "link")
channel_description = etree.SubElement(channel, "description")

channel_title.text = "Trips"
channel_link.text = "https://example.com"
channel_description.text = "Description"

# generate xml elements
for count, elem in enumerate(query):
    item = etree.SubElement(channel, "item")

    url = "/".join(["https://example.com",
                    elem.ProductURL,
                    elem.SubProductURL,
                    datetime.strftime(elem.Date, '%Y%m%d')
                    ])
    price_discounted = round(elem.Price - elem.Price * (elem.Rebate / 100))

    etree.SubElement(item, XMLNS + "id").text = str(count)
    etree.SubElement(item, XMLNS + "title").text = elem.Country
    etree.SubElement(item, XMLNS + "description").text = elem.Product
    etree.SubElement(item, XMLNS + "link").text = url
    etree.SubElement(item, XMLNS + "image_link").text = elem.ImageURL
    etree.SubElement(item, XMLNS + "condition").text = "new"
    etree.SubElement(item, XMLNS + "availability").text = "in stock"
    etree.SubElement(item, XMLNS + "price").text = str(elem.Price)
    etree.SubElement(item, XMLNS + "sale_price").text = str(price_discounted)
    etree.SubElement(item, XMLNS + "brand").text = "Brand"
    etree.SubElement(item, XMLNS + "additional_image_link").text = elem.ImageURL
    etree.SubElement(item, XMLNS + "custom_label_0").text = elem.Date.strftime("%Y-%m-%d")
    etree.SubElement(item, XMLNS + "custom_label_1").text = elem.Type
    etree.SubElement(item, XMLNS + "custom_label_2").text = str(elem.Stars / 10)
    etree.SubElement(item, XMLNS + "custom_label_3").text = elem.City
    etree.SubElement(item, XMLNS + "custom_label_4").text = elem.Country
    etree.SubElement(item, XMLNS + "custom_label_5").text = elem.PromotionName


# finally, serialize XML and save as file
with open(target_xml, "wb") as file:
    file.write(etree.tostring(rss, encoding="utf-8", pretty_print=True))

I'm using SQLAlchemy to query the database and lxml to generate the XML file. Fetching the data from the DB already creates a list with 228,890 elements, which uses a lot of memory. Building the XML then creates another object in memory, resulting in a total use of approximately 1.5 GB of RAM.

This code works fine on my laptop with 8 GB of RAM, but when executing it on Amazon EC2 with 1 GB of RAM and 1 GB of swap, I hit the write() operation and get a 'Killed' response from Linux.

There's a lot covered on Stack Overflow when it comes to parsing large XML files, but I couldn't find anything about writing large files in Python, except for avoiding multiple I/O operations :(

I think what you need is the yield_per() function, so that you do not have to process all the results at once but can work through them in chunks. This way you can save memory. You can read more about this function at this link.

However, take note that yield_per() may skip some of your query rows; the answer in this question provides a detailed explanation. If you decide you do not want to use yield_per() after reading it, you may refer to all the answers posted on this Stack Overflow question as well.
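As a minimal sketch of the idea (reusing the session and model names from your question; the batch size of 1000 is just an illustrative value), the change boils down to replacing .all() with .yield_per() and iterating over the query:

query = session.query(Trips.Country,
                      Trips.Region,
                      # ... same column list as in the question ...
                      TripsImages.ImageURL) \
    .join(TripsImages) \
    .yield_per(1000)  # fetch rows in batches of 1000 instead of .all()

for count, elem in enumerate(query):
    # build the <item> element exactly as in the original loop; only one
    # batch of rows is kept in memory at a time
    ...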

Another tip for processing a large list is to use yield, so you do not have to load all the entries into memory at once but can process them one by one; a rough sketch of that idea follows below. Hope it helps.
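To make the yield idea concrete, here is a rough sketch that builds one <item> at a time in a generator and hands each element to lxml's incremental etree.xmlfile writer, so neither the full result list nor the full tree is ever held in memory. Note that etree.xmlfile is an extra suggestion on top of the yield tip, and the items() helper plus the reuse of query and target_xml from your code are only illustrative:

from lxml import etree

XMLNS = "{http://base.google.com/ns/1.0}"
NSMAP = {"g": "http://base.google.com/ns/1.0"}

def items(query):
    # generator: yields one finished <item> element at a time
    for count, elem in enumerate(query):
        item = etree.Element("item")
        etree.SubElement(item, XMLNS + "id").text = str(count)
        # ... fill in the remaining g: fields as in the original loop ...
        yield item

# target_xml and query (ideally built with .yield_per()) come from your code
with etree.xmlfile(target_xml, encoding="utf-8") as xf:
    xf.write_declaration()
    with xf.element("rss", nsmap=NSMAP, attrib={"version": "2.0"}):
        with xf.element("channel"):
            # write the channel <title>/<link>/<description> here as before
            for item in items(query):
                xf.write(item)  # serialize each item immediately, then discard it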
