
Python and Memory Consumption

I am looking for a way to keep a memory-heavy program from overloading RAM and CPU. I need to process a LARGE amount of data contained in files: I read the files and process the data in them. The problem is that there are many nested for loops, and a root XML file is being created from all the data processed. The program easily consumes a couple of gigs of RAM after half an hour or so of run-time. Is there something I can do to keep RAM usage from growing so large, or a way to work around it?

Do you really need to keep all of the data from the XML file in memory at once?

Most (all?) XML libraries out there allow you to do iterative parsing, meaning that you keep in memory just a few nodes of the XML file, not the whole file. That is, unless you are building a string containing the XML file yourself without any library, but that is a bit insane. If that is the case, use a library ASAP.
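For example, here is a minimal sketch of iterative parsing with the standard library's xml.etree.ElementTree (lxml.etree exposes the same iterparse API). The filename "huge.xml", the tag name "record", and the process() function are placeholders for your own data and logic:

    import xml.etree.ElementTree as ET

    def process(elem):
        # Placeholder for whatever per-record work your nested loops do.
        pass

    # Stream the document instead of loading it all into memory at once.
    context = ET.iterparse("huge.xml", events=("start", "end"))
    _, root = next(context)  # grab the root element as soon as it opens
    for event, elem in context:
        if event == "end" and elem.tag == "record":
            process(elem)
            root.clear()  # drop already-processed children so memory stays flat

Because each record is discarded right after it is processed, memory use stays roughly constant regardless of how large the input file is.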

The specific code samples presented here might not apply to your project, but consider a few principles—borne out by testing and the lxml documentation—when faced with XML data measured in gigabytes or more:

  • Use an iterative parsing strategy to incrementally process large documents.
  • If searching the entire document in random order is required, move to an indexed XML database.
  • Be extremely conservative in the data that you select. If you are only interested in particular nodes, use methods that select by those names. If you require predicate syntax, try one of the XPath classes and methods available.
  • Consider the task at hand and the comfort level of the developer. Object models such as lxml's objectify or Amara might be more natural for Python developers when speed is not a consideration. cElementTree is faster when only parsing is required.
  • Take the time to do even simple benchmarking. When processing millions of records, small differences add up, and it is not always obvious which methods are the most efficient (a rough benchmarking sketch follows this list).
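As a rough benchmarking sketch, you could time a full parse against an iterparse pass with the standard library's timeit module. The file "sample.xml" and the tag name "record" are assumptions; swap in your own file and whichever libraries you are comparing:

    import timeit

    setup = "import xml.etree.ElementTree as ET"

    # Build the whole tree in memory.
    full_parse = timeit.timeit(
        "ET.parse('sample.xml')", setup=setup, number=10)

    # Stream the file, only counting matching elements.
    iter_parse = timeit.timeit(
        "sum(1 for _, e in ET.iterparse('sample.xml') if e.tag == 'record')",
        setup=setup, number=10)

    print(f"full parse: {full_parse:.3f}s")
    print(f"iterparse:  {iter_parse:.3f}s")

Even a crude comparison like this, run on a representative slice of your data, tells you far more than guessing which approach will be faster.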

If you need to do complex operations on the data, why don't you just put it in a relational database and operate on the data from there? That will usually give you better performance.
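As a minimal sketch of that idea, the records extracted from your files could be loaded into SQLite so that joins, filters and aggregations happen in the database instead of in nested Python loops. The table layout and the records() generator are assumptions standing in for your own data:

    import sqlite3

    def records():
        # Hypothetical generator yielding (name, value) tuples parsed from the files.
        yield ("example", 42.0)

    conn = sqlite3.connect("data.db")
    conn.execute("CREATE TABLE IF NOT EXISTS records (name TEXT, value REAL)")
    conn.executemany("INSERT INTO records VALUES (?, ?)", records())
    conn.commit()

    # The heavy lifting now runs inside SQLite rather than in Python memory.
    for name, total in conn.execute(
            "SELECT name, SUM(value) FROM records GROUP BY name"):
        print(name, total)
    conn.close()

Once the data is in the database, only the rows you query come back into Python, which keeps the working set small even for very large inputs.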
