
How can I load and merge several .txt files in a memory-efficient way in Python?

I am trying to read several (>1000) .txt files (approx. 700 MB each on average; delimited, header-less, CSV-like, though without commas or another known separator) and merge them into one pandas dataframe, so that I can then run an analysis on the entire dataset.

I am running this via SSH on an HPC server, on which I requested 50 GB RAM, 1 node, and 1 task per node (all of that was just a wild guess, as I have never done this before).

So far, my idea was this:

import glob
import pandas as pd

# all_files lists the paths of the >1000 .txt files to merge
all_files = glob.glob('path/to/files/*.txt')

li = []
for filename in all_files:
    # sep=None with engine='python' lets pandas sniff each file's delimiter
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    li.append(df)

df = pd.concat(li, axis=0, ignore_index=True)

but after a few hours, having loaded approximately the 360th file, the process gets killed and I get the error message:

numpy.core._exceptions.MemoryError: Unable to allocate 1.11 GiB for an array with shape (10, 14921599) and data type float64

Do you have any idea how to load and merge the data in a more memory-efficient way? (I assume just requesting more RAM still would not get me through the entire set of .txt files?)

Also, I would like to save the resulting dataframe in a memory-efficient way afterwards. Do you know the best way/format (CSV?) to do that?

Any help would be much appreciated!

Since, as you said, you have so many files and they need so much memory, I suggest loading the files one by one and saving them all into a single file in append mode (appending each file's data to the previously saved data), like this:

import pandas as pd

for filename in all_files:
    df = pd.read_csv(filename, sep=None, header=0, engine='python')
    # mode='a' appends each file's rows onto one growing CSV on disk
    df.to_csv('./data.csv', header=None, index=False, mode='a')
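Note that mode='a' only ever appends, so delete any existing ./data.csv before re-running the loop, otherwise the same rows get written twice.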

After saving all of the files into this single file, you can read it back as one dataframe like this:

df = pd.read_csv('./data.csv', header=None, index_col=False)

After that, if you still have memory issues reading this file, you can use a chunked reader like this:

chunksize = 10 ** 6  # rows per chunk
with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=chunksize) as reader:
    for chunk in reader:
        ...  # do what you want with each chunk here
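For instance, here is a minimal sketch that computes an exact mean over all rows while never holding more than one chunk in RAM (the column index 0 is an assumption about your layout):

import pandas as pd

n, total = 0, 0.0
with pd.read_csv('./data.csv', header=None, index_col=False, chunksize=10 ** 6) as reader:
    for chunk in reader:
        col = chunk[0]        # hypothetical column of interest
        n += len(col)
        total += col.sum()

print(total / n)              # exact mean of the full dataset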

Q: "How can I... Any help would be much appreciated!"

A:
Best to follow the laws of the ECONOMY-of-COMPUTING:

Your briefly sketched problem has, without question, immense "setup"-costs, with an unspecified amount of useful work to be computed on an unspecified HPC ecosystem.

Even without hardware & rental details ( the devil is always hidden in the detail(s), & one can easily pay hilarious amounts of money trying to make a ( hiddenly ) "shared" platform deliver any improved computing performance - many startups have experienced this on voucher-sponsored promises, the more so if the overall computing strategy was poorly designed ),

I cannot resist quoting the so-called 1st Etore's Law of the Evolution of Systems' Dynamics:

If we open a can of worms,
the only way to put them back
is to use a bigger can

Closing our eyes so as not to see the accumulating inefficiencies is the worst sin of sins, as devastatingly exponential growth of all costs ( time & resources, incl. energy consumption ) is common to meet in such complex systems, often built of many levels of stacked inefficiencies.


ELEMENTARY RULES-of-THUMB... how much we pay in [TIME]

Sorry if these were known to you beforehand; I am just trying to build some common ground, as a platform to lay further argumentation on, rock-solid. More details are here, and this is only a needed beginning, as more problems will definitely come from any real-world O( M^x * N^y * ... )-scaling related issues in further modelling.

                0.1 ns - CPU NOP - a DO-NOTHING instruction
                0.5 ns - CPU L1 dCACHE reference           ( 1st introduced in the late '80s )
                1   ns - speed-of-light (a photon) travels a 1 ft (30.5cm) distance -- will stay, throughout any foreseeable future :o)
              3~4   ns - CPU L2  CACHE reference           (2020/Q1)
                7   ns - CPU L2  CACHE reference
               19   ns - CPU L3  CACHE reference           (2020/Q1 considered slow on 28c Skylake)
______________________on_CPU______________________________________________________________________________________
               71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
              100   ns - own DDR MEMORY reference
              135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
              325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
            2,500   ns - Read  10 kB sequentially from  MEMORY------ HPC-node
           25,000   ns - Read 100 kB sequentially from  MEMORY------ HPC-node
          250,000   ns - Read   1 MB sequentially from  MEMORY------ HPC-node
        2,500,000   ns - Read  10 MB sequentially from  MEMORY------ HPC-node
       25,000,000   ns - Read 100 MB sequentially from  MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
      250,000,000   ns - Read   1 GB sequentially from  MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
    2,500,000,000   ns - Read  10 GB sequentially from  MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
   25,000,000,000   ns - Read 100 GB sequentially from  MEMORY------ HPC-node (abstracted from shared physical RAM-I/O-channels)
_____________________________________________________________________________own_CPU/DDR__________________________
 |   |   |   |   |
 |   |   |   | ns|
 |   |   | us|
 |   | ms|
 |  s|
h|
          500,000   ns - Round trip within the same DataCenter ----- HPC-node / HPC-storage latency on each access
       20,000,000   ns - Send   2 MB over 1 Gbps  NETWORK
      200,000,000   ns - Send  20 MB over 1 Gbps  NETWORK
    2,000,000,000   ns - Send 200 MB over 1 Gbps  NETWORK
   20,000,000,000   ns - Send   2 GB over 1 Gbps  NETWORK
  200,000,000,000   ns - Send  20 GB over 1 Gbps  NETWORK
2,000,000,000,000   ns - Send 200 GB over 1 Gbps  NETWORK
____________________________________________________________________________via_LAN_______________________________
      150,000,000   ns - Send a NETWORK packet CA -> Netherlands
____________________________________________________________________________via_WAN_______________________________
       10,000,000   ns - DISK seek spent to start file-I/O on spinning disks on any next piece of data seek/read
       30,000,000   ns - DISK   1 MB sequential READ from a DISK
      300,000,000   ns - DISK  10 MB sequential READ from a DISK
    3,000,000,000   ns - DISK 100 MB sequential READ from a DISK
   30,000,000,000   ns - DISK   1 GB sequential READ from a DISK
  300,000,000,000   ns - DISK  10 GB sequential READ from a DISK
3,000,000,000,000   ns - DISK 100 GB sequential READ from a DISK
______________________on_DISK_______________________________________________own_DISK______________________________
 |   |   |   |   |
 |   |   |   | ns|
 |   |   | us|
 |   | ms|
 |  s|
h|

Given these elements, the end-to-end computing strategy may and shall be improved.


AS-WAS STATE
... where the crash prevented any computing at all

A naive figure says more than a thousand words:

localhost          |
:       file-I/O ~ 25+GB SLOWEST/EXPENSIVE
:         1st time 25+GB file-I/O-s
:                  |
:                  | RAM       |
:                  |           |
+------+           |IOIOIOIOIOI|
|.CSV 0|           |IOIOIOIOIOI|
|+------+          |IOIOIOIOIOI|
||.CSV 1|          |IOIOIOIOIOI|
||+------+         |IOIOIOIOIOI|-> local ssh()-encrypt+encapsulate-process
|||.CSV 2|         |IOIOIOIOIOI|               25+GB of .CSV
+||      |         |IOIOIOIOIOI|~~~~~~~|
 ||      |         |IOIOIOIOIOI|~~~~~~~|
 +|      |         |           |~~~~~~~|
  |      |         |           |~~~~~~~|
  +------+         |           |~~~~~~~|
  ...              |           |~~~~~~~|-> LAN  SLOW
   ...             |                   |   WAN  SLOWER
    ...            |                   |   transfer of 30+GB to "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
    +------+       |                   |               |                   30+GB           file-I/O ~ 25+GB SLOWEST/EXPENSIVE
    |.CSV 9|       |                   |~~~~~~~~~~~~~~~|                                     2nd time 25+GB file-I/O-s
    |+------+      |                   |~~~~~~~~~~~~~~~|
    ||.CSV 9|      |                   |~~~~~~~~~~~~~~~|
    ||+------+     |                   |~~~~~~~~~~~~~~~|
    |||.CSV 9|     |                   |~~~~~~~~~~~~~~~|
    +||     9|     |                   |~~~~~~~~~~~~~~~|
     ||     9|     |                   |~~~~~~~~~~~~~~~|
     +|      |     |                   |~~~~~~~~~~~~~~~|
      |      |     |                   |~~~~~~~~~~~~~~~|-> file-I/O into python
      +------+     |                                   |   all .CSV file to RAM ~ 25+GB SLOWEST/EXPENSIVE
                   |                                   |***|                        3rd time 25+GB file-I/O-s
                   |                                   |   RAM .CSV to df CPU work
                   |                                   |***|           df to LIST new RAM-allocation + list.append( df )-costs
                   |                                   |***|                    + 25+GB
                   |                                   |***|
many hours         |                                   |***|
  [SERIAL] flow ...|                                   |***|
/\/\/\/\/\/\/\/\/\/|\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\|***|/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/\/
                   |                       crashed     |***|
                   |                    on about       |***|
                   |                       360-th file |***|
                   |                                   |***|->RAM ~ 50~GB with a LIST of all 25+GB dataframes held in LIST
                   |                                       |  CPU +mem-I/O costs LIST to new 25+GB dataframe RAM-allocation & DATA-processing
                   |                                       |~~~~~| mem-I/O RAM|              :: GB
                   |                                       |~~~~~| mem-I/O    |RAM flow of ~ 50+GB over only 2/3/? mem-I/O HW-channels
                   |                                       |~~~~~|                                      only if "HPC"
                   |                                       |~~~~~|                                           is *NOT* a "shared"-rental of cloud HW,
                   |                                       |~~~~~|                                                    remarketed as an "HPC"-illusion
                   |                                       |~~~~~|
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                    |                                            <...some amount of some useful  work --"HPC"-processing the ~ 25+GB dataframe...>
                    |                                            <...some amount of some useful  work                                         ...>
                    |                                            <...some amount of some useful  work         the more                        ...>
                    |                                            <...some amount of some useful  work         the better                      ...>
                    |                                            <...some amount of some useful  work             as                          ...>
                    |                                            <...some amount of some useful  work             it                          ...>
                    |                                            <...some amount of some useful  work             dissolves to AWFULLY        ...>
                    |                                            <...some amount of some useful  work                          HIGH           ...>
                    |                                            <...some amount of some useful  work                          SETUP COSTS    ...>
                    |                                            <...some amount of some useful  work                                         ...>
                    |                                            <...some amount of some useful  work --"HPC"-processing the ~ 25+GB dataframe...>
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                             :::::::::::::?
                   |                                                          |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
                   |                                                          |~~~~~|          4th time 25+GB file-I/O-s
                   |                                                          |~~~~~|
                   |                                                          |~~~~~|->file left on remote storage (?)
                   |                                                                |
                   |                                                               O?R
                   |                                                                |
                   |                                                                |-> file-I/O ~ 25+GB SLOWEST/EXPENSIVE
                   |                                                                |~~~~~|          5th time 25+GB file-I/O-s
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|
                   |                                                                |~~~~~|-> RAM / CPU ssh()-encrypt+encapsulate-process
                   |                                                                      |????????????|      25+GB of results for repatriation
                   |                                                                      |????????????|                        on localhost
                   |                                                                      |????????????|
                   |                                                                      |????????????|
                   |                                                                      |????????????|-> LAN  SLOW
                   |                                                                                   |   WAN  SLOWER
                    |                                                                                   |   transfer of 30+GB from "HPC" ( ssh()-decryption & file-I/O storage-costs omitted for clarity )
                   |                                                                                   |               |                     30+GB           file-I/O ~ 25+GB SLOWEST/EXPENSIVE
                   |                                                                                   |~~~~~~~~~~~~~~~|                                       6th time 25+GB file-I/O-s
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
                   |                                                                                   |~~~~~~~~~~~~~~~|
 SUCCESS ?         |                                                                                   |~~~~~~~~~~~~~~~|-> file transferred back and stored on localhost storage
    after          |
          how many |
          failed   |
          attempts |
   having          |
          how high |
          recurring|
          costs    |
          for any  |
              next |
              model|
          recompute|
          step(s)  |
                   |
                   |
All                |
that               |
( at what overall  |
    [TIME]-domain  |
   & "HPC"-rental  |
           costs ) |

Tips:

  • review and reduce, where possible, expensive data-item representations ( avoid int64 where 8 bits are enough; packed bitmaps can help a lot ) - a dtype sketch follows this list
  • precompute on localhost all items that can be precomputed ( avoiding repetitive steps )
  • join the "reduced" CSV files into a single input, using a trivial O/S command ( e.g. cat )
  • compress all data before transport ( an order of magnitude or more is commonly saved on text data )
  • prefer algorithms formulated so as to stream-process items along the data-flow, i.e. not waiting to load everything into RAM before computing an average or similar trivially on-the-fly stream-computable values ( like .mean(), .sum(), .min(), .max() or even .rsi(), .EMA(), .TEMA(), .BollingerBands() and many more alike ) - stream-computing formulations reduce RAM allocations, can & shall be pre-computed ( once ), and minimise the [SERIAL] one-after-another processing-pipeline latency; a streaming sketch follows this list
  • if indeed in need of pandas and fighting the physical-RAM ceiling, try the smart numpy tools instead, where all the array syntax & methods remain the same, yet numpy can, by design, work without moving all data at once from disk into physical RAM ( using this has been my life-saving trick ever since, the more so when running many-model simulations & HyperParameterSPACE optimisations on a few tens of GB of data on 32-bit hardware )
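A minimal dtype sketch for the first tip ( the column indices and dtypes are assumptions; map them onto your real columns ):

import numpy as np
import pandas as pd

# assumed layout: column 0 is a measurement, columns 1 & 2 are small flags
dtypes = {0: np.float32, 1: np.int8, 2: np.int8}
df = pd.read_csv('one_of_the_files.txt', sep=None, header=None,
                 engine='python', dtype=dtypes)

# compare against the float64 / int64 defaults pandas would otherwise use
print(df.memory_usage(deep=True).sum(), 'bytes')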
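And a minimal streaming sketch for the stream-computing tip ( the file name, the column index, and the smoothing factor alpha are illustrative assumptions ):

import pandas as pd

alpha = 0.1   # assumed smoothing factor
ema = None    # a single running value is all the state we ever keep

with pd.read_csv('./data.csv', header=None, chunksize=10 ** 6) as reader:
    for chunk in reader:
        for x in chunk[0]:   # hypothetical column of interest; scalar loop kept for clarity
            ema = x if ema is None else alpha * x + (1 - alpha) * ema

print(ema)    # EMA of the whole dataset, computed without ever holding it in RAM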

For more details on going in the direction of RAM-protecting, memory-mapped np.ndarray processing, with all the smart numpy-vectorised and other high-performance-tuned tricks, start with the documentation itself:

>>> print( np.memmap.__doc__ )
Create a memory-map to an array stored in a *binary* file on disk.

Memory-mapped files are used for accessing small segments of large files
on disk, without reading the entire file into memory. NumPy's memmap's
are array-like objects. (...)
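A minimal sketch, assuming the merged data has first been dumped into one *binary* file of float64 values ( the file name data.bin is an illustrative assumption; the shape re-uses the one reported in the MemoryError above ):

import numpy as np

rows, cols = 10, 14921599   # illustrative shape, taken from the MemoryError message
mm = np.memmap('data.bin', dtype=np.float64, mode='r', shape=(rows, cols))

# slicing touches only the disk pages it needs,
# never pulling the whole array into physical RAM at once
print(mm[0, :].mean())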
