
In Python, how to map a giant list to objects efficiently?

I'm parsing huge XML files (>400 MB, ~7M lines) called FCD files, which are output from the SUMO road traffic simulator. My goal is to get the locations over time for each car.

An example FCD file looks like this:

<fcd-export>
    <timestep time="0.00">
        <vehicle id="flow_0.0" x="605.79" y="1142.59"/>
        <vehicle id="flow_1.0" x="1911.72" y="2154.71"/>
        <vehicle id="flow_3.0" x="1907.24" y="2163.97"/>
    </timestep>
    <timestep time="0.10">
        <vehicle id="flow_0.0" x="605.81" y="1142.61"/>
        <vehicle id="flow_1.0" x="1911.70" y="2154.69"/>
        <vehicle id="flow_3.0" x="1907.22" y="2163.95"/>
    </timestep>
    <timestep time="0.20">
        <vehicle id="flow_0.0" x="605.85" y="1142.64"/>
        <vehicle id="flow_1.0" x="1911.66" y="2154.66"/>
        <vehicle id="flow_3.0" x="1907.18" y="2163.92"/>
    </timestep>
</fcd-export>

I'm parsing it into a list of dicts of the form {car_id, time, x, y} using the lxml and multiprocessing libraries, which works fine and takes ~30 s for the 36,000 timesteps (~7M lines) in the XML file. I attach the parse_fcd_data_parallel() function at the bottom. The resulting list has 6.8M items.

Now I need to map those {car_id, time, x, y} items so that I have the locations over time for each car. I created simple classes to store that data:

from dataclasses import dataclass, field
from typing import List

@dataclass
class CarInfo:
    car_id: str
    time_locations: List['TimeLocation'] = field(default_factory=list)

@dataclass
class TimeLocation:
    time: float
    x: float
    y: float

I tried to do the mapping using the following code:

import multiprocessing as mp
from typing import List

def extract_car_infos_parallel(car_time_location_items: List[dict]) -> List[CarInfo]:
    # Collect the unique car ids present in the parsed items.
    car_ids = set(map(lambda item: item['car_id'], car_time_location_items))

    # One task per car id; every task receives the full item list as an argument.
    pool = mp.Pool(mp.cpu_count())
    car_infos = pool.starmap(extract_time_location_items_for_car, [(car_id, car_time_location_items) for car_id in car_ids])

    pool.close()
    pool.join()

    return car_infos

def extract_time_location_items_for_car(car_id: str, all_items: List[dict]) -> CarInfo:
    # Scan the whole list and keep only the items that belong to this car.
    car = CarInfo(car_id)
    items_for_car = list(filter(lambda item: item['car_id'] == car_id, all_items))
    car.time_locations = [TimeLocation(item['time'], item['x'], item['y']) for item in items_for_car]

    return car

The code runs for about 15 minutes and then throws a BrokenPipeError. I tried changing the list of dicts {car_id, time, x, y} to a list of lists with those values and got the same result.

How can I fix this to get rid of the BrokenPipeError and speed it up?

PS: This is the code for parsing the FCD XML files:

from lxml.etree import XMLParser, parse
import multiprocessing as mp
from typing import List

def parse_fcd_data_parallel(fcd_file: str) -> List[dict]:

    p = XMLParser(huge_tree=True)
    xml_data = parse(fcd_file, parser=p)
    fcd_data = xml_data.getroot()

    pool = mp.Pool(mp.cpu_count())

    results = pool.map(parse_fcd_timestep, [timestep for timestep in fcd_data])

    pool.close()
    pool.join()

    flatten_results = [item for sublist in results for item in sublist]
    return flatten_results

def parse_fcd_timestep(timestep) -> List[dict]:
    car_time_location_items: List[dict] = []

    time_stamp = timestep.get('time')

    for raw_car_info in timestep:
        car_id = raw_car_info.get('id')
        pos_x = raw_car_info.get('x')
        pos_y = raw_car_info.get('y')

        car_time_location_items.append({'car_id': car_id, 'time': time_stamp, 'x': pos_x, 'y': pos_y})

    return car_time_location_items

The problem comes from the inefficient algorithm and data structures. Parallelizing the operation does not help much.

On one hand, the CPython GIL prevents you from efficiently using multiple threads for this code. On the other hand, multiprocessing will certainly not speed up the code because of the inter-process communication: a lot of CPython objects need to be serialized and sent to the workers (it should actually be slower and less flexible).

The inefficiency certainly comes from the line list(filter(lambda item: item['car_id'] == car_id, all_items)), which iterates over all the items for each searched car. This results in a quadratic O(m·n) execution time, where m is the size of the list (6,800,000 items) and n is the number of cars (probably several hundred).

A much more efficient solution is to use a Pandas dataframe instead of a list of dicts (which have a huge overhead in memory and execution time) and perform a groupby operation, which runs in quasi-linear time (because it uses either a sort or a hash-based method to group the cars). Pandas uses NumPy internally, which packs the items compactly in memory and uses native integers/floats. This should be several orders of magnitude faster.
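
For illustration, a minimal sketch of that approach could look like the code below. It assumes the list of dicts produced by parse_fcd_data_parallel() and the CarInfo/TimeLocation classes from the question; the function name extract_car_infos_pandas and the conversion back into CarInfo objects are just illustrative, not a definitive implementation.

import pandas as pd
from typing import List

def extract_car_infos_pandas(car_time_location_items: List[dict]) -> List[CarInfo]:
    # One row per parsed item; columns come from the dict keys ('car_id', 'time', 'x', 'y').
    df = pd.DataFrame(car_time_location_items)

    # The XML attributes were read as strings, so convert them to native floats
    # (this is what lets NumPy pack them compactly in memory).
    df[['time', 'x', 'y']] = df[['time', 'x', 'y']].astype(float)

    # groupby gathers the rows of each car in quasi-linear time,
    # instead of scanning the full list once per car.
    car_infos: List[CarInfo] = []
    for car_id, group in df.groupby('car_id', sort=False):
        car = CarInfo(car_id)
        car.time_locations = [TimeLocation(t, x, y)
                              for t, x, y in zip(group['time'], group['x'], group['y'])]
        car_infos.append(car)

    return car_infos

Depending on what you do with the data afterwards, it may be even faster to keep everything in the DataFrame (or in per-car sub-frames) instead of rebuilding millions of small TimeLocation objects, since creating that many Python objects is itself a significant cost.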
