List of lists: replacing and adding up items of sublists

I have a list of lists, let's say something like this:

tripInfo_csv = [['1','2',6,2], ['a','h',4,2], ['1','4',6,1], ['1','8',18,3], ['a','8',2,1]]

Think of sublists as trips: [start point, end point, number of adults, number of children]

My aim is to get a list where trips with coincident start and end points get their third and fourth values added up. The start and end values should always be numbers from 1 to, let's say, 8. If they are letters instead, those should be replaced with the corresponding number (a=1, b=2, and so on).

This is my code. It works but I'm sure it can be improved. The main issue for me is performance. I have quite a number of lists like this with many more sublists.

dicPoints = {'a':'1','b':'2','c':'3', 'd':'4', 'e':'5', 'f':'6', 'g':'7', 'h':'8'}
def getTrips (trips):
    okTrips = []
    for trip in trips:
        if not trip[0].isdigit():
            trip[0] = dicPoints[trip[0]]
        if not trip[1].isdigit():
            trip[1] = dicPoints[trip[1]]

        if len(okTrips) == 0:
            okTrips.append(trip)
        else:
            for i, stop in enumerate(okTrips):
                if stop[0] == trip[0] and stop[1] == trip[1]:
                    stop[2] += trip[2]
                    stop[3] += trip[3]
                    break
                else:
                    if i == len(okTrips)-1:
                        okTrips.append(trip)

    return okTrips

As eguaio mentioned, the code above has a bug. It should be like this:

import datetime

def getTrips (trips):
    okTrips = []
    print(datetime.datetime.now())
    for trip in trips:
        # Replace letter labels with their numeric equivalents
        if not trip[0].isdigit():
            trip[0] = dicPoints[trip[0]]
        if not trip[1].isdigit():
            trip[1] = dicPoints[trip[1]]

        if len(okTrips) == 0:
            okTrips.append(trip)
        else:
            flag = 0
            for i, stop in enumerate(okTrips):
                if stop[0] == trip[0] and stop[1] == trip[1]:
                    stop[2] += trip[2]
                    stop[3] += trip[3]
                    flag = 1
                    break

            # Append only after the search loop has finished
            if flag == 0:
                okTrips.append(trip)

    return okTrips
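
For what it's worth, the flag variable can probably be avoided with Python's for/else, where the else branch runs only when the loop ends without hitting break. The following is a minimal sketch written for Python 3; the name get_trips_for_else is made up for this illustration, and unlike the code above it builds new sublists instead of modifying the input. It is still a linear scan per trip, so it does not address the performance issue.

```python
dicPoints = {'a': '1', 'b': '2', 'c': '3', 'd': '4',
             'e': '5', 'f': '6', 'g': '7', 'h': '8'}

def get_trips_for_else(trips):  # hypothetical name, not from the original post
    ok_trips = []
    for start, end, adults, children in trips:
        # dict.get falls back to the label itself when it is already numeric
        start = dicPoints.get(start, start)
        end = dicPoints.get(end, end)
        for stop in ok_trips:
            if stop[0] == start and stop[1] == end:
                stop[2] += adults
                stop[3] += children
                break
        else:
            # Reached only if the loop above did not break: no match was found
            ok_trips.append([start, end, adults, children])
    return ok_trips

tripInfo_csv = [['1', '2', 6, 2], ['a', 'h', 4, 2], ['1', '4', 6, 1],
                ['1', '8', 18, 3], ['a', '8', 2, 1]]
print(get_trips_for_else(tripInfo_csv))
```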

Thanks to eguaio's answer I got an improved version that I want to share. This is my script based on his answer. My data and requirements are now more complex than what I was first told, so I made a few changes.

CSV files look like this:

LineT;Line;Route;Day;Start_point;End_point;Adults;Children;First_visit
SM55;5055;3;Weekend;15;87;21;4;0 
SM02;5002;8;Weekend;AF3;89;5;0;1 
...

Script:

import os, csv, psycopg2

folder = "F:/route_project/routes"

# Day type
dicDay = {'Weekday':1,'Weekend':2,'Holiday':3}

# Dictionary with the start and end points of each route
#  built from a PostgreSQL table (with columns: line_route, start, end)
conn = psycopg2.connect (database="test", user="test", password="test", host="###.###.#.##")
cur = conn.cursor()
cur.execute('select id_linroute, start_p, end_p from route_ends')
recs = cur.fetchall()
dicPoints = {rec[0]: rec[1:] for rec in recs}

# When point labels are text, replace them with a number label in dicPoints
# Text is not important: they are special text labels for start and end
#  of routes (for athletes), so we replace them with labels for start or
#  the end of each route
def convert_point(line, route, point, i):
    if point.isdigit():
        return point
    else:
        return dicPoints["%s_%s" % (line,route)][i]

# Points with text labels mean athletes made the whole or part of this route,
#  we keep them as adults but also keep this number as an extra value
#  for further purposes
def num_athletes(start_p, end_p, adults):
    if not start_p.isdigit() or not end_p.isdigit():
        return adults
    else:
        return 0

# Data is taken from CSV files in subfolders
for root, dirs, files in os.walk(folder):
    for file in files:
        if file.endswith(".csv"):
            file_path = (os.path.join(root, file))
            with open(file_path, 'rb') as csvfile:
                rows = csv.reader(csvfile, delimiter=';', quotechar='"')
                # Skips the CSV header row
                rows.next()
                # linT is not used, yet it's found in every CSV file
                # There's an unused last column in every file; I take advantage of it
                #  to store the number of athletes in the generator
                gen =((lin, route, dicDay[tday], convert_point(lin,route,s_point,0), convert_point(lin,route,e_point,1), adults, children, num_athletes(s_point,e_point,adults)) for linT, lin, route, tday, s_point, e_point, adults, children, athletes in rows)
                dicCSV = {}
                for lin, route, tday, s_point, e_point, adults, children, athletes in gen:
                    visitors = dicCSV.get(("%s_%s_%s" % (lin,route,s_point), "%s_%s_%s" % (lin,route,e_point), tday), (0, 0, 0))
                    dicCSV[("%s_%s_%s" % (lin,route,s_point), "%s_%s_%s" % (lin,route,e_point), tday)] = (visitors[0] + int(adults), visitors[1] + int(children), visitors[2] + int(athletes))

for k,v in dicCSV.iteritems():
    print k, v
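
To try the aggregation step without a database or a folder of files, something like the following sketch may help. It is written for Python 3 and rests on several assumptions: the third CSV row is invented so that one key gets merged, the dicPoints entry for route 5002_8 is a made-up stand-in for the PostgreSQL query, and the dictionary key is a plain tuple rather than the formatted strings used in the script.

```python
import csv
import io

# Sample rows in the layout shown above; the third row is invented for the demo
sample = """LineT;Line;Route;Day;Start_point;End_point;Adults;Children;First_visit
SM55;5055;3;Weekend;15;87;21;4;0
SM02;5002;8;Weekend;AF3;89;5;0;1
SM55;5055;3;Weekend;15;87;4;1;0
"""

dicDay = {'Weekday': 1, 'Weekend': 2, 'Holiday': 3}
# Hypothetical route ends standing in for the PostgreSQL query result
dicPoints = {'5002_8': ('10', '89')}

def convert_point(line, route, point, i):
    # Text labels are replaced by the start (i=0) or end (i=1) of the route
    return point if point.isdigit() else dicPoints["%s_%s" % (line, route)][i]

totals = {}
reader = csv.reader(io.StringIO(sample), delimiter=';')
next(reader)  # skip the header row
for _lin_t, lin, route, tday, s_point, e_point, adults, children, _last in reader:
    # Adults on a trip with a text label are also counted as athletes
    athletes = int(adults) if not (s_point.isdigit() and e_point.isdigit()) else 0
    key = (lin, route, convert_point(lin, route, s_point, 0),
           convert_point(lin, route, e_point, 1), dicDay[tday])
    a, c, t = totals.get(key, (0, 0, 0))
    totals[key] = (a + int(adults), c + int(children), t + athletes)

for key, value in totals.items():
    print(key, value)
```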

To handle this more efficiently it's best to sort the input list by the start and end points, so that rows which have matching start and end points are grouped together. Then we can easily use the groupby function to process those groups efficiently.

from operator import itemgetter
from itertools import groupby

tripInfo_csv = [
    ['1', '2', 6, 2], 
    ['a', 'h', 4, 2], 
    ['1', '4', 6, 1], 
    ['1', '8', 18, 3], 
    ['a', '8', 2, 1],
]

# Used to convert alphabetic point labels to numeric form
dicPoints = {v:str(i) for i, v in enumerate('abcdefgh', 1)}

def fix_points(seq):
    return [dicPoints.get(p, p) for p in seq]

# Ensure that all point labels are numeric
for row in tripInfo_csv:
    row[:2] = fix_points(row[:2])

# Sort on point labels
keyfunc = itemgetter(0, 1)
tripInfo_csv.sort(key=keyfunc)

# Group on point labels and sum corresponding adult & child numbers
newlist = []
for k, g in groupby(tripInfo_csv, key=keyfunc):
    g = list(g)
    row = list(k) + [sum(row[2] for row in g), sum(row[3] for row in g)]
    newlist.append(row)

# Print the condensed list
for row in newlist:
    print(row)

Output

['1', '2', 6, 2]
['1', '4', 6, 1]
['1', '8', 24, 6]
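
The sort step matters because itertools.groupby only merges consecutive items that share a key. A small sketch of what happens with and without sorting (the rows list is an invented sample):

```python
from itertools import groupby
from operator import itemgetter

rows = [['1', '8', 4, 2], ['1', '2', 6, 2], ['1', '8', 18, 3]]
keyfunc = itemgetter(0, 1)

# Without sorting, the two ('1', '8') rows are not adjacent, so they form two groups
unsorted_keys = [k for k, _ in groupby(rows, key=keyfunc)]
print(unsorted_keys)

# After sorting, equal keys are adjacent and collapse into a single group
sorted_keys = [k for k, _ in groupby(sorted(rows, key=keyfunc), key=keyfunc)]
print(sorted_keys)
```

Note that the labels are compared as strings here, so '10' would sort before '2'. That does not affect the grouping, only the order of the output rows.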

The following gives much better times than yours for large lists with much merging: 2 seconds vs. 1 minute for tripInfo_csv*500000. Using a dict keyed by (start, end) gives almost linear complexity, because dict lookups take constant time on average. IMHO it is also more elegant. Notice that tg is a generator, so no significant time or memory is used when it is created.

def newGetTrips(trips):

    def convert(l):
        return l if l.isdigit() else dicPoints[l]

    tg = ((convert(a), convert(b), c, d) for a, b, c, d in trips)
    okt = {}
    for a, b, c, d in tg:
        # a trick to get (0,0) as default if (a,b) is not a key of the dictionary yet
        t = okt.get((a,b), (0,0)) 
        okt[(a,b)] = (t[0] + c, t[1] + d)
    # items() works on both Python 2 and Python 3 (iteritems() is Python 2 only)
    return [[a,b,c,d] for (a,b), (c,d) in okt.items()]

Besides, your function alters the trips list as a side effect, whereas this function leaves it untouched. Also, you have a bug: you are summing twice the first item considered for each (start, end) pair (but not for the first case). I could not find the reason, but when running the example with your getTrips I get:

[['1', '2', 6, 2], ['1', '8', 28, 8], ['1', '4', 12, 2]]

and with newGetTrips I get:

[['1', '8', 24, 6], ['1', '2', 6, 2], ['1', '4', 6, 1]]
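
The doubled totals are consistent with the list being appended to while it is still being enumerated: once a trip is appended on the last index, the loop goes on to visit the freshly appended trip, which matches itself and gets added to itself. A minimal sketch that reproduces this, assuming the labels are already numeric (buggy_merge is a made-up name that mirrors the loop in the question, except that it copies each trip):

```python
def buggy_merge(trips):  # hypothetical name; mirrors the loop in the question
    ok_trips = []
    for trip in trips:
        trip = list(trip)  # copy so the caller's data stays unchanged
        if not ok_trips:
            ok_trips.append(trip)
            continue
        for i, stop in enumerate(ok_trips):
            if stop[0] == trip[0] and stop[1] == trip[1]:
                stop[2] += trip[2]
                stop[3] += trip[3]
                break
            elif i == len(ok_trips) - 1:
                # Appending here grows the list being iterated, so the next
                # iteration sees `trip` itself, matches it, and doubles it
                ok_trips.append(trip)
    return ok_trips

print(buggy_merge([['1', '2', 6, 2], ['1', '8', 4, 2]]))
```

The very first trip escapes the doubling because it is appended outside the inner loop.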

See if this helps

trips = [['1','2',6,2], ['a','h',4,2], ['1','2',6,1], ['1','8',18,3], ['a','h',2,1]]

# To get the equivalent value
def x(n):
    if '1' <= n <= '8':
        return int(n)
    # 'a' maps to 1, 'b' to 2, and so on
    return ord(n) - ord('a') + 1

# To group lists with similar start and end points
from collections import defaultdict


groups = defaultdict(list)

for trip in trips:
    # Grouping based on start and end point.
    groups[(x(trip[0]), x(trip[1]))].append(trip)

result = []
for (start, end), group in groups.items():
    # Use the normalized key so letter labels come out as numbers
    adults = sum(trip[2] for trip in group)
    children = sum(trip[3] for trip in group)
    result.append([start, end, adults, children])

print(result)

Let's say start and end points take values between 0 and n.

Then the result 'okTrips' has at most n^2 elements, so the second loop in your function has complexity O(n^2). It is possible to reduce this to O(n) if space complexity is not a problem for you.

First, create a dict that contains n lists, such that the k-th sublist contains the trips starting at 'k'.

When you search for other trips with the same start and end points, you then only need to search the corresponding sublist instead of all elements.

The idea comes from sparse matrix storage techniques. I have not been able to test the following code.

The code is as follows:

dicPoints = {'a':'1','b':'2','c':'3', 'd':'4', 'e':'5', 'f':'6', 'g':'7', 'h':'8'}

def getTrips (trips):
    # One bucket per start point; each bucket holds the merged trips from that start
    Temp = {'1':[],'2':[],'3':[],'4':[],'5':[],'6':[],'7':[],'8':[]}
    for trip in trips:
        if not trip[0].isdigit():
            trip[0] = dicPoints[trip[0]]
        if not trip[1].isdigit():
            trip[1] = dicPoints[trip[1]]

        # Only the bucket for this start point has to be searched
        for stop in Temp[trip[0]]:
            if stop[1] == trip[1]:
                stop[2] += trip[2]
                stop[3] += trip[3]
                break
        else:
            # No trip with this end point yet (also covers the empty bucket)
            Temp[trip[0]].append(trip)

    okTrips = []
    for key in Temp:
        okTrips = okTrips + Temp[key]
    return okTrips
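
Taking the bucket idea one step further, each bucket could itself be a dict keyed by end point, so that no bucket has to be scanned at all. This is only a sketch under that assumption, written for Python 3; merge_by_start and its variable names are invented here, and it builds new sublists instead of modifying the input.

```python
dicPoints = {'a': '1', 'b': '2', 'c': '3', 'd': '4',
             'e': '5', 'f': '6', 'g': '7', 'h': '8'}

def merge_by_start(trips):  # hypothetical helper, not part of the answer above
    buckets = {}  # start point -> {end point -> [adults, children]}
    for start, end, adults, children in trips:
        start = dicPoints.get(start, start)
        end = dicPoints.get(end, end)
        # setdefault creates the missing bucket or counter pair on first use
        totals = buckets.setdefault(start, {}).setdefault(end, [0, 0])
        totals[0] += adults
        totals[1] += children
    return [[start, end, adults, children]
            for start, ends in buckets.items()
            for end, (adults, children) in ends.items()]

trips = [['1', '2', 6, 2], ['a', 'h', 4, 2], ['1', '4', 6, 1],
         ['1', '8', 18, 3], ['a', '8', 2, 1]]
print(merge_by_start(trips))
```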
