Python: Combine data based on a key column

I have data containing both parent and child records in the same text file (with two header rows). The parent records are departments, the child records are employees, and dno is the join column.

dno,dname,loc
10,FIN,USA
20,HR,EUR
30,SEC,AUS
empno,ename,sal,dno
2100,AMY,1001,10
2200,JENNY,2001,10
3100,RINI,3001,20
4100,EMP4,4001,30
4200,EMP5,5001,30
4300,EMP6,6001,30

I would like to combine the two sets of records on dno and produce output like the following:

empno,ename,sal,dno,dname,loc
2100,AMY,1001,10,FIN,USA
2200,JENNY,2001,10,FIN,USA
3100,RINI,3001,20,HR,EUR
4100,EMP4,4001,30,SEC,AUS
4200,EMP5,5001,30,SEC,AUS
4300,EMP6,6001,30,SEC,AUS

Python version - 2.6

I have tried the following solution:

import csv

efile = "efile.csv"  # placeholder path to the input file

dept_lst = []
emp_lst = []

with open(efile,'rb') as e_file:
    reader = csv.reader(e_file,delimiter=",")
    for row in reader:
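        # Note: this condition is always true, so both header rows are kept.
        # The two header rows then join on the literal 'dno' and produce
        # the header line of the output.
        if ((row[0] != 'dno' and row[0] != 'dname' ) or
            (row[0] != 'empno' and row[0] != 'ename')):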
            if len(row) == 3:
                dept_lst.append(row)
            elif len(row) == 4:
                emp_lst.append(row)

result = [ e + [d[1],d[2]] for e in emp_lst for d in dept_lst if e[3] == d[0]]

for line in result:
    print ",".join(line)

Question: The original data is more than 1 GB. This seems to work, but I am not sure it is an optimal solution.

I would like to know whether there are more efficient ways of handling this scenario using only the Python 2.6 standard library.

Consider reading the first part (the departments) and building a dictionary keyed by dno, then switching to the second part (the employees) and looking each row up in that dictionary. This replaces the nested scan over both lists with a single pass over the file. Also, consider using a CSV writer to write each processed row immediately instead of collecting the rows in a list.

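import csv

dno = {}
# Nested "with" blocks are used because the "with a, b:" form needs Python 2.7+.
# In Python 2, the csv module expects files opened in binary mode ("rb"/"wb").
with open("efile.csv", "rb") as e_file:
    with open("ofile.csv", "wb") as o_file:
        reader = csv.reader(e_file)
        writer = csv.writer(o_file)
        dept_header = next(reader)  # dno,dname,loc
        for row in reader:
            if row[0] == 'empno':
                # The second part begins: write the combined header first
                writer.writerow(row + dept_header[1:])
                break
            dno[row[0]] = row[1:]  # dno -> [dname, loc]
        for row in reader:
            writer.writerow(row + dno[row[3]])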
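
The same idea can be sketched as a generator, which makes it easy to try on a small in-memory sample before running it on the full file. This is only a sketch: the name `join_on_dno`, the `missing` parameter, and the choice to emit empty fields for an unknown dno (rather than raising KeyError) are assumptions for illustration, not part of the question or the answer above.

```python
import csv


def join_on_dno(rows, missing=("", "")):
    """Yield employee rows extended with dname and loc.

    `rows` is any iterable of parsed CSV rows (lists of strings) laid out
    as in the question: department section first, then employee section.
    """
    rows = iter(rows)
    dept_header = next(rows, None)        # ['dno', 'dname', 'loc']
    if dept_header is None:               # empty input: nothing to yield
        return
    depts = {}                            # dno -> [dname, loc]
    for row in rows:
        if row and row[0] == "empno":     # employee header: second part begins
            yield row + dept_header[1:]   # combined header row
            break
        if row:
            depts[row[0]] = row[1:]
    for row in rows:                      # continues after the employee header
        if row:                           # skip blank lines
            # Assumption: an unknown dno gets empty fields instead of KeyError
            yield row + depts.get(row[3], list(missing))


# Small sample in the same layout as the question, plus one unmatched employee
sample = """dno,dname,loc
10,FIN,USA
20,HR,EUR
empno,ename,sal,dno
2100,AMY,1001,10
3100,RINI,3001,20
9999,NOBODY,1,99
""".splitlines()

joined = list(join_on_dno(csv.reader(sample)))
```

To process the real file, pass `csv.reader(e_file)` to the generator and call `writer.writerow` on each yielded row. Only the department dictionary is held in memory, so the employee section is streamed regardless of its size.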
