Python: Combine data based on a key column
I have data containing both parent and child records in the same text file (two headers). The parent records are departments, the child records are employees, and dno is the join column.
dno,dname,loc
10,FIN,USA
20,HR,EUR
30,SEC,AUS
empno,ename,sal,dno
2100,AMY,1001,10
2200,JENNY,2001,10
3100,RINI,3001,20
4100,EMP4,4001,30
4200,EMP5,5001,30
4300,EMP6,6001,30
I would like to combine the two sets of records by dno and create output like the following:
empno,ename,sal,dno,dname,loc
2100,AMY,1001,10,FIN,USA
2200,JENNY,2001,10,FIN,USA
3100,RINI,3001,20,HR,EUR
4100,EMP4,4001,30,SEC,AUS
4200,EMP5,5001,30,SEC,AUS
4300,EMP6,6001,30,SEC,AUS
Python version - 2.6
I have tried the following solution:
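import csv

dept_lst = []
emp_lst = []
with open(efile, 'rb') as e_file:
    reader = csv.reader(e_file, delimiter=",")
    for row in reader:
        # Note: this condition is always true, so both header rows are kept.
        # They are then joined to each other on 'dno', which is what
        # produces the header line of the output.
        if ((row[0] != 'dno' and row[0] != 'dname') or
                (row[0] != 'empno' and row[0] != 'ename')):
            if len(row) == 3:
                dept_lst.append(row)
            elif len(row) == 4:
                emp_lst.append(row)
# Nested-loop join: every employee row is compared with every department row.
result = [e + [d[1], d[2]] for e in emp_lst for d in dept_lst if e[3] == d[0]]
for line in result:
    print ",".join(line)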
Question: The original data is more than 1GB, and this seems to work, but I am not sure it is an optimal solution. I would like to know whether there are more efficient ways or alternatives for handling this scenario using only the Python Standard Library (2.6).
Consider reading the first part and building a dictionary keyed by dno, then switching to the second part and looking up each employee's department in that dictionary. Also, consider using a CSV writer to write each processed row as soon as it is ready instead of collecting the rows in a list.
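import csv

dno = {}
# Why do you open the file in the binary mode?
# The "with" statements are nested because Python 2.6 does not accept
# several context managers in a single "with" (that arrived in 2.7).
with open("efile.csv", "r") as e_file:
    with open("ofile.csv", "w") as o_file:
        reader = csv.reader(e_file)
        next(reader)  # Skip the header
        for row in reader:
            if row[0] == 'empno':
                break  # The second part begins
            dno[row[0]] = row[1:]
        writer = csv.writer(o_file)
        # The reader continues from the first employee row.
        for row in reader:
            writer.writerow(row + dno[row[3]])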
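The same idea can be wrapped in a function that also writes the combined header line shown in the desired output. This is only a sketch under a few assumptions: the name join_by_dno and its parameters are illustrative, the join column is the first column of the department header, and an employee whose dno has no matching department gets empty department fields (the question does not say what should happen in that case). Only the department rows are held in memory; employee rows are written out one at a time.

```python
import csv


def join_by_dno(lines, out):
    """Append department fields to each employee row, matched on the key column.

    lines: any iterable of text lines (an open file works).
    out:   a writable file-like object.
    Returns the number of employee rows written.
    (Function and parameter names are illustrative, not from the answer.)
    """
    reader = csv.reader(lines)
    writer = csv.writer(out, lineterminator="\n")

    dept_header = next(reader)            # e.g. dno,dname,loc
    departments = {}                      # key value -> remaining fields
    emp_header = None
    for row in reader:
        if not row:
            continue                      # ignore blank lines
        if row[0] == "empno":             # second header: employees start
            emp_header = row
            break
        departments[row[0]] = row[1:]

    if emp_header is None:
        return 0                          # no employee section found

    # Position of the join column (dno) within the employee rows.
    key_idx = emp_header.index(dept_header[0])
    writer.writerow(emp_header + dept_header[1:])

    # Assumption: unknown department -> empty department fields.
    blank = [""] * (len(dept_header) - 1)
    written = 0
    for row in reader:
        if not row:
            continue
        writer.writerow(row + departments.get(row[key_idx], blank))
        written += 1
    return written
```

The function takes already-opened file objects, so the caller decides how to open them. On Python 2.6 the csv module expects files opened in binary mode ('rb' and 'wb'); on Python 3 it expects text mode with newline=''.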