
Compare two columns of a file using Python or Unix

I have data in a file such as a CSV, or a text file with a specific separator. For example:

date|Symbol
2017-05-01|A
2017-05-01|B
2017-05-01|C
2017-05-01|A
2017-05-02|A
2017-05-02|B
2017-05-02|C
2017-05-03|A
2017-05-04|A
2017-05-04|B
2017-05-04|C
2017-05-05|A
2017-05-05|A
2017-05-05|B
2017-05-06|C
2017-05-06|A
2017-05-07|A
2017-05-05|B
2017-05-07|C
2017-05-08|A

Now I want to check whether any symbol is repeated on a particular day and, if so, get that symbol together with its date. For example, symbol A is repeated on 01-May and B on 05-May.

I am trying to do this in Python by putting all the symbols in a list and then checking the first column to see whether any date is repeated.

Are there any other solutions than this?

Read the file line by line, then split each line on the pipe character |:

ln.split("|")[1]

This gives the symbol characters, such as A, B ...

Compare each one with the others.
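
The steps above can be sketched as follows. This is only a minimal sketch, assuming Python 3 and pipe-delimited lines with the date in the first field and the symbol in the second. The function name find_repeats and the sample lines are hypothetical, introduced here for illustration.

```python
def find_repeats(lines, sep="|"):
    """Return the (date, symbol) pairs that occur more than once."""
    seen = set()      # every (date, symbol) pair met so far
    repeats = []      # pairs met at least twice, in order of first repeat
    for ln in lines:
        ln = ln.strip()
        if not ln:
            continue  # skip blank lines
        date, symbol = ln.split(sep)[:2]
        pair = (date, symbol)
        if pair in seen and pair not in repeats:
            repeats.append(pair)
        seen.add(pair)
    return repeats


# Hypothetical sample lines in the same layout as the question (no header).
sample = [
    "2017-05-01|A",
    "2017-05-01|B",
    "2017-05-01|A",
    "2017-05-05|A",
    "2017-05-05|A",
    "2017-05-05|B",
    "2017-05-05|B",
]
for date, symbol in find_repeats(sample):
    print(date, symbol)
```

When reading from a real file, the header row would need to be skipped first, for example by calling next() on the open file object before passing it to find_repeats.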

With Python's difflib (https://pymotw.com/2/difflib/):

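import difflib
# difflib_data is the sample-data module from the PyMOTW article linked above;
# it defines text1_lines and text2_lines.
from difflib_data import *

d = difflib.Differ()
diff = d.compare(text1_lines, text2_lines)
print('\n'.join(diff))  # print() call form, so this also runs on Python 3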
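
Since difflib_data is not part of the standard library, here is a self-contained variant of the same idea, assuming Python 3. The two lists are hypothetical sample data standing in for the symbols found on two different dates. Note that Differ reports how two sequences differ from each other; it does not by itself report repeats inside a single date, so this is probably only a partial fit for the question.

```python
import difflib

# Hypothetical sample data: the symbols listed for two different dates.
symbols_day1 = ["A", "B", "C", "A"]
symbols_day2 = ["A", "B", "C"]

d = difflib.Differ()
# compare() yields one line per element, prefixed with "  " (in both),
# "- " (only in the first sequence) or "+ " (only in the second).
diff = list(d.compare(symbols_day1, symbols_day2))
print("\n".join(diff))
```

Here the extra A in the first list shows up as a line starting with "- ".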

I created a list of dictionaries; each dictionary has a date as its key and a list of the column-2 values as its value. I then checked every dictionary to see whether anything repeats.

If anyone has a better solution than this, it is most welcome.

Here is the implementation code for the above:

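import csv

# Assumed to be set up earlier (not shown in the original post): file_path,
# delmtr, date_col and inst_col (1-based column numbers), new_file_name,
# nw_fl (the open output file) and csvwriter (a csv.writer on nw_fl).
# The initial values below are assumptions as well.
is_header = 1    # 1 while the header row has not been skipped yet
count = 0        # number of repeated rows left out of the new file
date_list = []   # dates seen so far
p_list = []      # one {date: [symbols]} dictionary per date

inst_fl_col = inst_col - 1   # convert to 0-based indexes
date_fl_col = date_col - 1

with open(file_path, newline="") as f:   # Python 3: text mode for csv
    reader = csv.reader(f, delimiter=delmtr)
    for line in reader:
        if is_header == 1:
            is_header = 0
            continue
        if line[date_fl_col] not in date_list:
            # First row for this date: record it and keep the row.
            date_list.append(line[date_fl_col])
            p_list.append({line[date_fl_col]: [line[inst_fl_col]]})
            csvwriter.writerow(line)
        else:
            for dicts in p_list:
                for k, v in dicts.items():
                    if k == line[date_fl_col]:
                        if line[inst_fl_col] not in v:
                            v.append(line[inst_fl_col])
                            csvwriter.writerow(line)
                        else:
                            count += 1   # repeated (date, symbol) pair
nw_fl.close()
print(str(count) + " rows ignored in newly created " + new_file_name + " file")
del date_list[:], is_header, csvwriter, count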

I did this using basic Python knowledge; I am now improving it using the collections module and its defaultdict class. Please let me know if anyone needs the improved code.
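
The improved code itself is not shown in the post. One possible shape for a defaultdict-based version is sketched below, assuming Python 3; it is not the author's actual code. The function name dedupe_rows, its parameters, and the sample text are hypothetical, and the output file handling is left out to keep the sketch short.

```python
import csv
import io
from collections import defaultdict


def dedupe_rows(rows, date_idx=0, inst_idx=1):
    """Split rows into kept rows and repeated (date, symbol) rows."""
    seen = defaultdict(set)   # date -> set of symbols already seen that day
    kept, ignored = [], []
    for row in rows:
        date, symbol = row[date_idx], row[inst_idx]
        if symbol in seen[date]:
            ignored.append(row)      # repeat of an earlier row
        else:
            seen[date].add(symbol)
            kept.append(row)
    return kept, ignored


# Hypothetical sample in the same pipe-delimited layout as the question.
sample = "date|Symbol\n2017-05-01|A\n2017-05-01|B\n2017-05-01|A\n2017-05-02|A\n"
reader = csv.reader(io.StringIO(sample), delimiter="|")
header = next(reader)                # skip the header row
kept, ignored = dedupe_rows(reader)
print(len(ignored), "rows ignored")
print(ignored)
```

Using a set per date makes each membership check a single lookup, instead of scanning a list of dictionaries for every row as the code above does.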

Suggestions are most welcome.
