Compare two CSV files to output the matches (Python)
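I have a CSV file named "organs.csv" and another CSV file containing a large amount of data. I am comparing them to find the matches between them. The second file has no particular format, so I don't know which column contains the organ data. I tried the code below to get the matches, but it has two problems.
I would like it to produce the desired output shown further down.
Code (Python 2):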
import csv

filename = "file.csv"
complist, orglist = [], []
fileA = open(filename, "rb")
reader = csv.reader(fileA, delimiter=',')
for row in reader:
    for row_str in row:
        complist.append(row_str)
with open("organs.csv", "rb") as fileB:
    reader = csv.reader(fileB, delimiter='\n')
    for row in reader:
        orglist += row
orglist = [x.lower() for x in orglist]
org = open("organ_matches.txt", "wb")
org_writer = csv.writer(org)
for s in complist:
    for xs in orglist:
        if xs in s:
            print >> org, xs
org.close()
orgfile = open("organ_matches.txt", "r")
organ = orgfile.read()
organ = organ.split("\n")
organ = ",".join(organ)
organ = organ.split(",")
orgfile.close()
print organ
csv1:
forearm
leg
abdomen
csv2:
h1,h2,h3,h4
data1,forearm biopsy,tissue,cell
data2,leg injury,tissue in leg,cell9
data4,data,tissue4,cell6
Current output:
['forearm','leg','leg']
Desired output:
['forearm','leg','-']
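Here I end up using a list comprehension * to store the organ names. Next, I loop over the other file from its second line to its last, using the auxiliary variable stop to exit from two loops at once (this is the part you were missing...).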
organs = [line.strip() for line in file('uno.csv')]
matches = []
for line in [line for line in file('due.csv')][1:]:
    stop = 0
    matches.append('-')
    for item in line.split(','):
        if stop : break
        for organ in organs:
            if organ in item:
                matches[-1] = organ
                stop = 1
print matches
Here I remove the unsightly auxiliary variable and use a trickier, more obscure, but (to me...) more pleasing approach:
organs = [line.strip() for line in file('uno.csv')]
matches = []
for line in [line for line in file('due.csv')][1:]:
    match = '-'
    for item in line.split(','):
        if match != '-' : break
        for organ in organs:
            if organ in item:
                match = organ
    matches.append(match)
print matches
['forearm', 'leg', '-']
* Edit: the order of organs seems to matter to you, so I changed the data structure used to store the organ names from a set to a list.
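For readers on Python 3, where file() and the print statement no longer exist, the same row-by-row idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name match_rows, the missing parameter, the lowercasing of each cell, and the in-memory sample data (standing in for the two files from the question) are not part of the original answer.

```python
import csv
import io

def match_rows(organs, rows, missing="-"):
    """Return one entry per row: the first organ found in the leftmost matching cell."""
    matches = []
    for row in rows:
        match = missing
        for cell in row:
            # next() with a default gives the first organ contained in the cell, or None
            found = next((organ for organ in organs if organ in cell.lower()), None)
            if found is not None:
                match = found
                break  # stop scanning this row after the first matching cell
        matches.append(match)
    return matches

# In-memory stand-ins for the two sample files shown in the question
organs_text = "forearm\nleg\nabdomen\n"
data_text = (
    "h1,h2,h3,h4\n"
    "data1,forearm biopsy,tissue,cell\n"
    "data2,leg injury,tissue in leg,cell9\n"
    "data4,data,tissue4,cell6\n"
)

organs = [line.strip().lower() for line in organs_text.splitlines() if line.strip()]
reader = csv.reader(io.StringIO(data_text))
next(reader)  # skip the header row
result = match_rows(organs, reader)
print(result)  # ['forearm', 'leg', '-']
```

With real files, the csv documentation recommends opening them in text mode with newline='' on Python 3 rather than with 'rb'.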
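Edit #2
From the OP it is clear that only one match is wanted for each line of due.csv. What I failed to see (in hindsight) is how that single match should be chosen.
I think we want to scan the items in each line from left to right and stop scanning as soon as we find a match. So far so good... but what if an item matches more than one organ?
My current code always runs the for loop over organs to completion, so the match that gets appended is the last matching organ in the order defined by uno.csv...
If the match you want is the first one instead, my code must be modified by adding a break to the for loop over organs: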
for organ in organs:
    if organ in item:
        match = organ
        break
That said, the choice is yours...
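To make the first-versus-last distinction concrete, here is a small Python 3 sketch. The helper name match_in_item, its first flag, and the sample cell text are hypothetical and chosen only for illustration. Note that "first" and "last" refer to the order of the organ list, not to the position of the words inside the cell.

```python
def match_in_item(item, organs, first=True):
    """Return the first (or last) organ from `organs` that occurs in `item`, or '-'."""
    match = "-"
    for organ in organs:
        if organ in item:
            match = organ
            if first:
                break  # keep the earliest matching organ in list order
    return match

organs = ["forearm", "leg", "abdomen"]
item = "leg and abdomen pain"  # hypothetical cell that mentions two organs

first_hit = match_in_item(item, organs, first=True)
last_hit = match_in_item(item, organs, first=False)
print(first_hit, last_hit)  # leg abdomen
```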
The following code generally works fine and ignores the header row of csv2:
import csv

orglist = []
organ_matches = []

# Generate list of organs
with open('organs.csv', 'rb') as f_org:
    csv_f = csv.reader(f_org)
    for row in csv_f:
        orglist.append(row[0])

# Convert to a set
set_org = set(orglist)

# Read csv2 file
with open('file.csv', 'rb') as f_tbl:
    # Open output file to write to
    with open('organ_matches.txt', 'wb') as f_out:
        csv_f = csv.reader(f_tbl)
        csv_f.next()  # Ignore header
        for row in csv_f:
            set_row = set(' '.join(row).split(' '))  # Combine list elements and separate words
            # Find common words with organs list and select only one
            if set_row.intersection(set_org):
                organ_match = list(set_row.intersection(set_org))[0]
            else:
                organ_match = '-'
            organ_matches.append(organ_match)
            f_out.write(organ_match + '\n')
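Two properties of the set-based code above are worth keeping in mind. Because sets are unordered, taking element [0] of the intersection picks an arbitrary organ when a row contains several. It also matches whole space-separated words only, so a cell containing "legs" would not match "leg". The Python 3 sketch below keeps the whole-word behaviour but picks the first organ in list order instead. The helper name word_match, the lowercasing, and the in-memory sample data are assumptions for illustration, not part of the original answer.

```python
import csv
import io

def word_match(row, organs, missing="-"):
    """Return the first organ (in `organs` order) that appears as a whole word in the row."""
    words = set(" ".join(row).lower().split())
    for organ in organs:
        if organ in words:
            return organ
    return missing

organs = ["forearm", "leg", "abdomen"]
data_text = (
    "h1,h2,h3,h4\n"
    "data1,forearm biopsy,tissue,cell\n"
    "data2,leg injury,tissue in leg,cell9\n"
    "data4,data,tissue4,cell6\n"
)

reader = csv.reader(io.StringIO(data_text))
next(reader)  # skip the header row
word_matches = [word_match(row, organs) for row in reader]
print(word_matches)  # ['forearm', 'leg', '-']

# Whole-word matching: "legs" is a different word from "leg"
print(word_match(["data3", "legs swollen"], organs))  # -
```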
You only need to loop over the data file (complist) once, which removes the extra nested loop.
So your:
for s in complist:
    for xs in orglist:
        if xs in s:
            print >> org, xs
becomes:
for s in complist:
    if s in orglist:
        print >> org, s
    else:
        print >> org, '-'
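One thing worth checking before adopting the single-loop version: s in orglist tests whether the whole cell equals an organ name, whereas the original xs in s is a substring test. The short Python 3 sketch below, with hypothetical cell values, shows how the two differ. Both variants emit one entry per cell rather than one per row.

```python
complist = ["data1", "forearm biopsy", "tissue", "leg"]  # hypothetical cells
orglist = ["forearm", "leg", "abdomen"]

# Exact membership: the cell must be exactly an organ name
exact = [s if s in orglist else "-" for s in complist]

# Substring test: the first organ name contained anywhere in the cell
substring = [next((xs for xs in orglist if xs in s), "-") for s in complist]

print(exact)      # ['-', '-', '-', 'leg']
print(substring)  # ['-', 'forearm', '-', 'leg']
```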