How to find a columns set for a primary key candidate in CSV file?
I have a CSV file (not normalized; this is an example, the real file has up to 100 columns):
ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
1, CUST1, CLIENT1, 10, 2018-04-01, 2018-04-02
2, CUST1, CLIENT1, 10, 2018-04-01, 2018-05-30
3, CUST1, CLIENT1, 101, 2018-04-02, 2018-04-03
4, CUST2, CLIENT1, 102, 2018-04-02, 2018-04-03
How can I find all the column sets that could be used as a primary key?
Desired output:
1) ID
2) PAYMENT_NUM,START_DATE,END_DATE
3) CUST_NAME, CLIENT_NAME, PAYMENT_NUM,START_DATE,END_DATE
I can do this in Java, but perhaps Python / Pandas already provides a quick solution.
pandas and itertools will give you what you need.
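import pandas
from itertools import chain, combinations

def key_options(items):
    # every non-empty combination of the items, from single columns up to all of them
    return chain.from_iterable(combinations(items, r) for r in range(1, len(items) + 1))

# skipinitialspace drops the blank that follows each comma in the sample file
df = pandas.read_csv('test.csv', skipinitialspace=True)

# iterate over all combos of headings, excluding ID for brevity
for candidate in key_options(list(df)[1:]):
    deduped = df.drop_duplicates(subset=list(candidate))
    # no rows dropped means the combination is unique for every row
    if len(deduped.index) == len(df.index):
        print(', '.join(candidate))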
This gives the output:
CUST_NAME, END_DATE
PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, END_DATE
CUST_NAME, PAYMENT_NUM, END_DATE
CUST_NAME, START_DATE, END_DATE
CLIENT_NAME, PAYMENT_NUM, END_DATE
PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, END_DATE
CUST_NAME, CLIENT_NAME, START_DATE, END_DATE
CUST_NAME, PAYMENT_NUM, START_DATE, END_DATE
CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
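To try the same idea without a test.csv on disk, here is a minimal self-contained sketch. It assumes the sample rows from the question are loaded from a string, and it swaps the drop_duplicates length comparison for DataFrame.duplicated, which answers the same question. The names SAMPLE_CSV, unique_column_sets, sample_df and non_id_keys are made up for this illustration.

```python
import io
from itertools import chain, combinations

import pandas as pd

# Sample rows copied from the question (illustration only).
SAMPLE_CSV = """ID, CUST_NAME, CLIENT_NAME, PAYMENT_NUM, START_DATE, END_DATE
1, CUST1, CLIENT1, 10, 2018-04-01, 2018-04-02
2, CUST1, CLIENT1, 10, 2018-04-01, 2018-05-30
3, CUST1, CLIENT1, 101, 2018-04-02, 2018-04-03
4, CUST2, CLIENT1, 102, 2018-04-02, 2018-04-03
"""


def unique_column_sets(df, columns):
    """Return every combination of `columns` that identifies each row uniquely."""
    all_combos = chain.from_iterable(
        combinations(columns, r) for r in range(1, len(columns) + 1)
    )
    # duplicated() flags rows that repeat an earlier row on the given subset,
    # so a combination is a key candidate when no row is flagged.
    return [c for c in all_combos if not df.duplicated(subset=list(c)).any()]


# skipinitialspace strips the blank that follows each comma.
sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV), skipinitialspace=True)

# As in the answer, ID is left out for brevity.
non_id_keys = unique_column_sets(sample_df, list(sample_df.columns)[1:])
for key in non_id_keys:
    print(", ".join(key))
```

Passing the full column list instead of the sliced one would include the combinations that contain ID as well.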
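Here is one way using itertools.combinations. For each set of columns, it drops the duplicate rows and checks whether the size of the dataframe changes.
This yields 44 distinct column combinations.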
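from itertools import combinations, chain

# df is the dataframe loaded from the CSV file
# all non-empty combinations of column names, ID included
full_list = chain.from_iterable(combinations(df.columns, i) for i in range(1, len(df.columns) + 1))

n = len(df.index)

res = []
for cols in full_list:
    cols = list(cols)
    # the combination is a key candidate if dropping duplicates removes no rows
    if len(df[cols].drop_duplicates().index) == n:
        res.append(cols)

print(len(res))  # 44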
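Most of those combinations are supersets of a smaller unique set, and with up to 100 columns testing every combination is not feasible. One possible refinement, sketched here under my own assumptions rather than taken from either answer, is to search by increasing size, skip any combination that already contains a known key, and optionally cap the key size. The names minimal_keys, max_size and demo_df are hypothetical.

```python
from itertools import combinations

import pandas as pd


def minimal_keys(df, max_size=None):
    """Return unique column sets that contain no smaller unique column set."""
    columns = list(df.columns)
    max_size = len(columns) if max_size is None else max_size
    found = []  # minimal keys found so far, stored as frozensets
    for size in range(1, max_size + 1):
        for cols in combinations(columns, size):
            candidate = frozenset(cols)
            # A superset of a known key is unique too, so skip it untested.
            if any(key <= candidate for key in found):
                continue
            if not df.duplicated(subset=list(cols)).any():
                found.append(candidate)
    # Report each key in the original column order.
    return [[c for c in columns if c in key] for key in found]


# Demo frame mirroring the sample rows from the question (illustration only).
demo_df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "CUST_NAME": ["CUST1", "CUST1", "CUST1", "CUST2"],
    "CLIENT_NAME": ["CLIENT1"] * 4,
    "PAYMENT_NUM": [10, 10, 101, 102],
    "START_DATE": ["2018-04-01", "2018-04-01", "2018-04-02", "2018-04-02"],
    "END_DATE": ["2018-04-02", "2018-05-30", "2018-04-03", "2018-04-03"],
})

print(minimal_keys(demo_df))
```

On the four sample rows this leaves three minimal keys: ID alone, CUST_NAME with END_DATE, and PAYMENT_NUM with END_DATE. Setting max_size to a small number such as 3 or 4 keeps the search tractable on wide files, at the cost of missing larger keys.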