Machine Learning Approach needed: Predict most likely feature value given all other features in a feature vector
Reconciling Records (Date and Number Value): Given two datasets with multiple features, how to get the most likely match?
Suppose I have two datasets, base and payment.
base is:
[ id, timestamp, value ]
payment is:
[ payment_id, timestamp, value, gateway ]
I want to reconcile base and payment. The desired result is:
[ id, timestamp, value, payment_id, gateway, probability ]
Essentially, for a given base entry it should tell me the most likely payment_id. The match should take both the timestamp and the value into account. I would be satisfied if it only returned the single highest-probability match, but second/third suggestions would not bother me either.
So far I have read up on fuzzy matching, similarity learning, cosine similarity and the like, but I can't seem to apply any of it to my problem. I thought about doing something manually:
for each_entry in base:
    value_difference = base['value'] - payment['value']
    time_difference = base['timestamp'] - payment['timestamp']
    if value_difference <= 0.1 and time_difference <= 0.1:
        # if the difference is small, then tell me the payment_id
The problem is that this looks like a really "dumb" approach: if more than one payment entry meets the criteria there can be conflicts, and I would have to tune the parameters by hand to get good results. I am hoping to find a smarter, more automatic way to reconcile the two datasets.
Does anyone have suggestions on how to approach this?
Edit, my current state:
import datetime

import pandas as pd
from pandas import ExcelWriter

orders = pd.read_excel("Orders.xlsx")
pmts = pd.read_excel("Payments.xlsx")

pmts['date'] = pd.to_datetime(pmts.date)
orders['data'] = pd.to_datetime(orders.data)

# Normalise both sheets into lists of plain dicts.
payment_list = []
for index, row in pmts.iterrows():
    new_entry = {}
    ts = row['date']
    new_entry['id'] = row['id']
    new_entry['date'] = ts.to_pydatetime()
    new_entry['value'] = row['value']
    new_entry['types'] = row['pmt']
    new_entry['results'] = []
    payment_list.append(new_entry)

order_list = []
for index, row in orders.iterrows():
    new_entry = {}
    ts = row['data']
    new_entry['id'] = row['Id1']
    new_entry['date'] = ts.to_pydatetime()
    new_entry['value'] = row['valor']
    new_entry['types'] = row['nome']
    new_entry['results'] = []
    order_list.append(new_entry)

# Record every payment with the same value inside a 60-minute window.
delta_ref = datetime.timedelta(minutes=60)
for each_entry in order_list:
    for each_payment in payment_list:
        delta_value = each_entry['value'] - each_payment['value']
        try:
            delta_time = abs(each_entry['date'] - each_payment['date'])
        except TypeError:
            continue  # skip rows whose date failed to parse
        if delta_value == 0 and delta_time < delta_ref:
            results = [each_payment['types'], delta_time, each_payment['id']]
            each_entry['results'].append(results)
            each_payment['results'].append(each_entry['id'])

orders2 = pd.DataFrame(order_list)
writer = ExcelWriter('OrdersList.xlsx')
orders2.to_excel(writer)
writer.save()

pmts2 = pd.DataFrame(payment_list)
writer = ExcelWriter('PaymentList.xlsx')
pmts2.to_excel(writer)
writer.save()
OK, now I have something. It returns every entry with the same value and a timedelta below x (60 minutes in this case). What I still can't do is narrow that down to the single most likely match, or attach a probability that the match is correct (same amount, small time window). I'll keep working on it.
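One way to turn the candidate list above into something probability-like is to score each candidate by its time and value distance and normalise the scores with a softmax. This is only a sketch: the 60-minute and 0.01-value scales below are illustrative choices, not derived from your data, and the resulting numbers are relative confidences rather than true probabilities.

```python
import math
from datetime import datetime, timedelta

def match_probabilities(order, candidates,
                        time_scale=timedelta(minutes=60), value_scale=0.01):
    """Score each candidate payment for one order and softmax-normalise.

    A candidate's score drops as its time and value distances grow; the
    scales control how fast. Returns (payment, probability) pairs
    sorted best-first.
    """
    scores = []
    for p in candidates:
        dt = abs((order["date"] - p["date"]) / time_scale)
        dv = abs(order["value"] - p["value"]) / max(value_scale, 1e-9)
        scores.append(-(dt + dv))          # less negative = better match
    total = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / total for s in scores]
    return sorted(zip(candidates, probs), key=lambda pair: -pair[1])

# Toy usage: two payments with the right amount, one much closer in time.
order = {"date": datetime(2020, 1, 1, 10, 0), "value": 100.0}
pays = [
    {"id": "p1", "date": datetime(2020, 1, 1, 10, 5), "value": 100.0},
    {"id": "p2", "date": datetime(2020, 1, 1, 13, 0), "value": 100.0},
]
ranked = match_probabilities(order, pays)
print(ranked[0][0]["id"], round(ranked[0][1], 2))
```

Because the whole ranked list is returned, the second/third suggestions you mentioned come for free.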
The simplest approach is probably to pick the base/payment pair with the smallest difference. For example:
base_data = [...]     # all base data
payment_data = [...]  # all payment data

def prop_diff(a, b, props):
    # Sum the absolute differences over the specified properties.
    # Note: the properties must be numeric, so convert timestamps
    # (e.g. to epoch seconds) before calling this.
    return sum(abs(a[prop] - b[prop]) for prop in props)

def join_data(base, payment):
    # you need to implement your merging strategy here
    return joined_base_and_payment

results = []  # where we will store our merged results
working_payment = payment_data.copy()
for base in base_data:
    # compute the difference against every remaining payment
    diffs = []
    for payment in working_payment:
        diffs.append(prop_diff(base, payment, ['value', 'timestamp']))
    # find the index of the payment with the minimum difference
    min_idx = 0
    for i, d in enumerate(diffs):
        if d < diffs[min_idx]:
            min_idx = i
    # append the result of the joined pair
    results.append(join_data(base, working_payment[min_idx]))
    del working_payment[min_idx]  # remove the selected payment

print(results)
The basic idea is to compute the total difference between the lists and pick the pair with the smallest one. I copy payment_data so we don't destroy it, and each payment entry is removed once it has been matched to a base entry and the joined result appended.
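One caveat with this greedy loop: an early base entry can grab the payment a later entry needed, because each choice is made locally. If both lists fit in memory, finding the globally optimal one-to-one matching is a standard assignment problem, and SciPy's `linear_sum_assignment` solves it directly. A sketch with toy data (the 0.1 weight between value and time cost is an illustrative assumption):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy data: rows are orders, columns are payments.
order_values = np.array([100.0, 250.0])
order_times = np.array([0.0, 120.0])     # minutes since some epoch
pay_values = np.array([250.0, 100.0])
pay_times = np.array([150.0, 5.0])

# Cost of assigning order i to payment j: weighted sum of absolute
# value and time differences (weights are illustrative, tune to taste).
cost = (np.abs(order_values[:, None] - pay_values[None, :])
        + 0.1 * np.abs(order_times[:, None] - pay_times[None, :]))

rows, cols = linear_sum_assignment(cost)  # minimises the total cost
print(list(zip(rows, cols)))              # order i -> payment cols[i]
```

Here the greedy approach would have been tempted pair-by-pair, but the solver assigns order 0 to payment 1 and order 1 to payment 0, minimising the total cost across all pairs at once.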