Regroup column values in a pandas df
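I have a script that assigns a value based on two columns in a pandas df. The code below implements the first step, but I'm struggling with the second.
So the script should initially:
1) Assign a Person to each individual string in [Area], covering the first 3 unique values in [Place]
2) Look to reassign People that have fewer than 3 unique values
For example, the df below has 6 unique values across [Area] and [Place], but 3 People are assigned. Ideally, 2 people would get 3 unique values each.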
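import numpy as np
import pandas as pd

d = ({
    'Time' : ['8:03:00','8:17:00','8:20:00','10:15:00','10:15:00','11:48:00','12:00:00','12:10:00'],
    'Place' : ['House 1','House 2','House 1','House 3','House 4','House 5','House 1','House 1'],
    'Area' : ['X','X','Y','X','X','X','X','X'],
    })
df = pd.DataFrame(data=d)

def g(gps):
    # number the unique places within one Area in chunks of three: 1, 1, 1, 2, 2, 2, ...
    s = gps['Place'].unique()
    d = dict(zip(s, np.arange(len(s)) // 3 + 1))
    gps['Person'] = gps['Place'].map(d)
    return gps

# group_keys=False keeps the original row index; newer pandas versions
# would otherwise add 'Area' as an extra index level here
df = df.groupby('Area', sort=False, group_keys=False).apply(g)
# combine the per-Area chunk number with the Area, then renumber globally
s = df['Person'].astype(str) + df['Area']
df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')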
Output:
       Time    Place Area    Person
0   8:03:00  House 1    X  Person 1
1   8:17:00  House 2    X  Person 1
2   8:20:00  House 1    Y  Person 2
3  10:15:00  House 3    X  Person 1
4  10:15:00  House 4    X  Person 3
5  11:48:00  House 5    X  Person 3
6  12:00:00  House 1    X  Person 1
7  12:10:00  House 1    X  Person 1
As you can see, the first step works fine. For each individual string in [Area], the first 3 unique values in [Place] are assigned to a Person. This leaves Person 1 with 3 values, Person 2 with 1 value and Person 3 with 2 values.
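The per-Person counts just described can be checked directly. The following is only a sketch: it hard-codes the Place, Area and Person columns from the step-1 output rather than recomputing them, and counts distinct (Place, Area) pairs per Person.

```python
import pandas as pd

# Place, Area and Person copied from the step-1 output above
df = pd.DataFrame({
    'Place': ['House 1', 'House 2', 'House 1', 'House 3',
              'House 4', 'House 5', 'House 1', 'House 1'],
    'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X'],
    'Person': ['Person 1', 'Person 1', 'Person 2', 'Person 1',
               'Person 3', 'Person 3', 'Person 1', 'Person 1'],
})

# Drop repeated (Place, Area) pairs, then count what is left per Person
unique_counts = (
    df.drop_duplicates(['Place', 'Area'])
      .groupby('Person')
      .size()
)
print(unique_counts.to_dict())
# {'Person 1': 3, 'Person 2': 1, 'Person 3': 2}
```

Any Person whose count is below 3 is a candidate for regrouping in the second step.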
The second step is where I'm struggling.
If a Person has fewer than 3 unique values assigned to them, change the assignment so that each Person has up to 3 unique values.
Intended output:
       Time    Place Area    Person
0   8:03:00  House 1    X  Person 1
1   8:17:00  House 2    X  Person 1
2   8:20:00  House 1    Y  Person 2
3  10:15:00  House 3    X  Person 1
4  10:15:00  House 4    X  Person 2
5  11:48:00  House 5    X  Person 2
6  12:00:00  House 1    X  Person 1
7  12:10:00  House 1    X  Person 1
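Description:
Person 1 already has 3 unique values assigned. Person 2 and Person 3 have fewer, so they should be combined. All duplicate values should remain the same.
Below, I added a few lines before the last line of your code: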
import numpy as np
import pandas as pd

d = ({'Time': ['8:03:00', '8:17:00', '8:20:00', '10:15:00', '10:15:00', '11:48:00', '12:00:00', '12:10:00'],
      'Place': ['House 1', 'House 2', 'House 1', 'House 3', 'House 4', 'House 5', 'House 1', 'House 1'],
      'Area': ['X', 'X', 'Y', 'X', 'X', 'X', 'X', 'X']})
df = pd.DataFrame(data=d)

def g(gps):
    s = gps['Place'].unique()
    d = dict(zip(s, np.arange(len(s)) // 3 + 1))
    gps['Person'] = gps['Place'].map(d)
    return gps

# group_keys=False keeps the original row index on newer pandas versions
df = df.groupby('Area', sort=False, group_keys=False).apply(g)
s = df['Person'].astype(str) + df['Area']
# added lines
t = s.value_counts()                                      # rows per provisional group
df_sub = df.loc[s[s.isin(t[t < 3].index)].index].copy()   # rows of groups with fewer than 3 rows
df_sub["tag"] = df_sub["Place"] + df_sub["Area"]
tags = list(df_sub.tag.unique())
f = lambda x: f'R{int(tags.index(x) / 3) + 1}'            # regroup leftover tags in chunks of three
df_sub['reassign'] = df_sub.tag.apply(f)
s[s.isin(t[t < 3].index)] = df_sub['reassign']
df['Person'] = pd.Series(pd.factorize(s)[0] + 1).map(str).radd('Person ')
To be honest, I'm not sure it works in all cases, but it gives your intended output on the test case.
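One case worth checking is that `value_counts()` counts rows, while the question talks about unique values. The sketch below uses made-up group keys (`A`, `B`) and places, not the question's data, to show where the two counts can disagree.

```python
import pandas as pd

# Hypothetical group keys and places, for illustration only
s = pd.Series(['A', 'A', 'A', 'B', 'B'])
place = pd.Series(['House 1', 'House 1', 'House 2', 'House 3', 'House 4'])

row_counts = s.value_counts()                # rows per group
unique_counts = place.groupby(s).nunique()   # distinct places per group

leftover_by_rows = sorted(row_counts[row_counts < 3].index)
leftover_by_unique = sorted(unique_counts[unique_counts < 3].index)
print(leftover_by_rows)     # ['B']
print(leftover_by_unique)   # ['A', 'B']
```

Group `A` has three rows but only two distinct places, so a row-based test treats it as full while a unique-value test does not. On the question's sample data both tests happen to select the same groups.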
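Let's see if I can help by making sense of what you are trying to do.
You have sequential data (I'll call them events) and you want to assign a "person" identifier to each event. The identifier you assign to each successive event depends on the previous assignments, and it seems to me that it needs to follow these rules, applied in order:
I know you: I can reuse a previous identifier if the same values of "Place" and "Area" have already appeared for that identifier (does time have something to do with it?).
I do NOT know you: I will create a new identifier if a new value of Area appears (so Place and Area play different roles?).
Do I know you?: I might reuse a previously used identifier if that identifier has not yet been assigned to at least three events (what if this happens for several identifiers? I'll assume I use the oldest one...).
Nah, I don't: if none of the rules above apply, I will create a new identifier.
Assuming the above, here is an implementation of a solution: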
# dict of list of past events assigned to each person. key is person identifier
people = dict()
# new column for df (as list) it will be appended at the end to dataframe
persons = list()

# first we define the rules
def i_know_you(people, now):
    def conditions(now, past):
        return [e for e in past if (now.Place == e.Place) and (now.Area == e.Area)]
    i_do = [person for person, past in people.items() if conditions(now, past)]
    if i_do:
        return i_do[0]
    return False

def i_do_not_know_you(people, now):
    conditions = not bool([e for past in people.values() for e in past if e.Area == now.Area])
    if conditions:
        return f'Person {len(people) + 1}'
    return False

def do_i_know_you(people, now):
    i_do = [person for person, past in people.items() if len(past) < 3]
    if i_do:
        return i_do[0]
    return False

# then we process the sequential data
for event in df.itertuples():
    print('event:', event)
    for rule in [i_know_you, i_do_not_know_you, do_i_know_you]:
        person = rule(people, event)
        print('\t', rule.__name__, person)
        if person:
            break
    if not person:
        person = f'Person {len(people) + 1}'
        print('\t', "nah, I don't", person)
    if person in people:
        people[person].append(event)
    else:
        people[person] = [event]
    persons.append(person)
df['Person'] = persons
Output (note that the data in this trace has 10 rows and different times from the 8-row frame above):
event: Pandas(Index=0, Time='8:00:00', Place='House 1', Area='X', Person='Person 1')
i_know_you False
i_do_not_know_you Person 1
event: Pandas(Index=1, Time='8:30:00', Place='House 2', Area='X', Person='Person 1')
i_know_you False
i_do_not_know_you False
do_i_know_you Person 1
event: Pandas(Index=2, Time='9:00:00', Place='House 1', Area='Y', Person='Person 2')
i_know_you False
i_do_not_know_you Person 2
event: Pandas(Index=3, Time='9:30:00', Place='House 3', Area='X', Person='Person 1')
i_know_you False
i_do_not_know_you False
do_i_know_you Person 1
event: Pandas(Index=4, Time='10:00:00', Place='House 4', Area='X', Person='Person 2')
i_know_you False
i_do_not_know_you False
do_i_know_you Person 2
event: Pandas(Index=5, Time='10:30:00', Place='House 5', Area='X', Person='Person 2')
i_know_you False
i_do_not_know_you False
do_i_know_you Person 2
event: Pandas(Index=6, Time='11:00:00', Place='House 1', Area='X', Person='Person 1')
i_know_you Person 1
event: Pandas(Index=7, Time='11:30:00', Place='House 6', Area='X', Person='Person 3')
i_know_you False
i_do_not_know_you False
do_i_know_you False
nah, I don't Person 3
event: Pandas(Index=8, Time='12:00:00', Place='House 7', Area='X', Person='Person 3')
i_know_you False
i_do_not_know_you False
do_i_know_you Person 3
event: Pandas(Index=9, Time='12:30:00', Place='House 8', Area='X', Person='Person 3')
i_know_you False
i_do_not_know_you False
do_i_know_you Person 3
And the final dataframe is, as you want:
       Time    Place Area    Person
0   8:00:00  House 1    X  Person 1
1   8:30:00  House 2    X  Person 1
2   9:00:00  House 1    Y  Person 2
3   9:30:00  House 3    X  Person 1
4  10:00:00  House 4    X  Person 2
5  10:30:00  House 5    X  Person 2
6  11:00:00  House 1    X  Person 1
7  11:30:00  House 6    X  Person 3
8  12:00:00  House 7    X  Person 3
9  12:30:00  House 8    X  Person 3
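Remark: note that I intentionally avoided groupby operations and processed the data sequentially. I think this kind of complexity (and not really understanding what you want to do...) calls for that approach. Also, you can make the rules more complicated using the same structure as above (does time really play a role or not?).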
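As an illustration of how the same rule-chain structure could take time into account, here is a sketch only. The one-hour window, the `seen_recently` and `has_room` rules, the `Event` tuple and the sample events are all hypothetical and are not part of the question.

```python
from collections import namedtuple
from datetime import datetime, timedelta

# Stand-in for a row produced by df.itertuples()
Event = namedtuple('Event', ['Time', 'Place', 'Area'])

def parse(t):
    return datetime.strptime(t, '%H:%M:%S')

def seen_recently(window):
    """Build a rule: reuse a person who saw this Place/Area within `window`."""
    def rule(people, now):
        for person, past in people.items():
            for e in past:
                same_spot = e.Place == now.Place and e.Area == now.Area
                if same_spot and parse(now.Time) - parse(e.Time) <= window:
                    return person
        return False
    return rule

def has_room(people, now):
    """Reuse the oldest person that has fewer than three events."""
    for person, past in people.items():
        if len(past) < 3:
            return person
    return False

def assign(events, rules):
    people, labels = {}, []
    for event in events:
        person = False
        for rule in rules:            # first rule that answers wins
            person = rule(people, event)
            if person:
                break
        if not person:                # no rule matched: create a new person
            person = f'Person {len(people) + 1}'
        people.setdefault(person, []).append(event)
        labels.append(person)
    return labels

events = [
    Event('8:00:00', 'House 1', 'X'),
    Event('8:30:00', 'House 1', 'X'),
    Event('11:00:00', 'House 1', 'X'),
    Event('11:10:00', 'House 2', 'X'),
]
labels = assign(events, [seen_recently(timedelta(hours=1)), has_room])
print(labels)
# ['Person 1', 'Person 1', 'Person 1', 'Person 2']
```

The third event is outside the one-hour window, so it falls through to `has_room`; the fourth finds no person with room and gets a new identifier.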
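Looking at the new data, it is evident that I don't understand what you are trying to do (in particular, the assignment does not seem to follow sequential rules). I have a solution that works on your second dataset, but it gives a different result on the first one.
The solution is much simpler and adds a column (which you can drop later if you want):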
df["tag"] = df["Place"] + df["Area"]
tags = list(df.tag.unique())
f = lambda x: f'Person {int(tags.index(x) / 3) + 1}'
df['Person'] = df.tag.apply(f)
On the second dataset, it gives:
       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 3    X  House 3X  Person 1
3   9:30:00  House 1    Y  House 1Y  Person 2
4  10:00:00  House 1    Z  House 1Z  Person 2
5  10:30:00  House 1    V  House 1V  Person 2
On the first dataset, it gives:
       Time    Place Area       tag    Person
0   8:00:00  House 1    X  House 1X  Person 1
1   8:30:00  House 2    X  House 2X  Person 1
2   9:00:00  House 1    Y  House 1Y  Person 1
3   9:30:00  House 3    X  House 3X  Person 2
4  10:00:00  House 4    X  House 4X  Person 2
5  10:30:00  House 5    X  House 5X  Person 2
6  11:00:00  House 1    X  House 1X  Person 1
7  11:30:00  House 6    X  House 6X  Person 3
8  12:00:00  House 7    X  House 7X  Person 3
9  12:30:00  House 8    X  House 8X  Person 3
This differs from your intended output at indexes 2 and 3. Is this output fine for your requirements? If not, why not?
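The tag-based numbering above can also be written without the per-row `tags.index(x)` lookup. This is a sketch of an equivalent formulation using `pd.factorize`, which numbers values in order of first appearance; the frame is the second dataset shown above.

```python
import pandas as pd

# Place and Area columns of the second dataset shown above
df = pd.DataFrame({
    'Place': ['House 1', 'House 2', 'House 3', 'House 1', 'House 1', 'House 1'],
    'Area': ['X', 'X', 'X', 'Y', 'Z', 'V'],
})

tag = df['Place'] + df['Area']
codes, _ = pd.factorize(tag)   # 0, 1, 2, ... in order of first appearance
# every run of three distinct tags shares one person number
df['Person'] = ['Person ' + str(c // 3 + 1) for c in codes]
print(df['Person'].tolist())
# ['Person 1', 'Person 1', 'Person 1', 'Person 2', 'Person 2', 'Person 2']
```

Repeated tags receive the same code, so duplicate rows keep the same Person, as in the lambda version.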
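As far as I understand, you are happy with everything up to the Person allocation. So here is a plug-and-play solution to "merge" Persons that have fewer than 3 unique values, so that each Person ends up with 3 unique values, except obviously the last one. Based on the second-to-last df you posted ("Output:"), it does not touch those that already have 3 unique values and only merges the others.
EDIT: greatly simplified code. Again, taking your df as input: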
n = 3
df['complete'] = df.Person.apply(lambda x: 1 if df.Person.tolist().count(x) == n else 0)
df['num'] = df.Person.str.replace('Person ','')
df.sort_values(by=['num','complete'],ascending=True,inplace=True) #get all persons that are complete to the top
c = 0
person_numbers = []
for x in range(0,999): #Create the numbering [1,1,1,2,2,2,3,3,3,...] with n defining how often a person is 'repeated'
    if x % n == 0:
        c += 1
    person_numbers.append(c)
df['Person_new'] = person_numbers[0:len(df)] #Add the numbering to the df
df.Person = 'Person ' + df.Person_new.astype(str) #Fill the person column with the new numbering
df.drop(['complete','Person_new','num'],axis=1,inplace=True)
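The `[1,1,1,2,2,2,...]` numbering built by the loop can also be produced with integer division, the same idiom the question's own `g` function uses. A small sketch, where `n_rows` is a placeholder for `len(df)`:

```python
import numpy as np

n = 3        # how often each person number is repeated
n_rows = 8   # placeholder for len(df)

# integer division turns 0, 1, 2, 3, ... into 1, 1, 1, 2, ...
person_numbers = (np.arange(n_rows) // n + 1).tolist()

# the loop from the code above, limited to n_rows, for comparison
c, looped = 0, []
for x in range(n_rows):
    if x % n == 0:
        c += 1
    looped.append(c)

print(person_numbers)   # [1, 1, 1, 2, 2, 2, 3, 3]
```

This avoids the fixed upper bound of 999 rows in the loop.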
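First, this answer does not meet your requirement of only reassigning leftovers (so I don't expect you to accept it). That said, I'm posting it anyway because your time-window constraint was tricky to solve in a pandas world. Maybe my solution is not useful to you right now, but maybe later ;) At the very least it was a learning experience for me, so perhaps others can gain from it.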
import pandas as pd
from datetime import datetime, time, timedelta
import random

# --- helper functions for demo
random.seed( 0 )

def makeRandomTimes( nHours = None, mMinutes = None ):
    nHours = 10 if nHours is None else nHours
    mMinutes = 3 if mMinutes is None else mMinutes
    times = []
    for _ in range(nHours):
        hour = random.randint(8,18)
        for _ in range(mMinutes):
            minute = random.randint(0,59)
            times.append( datetime.combine( datetime.today(), time( hour, minute ) ) )
    return times

def makeDf():
    times = makeRandomTimes()
    houses = [ str(random.randint(1,10)) for _ in range(30) ]
    areas = [ ['X','Y'][random.randint(0,1)] for _ in range(30) ]
    df = pd.DataFrame( {'Time' : times, 'House' : houses, 'Area' : areas } )
    return df.set_index( 'Time' ).sort_index()

# --- real code begins
def evaluateLookback( df, idx, dfg ):
    mask = df.index >= dfg.Lookback.iat[-1]
    personTotals = df[ mask ].set_index('Loc')['Person'].value_counts()
    currentPeople = set(df.Person[ df.Person > -1 ])
    noAllocations = currentPeople - set(personTotals.index)
    available = personTotals < 3
    if noAllocations or available.sum():
        # allocate to first available person
        person = min( noAllocations.union(personTotals[ available ].index) )
    else:
        # allocate new person
        person = len( currentPeople )
    df.Person.at[ idx ] = person
    # debug
    df.Verbose.at[ idx ] = ( noAllocations, available.sum() )

def lambdaProxy( df, colName ):
    [ dff[1][colName].apply( lambda f: f(df,*dff) ) for dff in df.groupby(df.index) ]

lookback = timedelta( minutes = 120 )
df1 = makeDf()
df1[ 'Loc' ] = df1[ 'House' ] + df1[ 'Area' ]
df1[ 'Person' ] = None
df1[ 'Lambda' ] = evaluateLookback
df1[ 'Lookback' ] = df1.index - lookback
df1[ 'Verbose' ] = None
lambdaProxy( df1, 'Lambda' )
print( df1[ [ col for col in df1.columns if col != 'Lambda' ] ] )
Sample output on my machine looks like this:
House Area Loc Person Lookback Verbose
Time
2018-09-30 08:16:00 6 Y 6Y 0 2018-09-30 06:16:00 ({}, 0)
2018-09-30 08:31:00 4 Y 4Y 0 2018-09-30 06:31:00 ({}, 1)
2018-09-30 08:32:00 10 X 10X 0 2018-09-30 06:32:00 ({}, 1)
2018-09-30 09:04:00 4 X 4X 1 2018-09-30 07:04:00 ({}, 0)
2018-09-30 09:46:00 10 X 10X 1 2018-09-30 07:46:00 ({}, 1)
2018-09-30 09:57:00 4 X 4X 1 2018-09-30 07:57:00 ({}, 1)
2018-09-30 10:06:00 1 Y 1Y 2 2018-09-30 08:06:00 ({}, 0)
2018-09-30 10:39:00 10 X 10X 0 2018-09-30 08:39:00 ({0}, 1)
2018-09-30 10:48:00 7 X 7X 0 2018-09-30 08:48:00 ({}, 2)
2018-09-30 11:08:00 1 Y 1Y 0 2018-09-30 09:08:00 ({}, 3)
2018-09-30 11:18:00 2 Y 2Y 1 2018-09-30 09:18:00 ({}, 2)
2018-09-30 11:32:00 9 X 9X 2 2018-09-30 09:32:00 ({}, 1)
2018-09-30 12:22:00 5 Y 5Y 1 2018-09-30 10:22:00 ({}, 2)
2018-09-30 12:30:00 9 X 9X 1 2018-09-30 10:30:00 ({}, 2)
2018-09-30 12:34:00 6 X 6X 2 2018-09-30 10:34:00 ({}, 1)
2018-09-30 12:37:00 1 Y 1Y 2 2018-09-30 10:37:00 ({}, 1)
2018-09-30 12:45:00 4 X 4X 0 2018-09-30 10:45:00 ({}, 1)
2018-09-30 12:58:00 8 X 8X 0 2018-09-30 10:58:00 ({}, 1)
2018-09-30 14:26:00 7 Y 7Y 0 2018-09-30 12:26:00 ({}, 3)
2018-09-30 14:48:00 2 X 2X 0 2018-09-30 12:48:00 ({1, 2}, 1)
2018-09-30 14:50:00 8 X 8X 1 2018-09-30 12:50:00 ({1, 2}, 0)
2018-09-30 14:53:00 8 Y 8Y 1 2018-09-30 12:53:00 ({2}, 1)
2018-09-30 14:56:00 6 X 6X 1 2018-09-30 12:56:00 ({2}, 1)
2018-09-30 14:58:00 9 Y 9Y 2 2018-09-30 12:58:00 ({2}, 0)
2018-09-30 17:09:00 2 Y 2Y 0 2018-09-30 15:09:00 ({0, 1, 2}, 0)
2018-09-30 17:19:00 4 X 4X 0 2018-09-30 15:19:00 ({1, 2}, 1)
2018-09-30 17:57:00 6 Y 6Y 0 2018-09-30 15:57:00 ({1, 2}, 1)
2018-09-30 18:21:00 3 X 3X 1 2018-09-30 16:21:00 ({1, 2}, 0)
2018-09-30 18:30:00 9 X 9X 1 2018-09-30 16:30:00 ({2}, 1)
2018-09-30 18:35:00 8 Y 8Y 1 2018-09-30 16:35:00 ({2}, 1)
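Notes:
The lookback variable controls how far back in time to look when considering locations allocated to a person.
The Lookback column shows the cutoff time.
evaluateLookback is called repeatedly for each row in the table, with df being the whole DataFrame, idx the current index/label, and dfg the current row.
lambdaProxy controls the calls to evaluateLookback.
The number of locations per person is set to 3, but it can be adjusted as needed.
A value can also be evaluated first by lambdaProxy, with the result then stored and used in evaluateLookback.
There are also some interesting edge cases in the demo output: 10:39:00, 14:48:00, 17:09:00
Aside: it would be interesting to see "function columns" in pandas, perhaps with memoize-like capability? Ideally, the 'Person' column would take a function and calculate on request, either with its own row or with some variable window view. Has anyone seen something like that?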
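To make the lookback step concrete, here is a sketch that replays the decision for the 10:39:00 edge case. It uses the first seven rows of the sample output as already-allocated history and is a simplified stand-in, not the original evaluateLookback.

```python
import pandas as pd

# First seven rows of the sample output above (time and allocated person)
times = pd.to_datetime([
    '2018-09-30 08:16', '2018-09-30 08:31', '2018-09-30 08:32',
    '2018-09-30 09:04', '2018-09-30 09:46', '2018-09-30 09:57',
    '2018-09-30 10:06',
])
history = pd.DataFrame({'Person': [0, 0, 0, 1, 1, 1, 2]}, index=times)

now = pd.Timestamp('2018-09-30 10:39')
lookback = pd.Timedelta(minutes=120)

# Allocations per person inside the lookback window
window = history[history.index >= now - lookback]
totals = window['Person'].value_counts()

# People with no allocation in the window, and people with fewer than 3
no_allocations = set(history['Person'].tolist()) - set(totals.index.tolist())
available = set(totals[totals < 3].index.tolist())

candidates = no_allocations | available
person = min(candidates) if candidates else history['Person'].nunique()
print(no_allocations, available, person)
# {0} {2} 0
```

Person 0 has no allocation inside the two-hour window, so it becomes available again, which agrees with the `({0}, 1)` entry and the Person value 0 shown for 10:39:00.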
For step 2, how about this:
import numpy as np

# assumes df already has the 'Person' column produced by step 1
def reduce_df(df):
    values = df['Area'] + df['Place']
    df1 = df.loc[~values.duplicated(),:] # ignore duplicate values for this part..
    person_count = df1.groupby('Person')['Person'].agg('count')
    leftover_count = person_count[person_count < 3] # the 'leftovers'
    # try merging pairs together
    nleft = leftover_count.shape[0]
    to_try = np.arange(nleft - 1)
    to_merge = (leftover_count.values[to_try] +
                leftover_count.values[to_try + 1]) <= 3
    to_merge[1:] = to_merge[1:] & ~to_merge[:-1]
    to_merge = to_try[to_merge]
    merge_dict = dict(zip(leftover_count.index.values[to_merge+1],
                          leftover_count.index.values[to_merge]))
    def change_person(p):
        if p in merge_dict.keys():
            return merge_dict[p]
        return p
    reduced_df = df.copy()
    # update df with the merges you found
    reduced_df['Person'] = reduced_df['Person'].apply(change_person)
    return reduced_df

print(
    reduce_df(reduce_df(df)) # call twice in case 1,1,1 -> 2,1 -> 3
)
Output:
  Area    Place      Time    Person
0    X  House 1   8:03:00  Person 1
1    X  House 2   8:17:00  Person 1
2    Y  House 1   8:20:00  Person 2
3    X  House 3  10:15:00  Person 1
4    X  House 4  10:15:00  Person 2
5    X  House 5  11:48:00  Person 2
6    X  House 1  12:00:00  Person 1
7    X  House 1  12:10:00  Person 1
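Calling the merge step twice handles the `1,1,1 -> 2,1 -> 3` case; longer chains of leftovers might need more passes. The sketch below shows the idea of repeating a merge step until nothing changes. It works on a plain list of leftover counts, and `merge_once` and `merge_until_stable` are hypothetical helpers, not part of reduce_df.

```python
def merge_once(counts, limit=3):
    """Merge the first adjacent pair whose sum fits within `limit`."""
    for i in range(len(counts) - 1):
        if counts[i] + counts[i + 1] <= limit:
            return counts[:i] + [counts[i] + counts[i + 1]] + counts[i + 2:]
    return counts

def merge_until_stable(counts, limit=3):
    """Repeat merge_once until a pass leaves the counts unchanged."""
    while True:
        merged = merge_once(counts, limit)
        if merged == counts:
            return counts
        counts = merged

print(merge_until_stable([1, 1, 1]))   # [3]
print(merge_until_stable([2, 2, 1]))   # [2, 3]
```

With DataFrames the same loop could compare passes with `DataFrame.equals` instead of `==`.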