How to get all unique combinations of values in one column that are in another column
Loop through all the values (string) in one column and append the values in another column if not unique - Text processing
I would like to find a solution to the following problem:
import pandas as pd
rows = {'Id': ['xb01','nt02','tw02','dt92','tw03','we04','er04','ew06','re07','ti92'],
        'DatasetName': ['first label','second label','third label','fourth label','third label',
                        'third label','third label','fourth label','first label','last label'],
        'Target': ['first label','second label','the third labels','fourth label set','third label',
                   'third label','third label sets','fourth label sets','first label','last labels']
}
df = pd.DataFrame(rows, columns = ['Id', 'DatasetName','Target'])
print(df)
The dataframe looks like this:
Id DatasetName Target
xb01 first label first label
nt02 second label second label
tw02 third label the third labels
dt92 fourth label fourth label set
tw03 third label third label
we04 third label third label
er04 third label third label sets
ew06 fourth label fourth label sets
re07 first label first label
ti92 last label last labels
Pseudocode:
for i in range(len(df)):
    if DatasetName[i].is_unique:
        if DatasetName[i] != Target[i]:
            Target[i] = DatasetName[i] + '|' + Target[i]
    else:
        # Loop through the dataframe, find all labels that belong to the same
        # DatasetName, and append all those Target names together.
        # (Note: if the DatasetName is not the same as the Target name(s),
        # the DatasetName should also be appended to the Target.)
Here we can see:
DatasetName Appeared Target
first label 2 first label
second label 1 second label
third label 4 the third labels | third label | third label sets
fourth label 2 fourth label set | fourth label sets|fourth label
last label 1 last labels | last label
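A summary along these lines can be sketched with a groupby. This is only a rough illustration: the column names `Appeared` and `Targets` are chosen here for the example, and the sketch lists just the distinct Target values per DatasetName without appending the DatasetName itself.

```python
import pandas as pd

df = pd.DataFrame({
    'DatasetName': ['first label', 'second label', 'third label', 'fourth label',
                    'third label', 'third label', 'third label', 'fourth label',
                    'first label', 'last label'],
    'Target': ['first label', 'second label', 'the third labels', 'fourth label set',
               'third label', 'third label', 'third label sets', 'fourth label sets',
               'first label', 'last labels'],
})

# One row per DatasetName: how many rows it appears in, plus its distinct
# Target values in order of first appearance (sort=False keeps group order).
summary = (
    df.groupby('DatasetName', sort=False)['Target']
      .agg(Appeared='count', Targets=lambda s: ' | '.join(s.unique()))
      .reset_index()
)
print(summary.to_string(index=False))
```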
Expected output:
Id DatasetName Target
xb01 first label first label
nt02 second label second label
tw02 third label the third labels|third label|third label sets
dt92 fourth label fourth label set|fourth label sets |fourth label
tw03 third label the third labels|third label|third label sets
we04 third label the third labels|third label|third label sets
er04 third label the third labels|third label|third label sets
ew06 fourth label fourth label set|fourth label sets| fourth label
re07 first label first label
ti92 last label last labels|last label
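Note: the real dataframe has 100,000 rows. The strings may still contain extra whitespace (I have already lower-cased the dataframe, removed all extra punctuation, etc.). This question may contain some mistakes (typos), since I copied and pasted it several times, but hopefully you get the idea of the solution I am looking for. Thanks!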
import pandas as pd
rows = {'Id': ['xb01', 'nt02', 'tw02', 'dt92', 'tw03', 'we04',
               'er04', 'ew06', 're07', 'ti92'],
        'DatasetName': ['first label', 'second label', 'third label',
                        'fourth label', 'third label', 'third label',
                        'third label', 'fourth label',
                        'first label', 'last label'],
        'Target': ['first label', 'second label', 'the third labels',
                   'fourth label set', 'third label',
                   'third label', 'third label sets',
                   'fourth label sets', 'first label', 'last labels']
        }
df = pd.DataFrame(rows, columns=['Id', 'DatasetName', 'Target'])
# Collapse runs of whitespace in the string values to a single space
df = df.replace({r'\s+': ' '}, regex=True)
# Get the unique labels per DatasetName (the DatasetName itself comes first).
# pd.concat is used because Series.append was removed in pandas 2.0.
matches = df.groupby('DatasetName') \
    .apply(lambda x: pd.concat([x['DatasetName'], x['Target']]).unique()) \
    .agg('|'.join).rename('Target')
# Merge back to the original DataFrame
merged = df.drop(columns=['Target']).merge(matches, on='DatasetName', how="left")
# For display
print(merged.to_string())
Output:
Id DatasetName Target
0 xb01 first label first label
1 nt02 second label second label
2 tw02 third label third label|the third labels|third label sets
3 dt92 fourth label fourth label|fourth label set|fourth label sets
4 tw03 third label third label|the third labels|third label sets
5 we04 third label third label|the third labels|third label sets
6 er04 third label third label|the third labels|third label sets
7 ew06 fourth label fourth label|fourth label set|fourth label sets
8 re07 first label first label
9 ti92 last label last label|last labels
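The output above puts the DatasetName first in each joined string, whereas the expected output in the question lists the Target values first and appends the DatasetName only when it is not already among them. A minimal sketch of that ordering follows, assuming the strings are already whitespace-normalized; the names `labels`, `joined` and `result` are introduced here for illustration. It uses a plain dict as an insertion-ordered set rather than a groupby.

```python
import pandas as pd

df = pd.DataFrame({
    'Id': ['xb01', 'nt02', 'tw02', 'dt92', 'tw03', 'we04',
           'er04', 'ew06', 're07', 'ti92'],
    'DatasetName': ['first label', 'second label', 'third label',
                    'fourth label', 'third label', 'third label',
                    'third label', 'fourth label',
                    'first label', 'last label'],
    'Target': ['first label', 'second label', 'the third labels',
               'fourth label set', 'third label',
               'third label', 'third label sets',
               'fourth label sets', 'first label', 'last labels'],
})

# Collect the distinct Target values per DatasetName, keeping first-seen order.
# A dict with None values acts as an insertion-ordered set.
labels = {}
for name, target in zip(df['DatasetName'], df['Target']):
    labels.setdefault(name, {})[target] = None

# Append the DatasetName itself only if it is not already one of the targets.
for name, seen in labels.items():
    seen.setdefault(name, None)

# Join each group's labels and map the result back onto every row.
joined = {name: '|'.join(seen) for name, seen in labels.items()}
result = df.assign(Target=df['DatasetName'].map(joined))
print(result.to_string(index=False))
```

This makes a single pass over the rows followed by one `map`, so it should be workable at around 100,000 rows, though that has not been measured here.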