Compare two dataframes with different shapes and with condition in python
I have two dataframes in Python.
The first dataframe, tf_words, has shape (1 row, 2235 columns) and looks like this:
0 1 2 3 4 5 6 ...... 2234
0 aa, aaa, aaaa, aaan, aaanu, aada, aadhyam,.....zindabad]
The second dataframe, tf1_bigram, has shape (4000, 34319). It contains bigrams and their occurrences in the dataset, and looks like this:
(a, en) (a, ha) (a, padam) (aa, aala) (aa, accountinte) (aa,adhamanaya)...
1 0 0 1 0 0 ...
0 1 0 0 1 0 ...
0 0 1 0 0 1 ...
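I have to compare the tf_words dataframe with the tf1_bigram dataframe. The comparison should work as follows.
For example, the word 'aa' from tf_words matches only one of the two words in each of the columns (aa, aala), (aa, accountinte) and (aa, adhamanaya) of tf1_bigram, but the values in those matching columns should still be multiplied by 0.5.
Then check 'aaa'; if it is found, multiply the matching columns by 0.5.
Then check 'aaaa'; if it is found, multiply the matching columns by 0.5.
Then do the same for 'aaan'; if it is found, multiply the matching columns by 0.5,
and so on up to the last word, 'zindabad' (column number 2234).
The expected output tf1_bigram would therefore look like this: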
(a, en) (a, ha) (a, padam) (aa, aala) (aa, accountinte) (aa,adhamanaya)...
1 0 0 0.5 0 0 ...
0 1 0 0 0.5 0 ...
0 0 1 0 0 0.5 ...
I tried: tf1_bigram.apply(lambda x: np.multiply(x * 0.5) if x.name in tf_words else x) but the output is not what I expected.
Please help!
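One likely reason the attempted one-liner leaves the data unchanged is that `x.name in tf_words` tests the column labels of tf_words (0, 1, 2, ...) rather than the words stored in its cells, so the `else` branch is always taken. Separately, `np.multiply` expects two operands, so `np.multiply(x * 0.5)` would raise if that branch were ever reached. The sketch below illustrates the membership issue on toy frames; the frame contents and the names `words` and `matches` are assumptions made for the illustration, not the asker's real data.

```python
import pandas as pd

# Toy stand-ins for the frames described in the question (assumed layout).
tf_words = pd.DataFrame([["aa", "aaa", "aaaa"]])
tf1_bigram = pd.DataFrame({("a", "en"): [1, 0], ("aa", "aala"): [1, 0]})

# `in` on a DataFrame looks at the column labels (0, 1, 2 here), not the cells.
print("aa" in tf_words)   # False
print(0 in tf_words)      # True

# Pull the words out of the single row to test against the values instead.
words = set(tf_words.iloc[0])
print("aa" in words)      # True

# Each column name is the whole bigram tuple, so compare its parts to the words.
matches = {name: any(part in words for part in name) for name in tf1_bigram.columns}
print(matches)
```

Collecting the words into a set also keeps each lookup cheap, which may matter with 2235 words and 34319 columns.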
Try this:
import pandas as pd

# Small stand-in for tf1_bigram: the column names are bigram strings
table = {
    'a, en': [1, 0, 0],
    'a, ha': [0, 1, 0],
    'a, padam': [0, 0, 1],
    'aa, aala': [1, 0, 0],
    'aaa, accountinte': [0, 1, 0],
    'aaaa,adhamanaya': [0, 0, 1],
    'aaab,adhamanaya': [0, 0, 1]
}
tf1_bigram = pd.DataFrame(table)

# Small stand-in for tf_words: one row, one word per column
table = {0: ['aa'], 1: ['aaa'], 2: ['aaaa'], 3: ['aaan'], 4: ['aaanu'], 5: ['aada'], 6: ['aadhyam']}
tf_words = pd.DataFrame(table)
list_tf_words = tf_words.values.tolist()

print(tf1_bigram)
print('\n\n-------------BREAK-----------\n\n')

def func(x):
    # Halve the column once if its name contains any of the words as a substring
    for y in list_tf_words[0]:
        if x.name.find(y) != -1:
            return x * 0.5
    return x

tf1_bigram = tf1_bigram.apply(func, axis=0)
print(tf1_bigram)
Output
a, en a, ha a, padam ... aaa, accountinte aaaa,adhamanaya aaab,adhamanaya
0 1 0 0 ... 0 0 0
1 0 1 0 ... 1 0 0
2 0 0 1 ... 0 1 1
[3 rows x 7 columns]
-------------BREAK-----------
a, en a, ha a, padam ... aaa, accountinte aaaa,adhamanaya aaab,adhamanaya
0 1 0 0 ... 0.0 0.0 0.0
1 0 1 0 ... 0.5 0.0 0.0
2 0 0 1 ... 0.0 0.5 0.5
[3 rows x 7 columns]
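For frames as wide as the one in the question, it may be cheaper to compute one scaling factor per column up front and multiply once, rather than calling a function per column through `apply`. The following is a minimal sketch under that assumption; it keeps the same substring matching as the code above (so 'aa' also matches inside 'aaab'), and the names `words`, `mask`, `factors` and `result` are introduced here for illustration only.

```python
import pandas as pd

tf1_bigram = pd.DataFrame({
    "a, en": [1, 0, 0],
    "aa, aala": [1, 0, 0],
    "aaab,adhamanaya": [0, 0, 1],
})
words = ["aa", "aaa", "aaaa"]

# True for every column whose name contains at least one word as a substring.
mask = [any(w in name for w in words) for name in tf1_bigram.columns]

# One factor per column: 0.5 where the name matched, 1.0 otherwise.
factors = pd.Series([0.5 if m else 1.0 for m in mask], index=tf1_bigram.columns)

# Multiply every column by its factor in a single aligned operation.
result = tf1_bigram.mul(factors, axis="columns")
print(result)
```

Because the factors are indexed by column name, pandas aligns them with the frame's columns, so the column order does not have to be tracked by hand.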
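If you want to multiply by 0.5 once for every matching word (so a column matched by several words is halved several times), use the code below: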
import pandas as pd

table = {
    'a, en': [1, 0, 0],
    'a, ha': [0, 1, 0],
    'a, padam': [0, 0, 1],
    'aa, aala': [1, 0, 0],
    'aaa, aaanu, accountinte': [0, 1, 0],
    'aaaa,adhamanaya': [0, 0, 1]
}
tf1_bigram = pd.DataFrame(table)

table = {0: ['aa'], 1: ['aaa'], 2: ['aaaa'], 3: ['aaan'], 4: ['aaanu'], 5: ['aada'], 6: ['aadhyam']}
tf_words = pd.DataFrame(table)
list_tf_words = tf_words.values.tolist()

print(tf1_bigram)
print('\n\n-------------BREAK-----------\n\n')

def func(x):
    # Halve the column again for every word found in its name (no early return)
    for y in list_tf_words[0]:
        if x.name.find(y) != -1:
            x = x * 0.5
    return x

tf1_bigram = tf1_bigram.apply(func, axis=0)
print(tf1_bigram)
Output
a, en a, ha a, padam aa, aala aaa, aaanu, accountinte aaaa,adhamanaya
0 1 0 0 1 0 0
1 0 1 0 0 1 0
2 0 0 1 0 0 1
-------------BREAK-----------
a, en a, ha a, padam aa, aala aaa, aaanu, accountinte aaaa,adhamanaya
0 1 0 0 0.5 0.0000 0.000
1 0 1 0 0.0 0.0625 0.000
2 0 0 1 0.0 0.0000 0.125
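Repeated halving is equivalent to multiplying by 0.5 raised to the number of matching words, which gives a way to check the figures above: 'aaa, aaanu, accountinte' contains 'aa', 'aaa', 'aaan' and 'aaanu', so its factor is 0.5 ** 4 = 0.0625. A small sketch of that formulation follows; the names `words`, `counts` and `result` are assumptions for the example.

```python
import pandas as pd

tf1_bigram = pd.DataFrame({
    "a, en": [1, 0, 0],
    "aa, aala": [1, 0, 0],
    "aaa, aaanu, accountinte": [0, 1, 0],
    "aaaa,adhamanaya": [0, 0, 1],
})
words = ["aa", "aaa", "aaaa", "aaan", "aaanu", "aada", "aadhyam"]

# Number of words that occur as a substring of each column name.
counts = pd.Series(
    [sum(w in name for w in words) for name in tf1_bigram.columns],
    index=tf1_bigram.columns,
)

# Halving k times is the same as multiplying by 0.5 ** k.
result = tf1_bigram.mul(0.5 ** counts, axis="columns")
print(counts.tolist())
print(result)
```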
Try this if you need to compare the individual words in each column name against tf_words:
import pandas as pd

table = {
    'a, en': [1, 0, 0],
    'a, ha': [0, 1, 0],
    'a, padam': [0, 0, 1],
    'aa, aala': [1, 0, 0],
    'aaa, accountinte': [0, 1, 0],
    'aaaa,adhamanaya': [0, 0, 1],
    'aaab,adhamanaya': [0, 0, 1]
}
tf1_bigram = pd.DataFrame(table)

table = {0: ['a'], 1: ['en'], 2: ['aaaa'], 3: ['aaan'], 4: ['aaanu'], 5: ['aada'], 6: ['aadhyam']}
tf_words = pd.DataFrame(table)
list_tf_words = tf_words.values.tolist()

print(tf1_bigram)
print('\n\n-------------BREAK-----------\n\n')

def func(x):
    # Split the column name into its two words
    temp = x.name.split(',')
    # Change "and" to "or" if only one of the two words needs to be in tf_words
    if temp[0].strip() in list_tf_words[0] and temp[1].strip() in list_tf_words[0]:
        return x * 0.5
    return x

tf1_bigram = tf1_bigram.apply(func, axis=0)
print(tf1_bigram)
Output
a, en a, ha a, padam ... aaa, accountinte aaaa,adhamanaya aaab,adhamanaya
0 1 0 0 ... 0 0 0
1 0 1 0 ... 1 0 0
2 0 0 1 ... 0 1 1
[3 rows x 7 columns]
-------------BREAK-----------
a, en a, ha a, padam ... aaa, accountinte aaaa,adhamanaya aaab,adhamanaya
0 0.5 0 0 ... 0 0 0
1 0.0 1 0 ... 1 0 0
2 0.0 0 1 ... 0 1 1
[3 rows x 7 columns]
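The question asks for a column to be scaled even when only one of its two words is in tf_words, which corresponds to the "or" variant mentioned in the code comment above. A hedged sketch that makes this choice a parameter is shown below; `scale_bigrams`, `require` and `factor` are hypothetical names introduced for the example, and the word list is held in a set so that each lookup stays cheap.

```python
import pandas as pd

tf1_bigram = pd.DataFrame({
    "a, en": [1, 0, 0],
    "a, ha": [0, 1, 0],
    "aaaa,adhamanaya": [0, 0, 1],
    "aaab,adhamanaya": [0, 0, 1],
})
words = {"a", "en", "aaaa"}

def scale_bigrams(df, words, require=all, factor=0.5):
    """Scale columns whose comma-separated name parts are found in `words`.

    require=all needs every part to match; require=any needs at least one.
    """
    factors = pd.Series(
        [factor if require(p.strip() in words for p in name.split(",")) else 1.0
         for name in df.columns],
        index=df.columns,
    )
    return df.mul(factors, axis="columns")

both = scale_bigrams(tf1_bigram, words, require=all)
either = scale_bigrams(tf1_bigram, words, require=any)
print(both)
print(either)
```

Unlike the substring test in the earlier answers, this compares whole words, so 'aaaa' in the word set does not match the column 'aaab,adhamanaya'.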
Solution for when the column names are tuples:
import pandas as pd

# Tuple keys give the frame two-level (MultiIndex) columns
table = {
    ('a', 'en'): [1, 0, 0],
    ('a', 'ha'): [0, 1, 0],
    ('a', 'padam'): [0, 0, 1],
    ('aa', 'aala'): [1, 0, 0],
    ('aaa', 'accountinte'): [0, 1, 0],
    ('aaaa', 'adhamanaya'): [0, 0, 1],
    ('aaab', 'adhamanaya'): [0, 0, 1]
}
tf1_bigram = pd.DataFrame(table)

table = {0: ['a'], 1: ['en'], 2: ['aaaa'], 3: ['aaan'], 4: ['aaanu'], 5: ['aada'], 6: ['aadhyam']}
tf_words = pd.DataFrame(table)
list_tf_words = tf_words.values.tolist()

print(tf1_bigram)
print('\n\n-------------BREAK-----------\n\n')

def func(x):
    # x.name is the (first word, second word) tuple of the column
    temp = x.name
    # Change "and" to "or" if only one of the two words needs to be in tf_words
    if temp[0].strip() in list_tf_words[0] and temp[1].strip() in list_tf_words[0]:
        return x * 0.5
    return x

tf1_bigram = tf1_bigram.apply(func, axis=0)
print(tf1_bigram)
Output
a aa aaa aaaa aaab
en ha padam aala accountinte adhamanaya adhamanaya
0 1 0 0 1 0 0 0
1 0 1 0 0 1 0 0
2 0 0 1 0 0 1 1
-------------BREAK-----------
a aa aaa aaaa aaab
en ha padam aala accountinte adhamanaya adhamanaya
0 0.5 0 0 1 0 0 0
1 0.0 1 0 0 1 0 0
2 0.0 0 1 0 0 1 1
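With tuple column names, the two words of each bigram live in separate levels of the column index, so the membership test can be done per level instead of per column. The sketch below assumes that layout; `words`, `first`, `second`, `mask` and `result` are names chosen for the illustration.

```python
import numpy as np
import pandas as pd

tf1_bigram = pd.DataFrame({
    ("a", "en"): [1, 0, 0],
    ("a", "ha"): [0, 1, 0],
    ("aaaa", "adhamanaya"): [0, 0, 1],
})
words = ["a", "en", "aaaa"]

# Boolean arrays, one entry per column, for each half of the bigram.
first = tf1_bigram.columns.get_level_values(0).isin(words)
second = tf1_bigram.columns.get_level_values(1).isin(words)

# Both words must match here; use `first | second` to accept either word.
mask = first & second

# 0.5 for matching columns, 1.0 for the rest, applied column by column.
result = tf1_bigram.mul(np.where(mask, 0.5, 1.0), axis="columns")
print(result)
```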