Compare two dataframes with different shapes and with a condition in Python

I have two dataframes in Python.

First dataframe: tf_words, shape (1 row, 2235 columns). It looks like this:

     0   1    2     3      4     5      6    ......  2234
0   aa, aaa, aaaa, aaan, aaanu, aada, aadhyam,.....zindabad]

Second dataframe: tf1_bigram, shape (4000, 34319). It contains bigrams and their occurrences in the dataset, and looks like this:

(a, en) (a, ha) (a, padam) (aa, aala) (aa, accountinte) (aa,adhamanaya)...
  1        0         0         1            0                 0        ...
  0        1         0         0            1                 0        ...
  0        0         1         0            0                 1        ...

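I have to compare the tf_words dataframe with the tf1_bigram dataframe. The comparison should work as follows.

For example, the word 'aa' from tf_words matches only one of the two words in each of the columns (aa, aala), (aa, accountinte) and (aa, adhamanaya) of tf1_bigram, but the values in those matched columns should still be multiplied by 0.5.

Then check 'aaa'; if it is found, multiply the matching columns by 0.5.

Then check 'aaaa'; if it is found, multiply the matching columns by 0.5.

Then do the same for 'aaan',

and so on until the last word, 'zindabad' (column number 2234).

So the output tf1_bigram should look like this: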

(a, en) (a, ha) (a, padam) (aa, aala) (aa, accountinte) (aa,adhamanaya)...
  1        0         0         0.5          0                 0        ...
  0        1         0         0            0.5               0        ...
  0        0         1         0            0                 0.5      ...

I tried: tf1_bigram.apply(lambda x: np.multiply(x * 0.5) if x.name in tf_words else x) but the output is not what I expected.

Please help!
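One likely reason the attempt above leaves the frame unchanged: `in` applied to a DataFrame tests its column labels, not its cell values, so `x.name in tf_words` compares each bigram label against the integer labels 0 to 2234. A minimal sketch with a three-word stand-in for tf_words (the toy data is an assumption, not the real frame):

```python
import pandas as pd

# Toy stand-in for tf_words: one row, words stored as cell values.
tf_words = pd.DataFrame([["aa", "aaa", "aaaa"]])

# `in` on a DataFrame tests the column labels (0, 1, 2 here), not the cells.
label_hit = "aa" in tf_words   # False: "aa" is a value, not a column label
index_hit = 0 in tf_words      # True: 0 is a column label

# Pulling the cell values into a set makes the membership test meaningful.
words = set(tf_words.iloc[0])
value_hit = "aa" in words      # True

print(label_hit, index_hit, value_hit)
```

Because the condition is always false, the `np.multiply` branch is never reached and every column is returned as it was.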

Try this

import pandas as pd
table = {
    'a, en':[1,0,0],
    'a, ha':[0,1,0],
    'a, padam':[0,0,1],
    'aa, aala' :[1,0,0],
    'aaa, accountinte':[0,1,0],
    'aaaa,adhamanaya':[0,0,1],
    'aaab,adhamanaya':[0,0,1]
           }
tf1_bigram = pd.DataFrame(table)

table = {0:['aa'], 1:['aaa'], 2:['aaaa'], 3:['aaan'], 4:['aaanu'], 5:['aada'], 6:['aadhyam']}
tf_words  = pd.DataFrame(table)

list_tf_words = tf_words.values.tolist()

print(tf1_bigram)

print(f'\n\n-------------BREAK-----------\n\n')


def func(x):
    for y in list_tf_words[0]:
        if x.name.find(y) != -1:
            return x*0.5
        else:
            pass
    return x

tf1_bigram = tf1_bigram.apply(func, axis = 0) 

print(tf1_bigram)

Output

   a, en  a, ha  a, padam  ...  aaa, accountinte  aaaa,adhamanaya  aaab,adhamanaya
0      1      0         0  ...                 0                0                0
1      0      1         0  ...                 1                0                0
2      0      0         1  ...                 0                1                1

[3 rows x 7 columns]


-------------BREAK-----------


   a, en  a, ha  a, padam  ...  aaa, accountinte  aaaa,adhamanaya  aaab,adhamanaya
0      1      0         0  ...               0.0              0.0              0.0
1      0      1         0  ...               0.5              0.0              0.0
2      0      0         1  ...               0.0              0.5              0.5

[3 rows x 7 columns]
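Note that `x.name.find(y)` is a substring test, so 'aa' also hits any label that merely contains those letters. If whole-word matching is what you want, one possible sketch is to split each label on commas and compare the pieces against a set. The label "kaaa, padam" and the helper name `has_word` are made up for illustration:

```python
import pandas as pd

tf1_bigram = pd.DataFrame({
    "a, en": [1, 0, 0],
    "aa, aala": [1, 0, 0],
    "kaaa, padam": [0, 1, 0],  # hypothetical label: contains "aa" only as a substring
})
words = {"aa", "aaa", "aaaa"}

def has_word(label):
    """Return True if any comma-separated token of the label is exactly in `words`."""
    return any(token.strip() in words for token in label.split(","))

# One boolean per column: True where the label contains a whole-word match.
mask = [has_word(col) for col in tf1_bigram.columns]

result = tf1_bigram.astype(float)
result.loc[:, mask] = result.loc[:, mask] * 0.5
print(result)
```

Here only the "aa, aala" column is halved; "kaaa, padam" is left alone even though it contains "aa" as a substring.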

If you want to multiply by 0.5 once for every matching word, use the code below

import pandas as pd
table = {
    'a, en':[1,0,0],
    'a, ha':[0,1,0],
    'a, padam':[0,0,1],
    'aa, aala' :[1,0,0],
    'aaa, aaanu, accountinte':[0,1,0],
    'aaaa,adhamanaya':[0,0,1]
              }
tf1_bigram = pd.DataFrame(table)

table = {0:['aa'], 1:['aaa'], 2:['aaaa'], 3:['aaan'], 4:['aaanu'], 5:['aada'], 6:['aadhyam']}
tf_words  = pd.DataFrame(table)

list_tf_words = tf_words.values.tolist()

print(tf1_bigram)

print(f'\n\n-------------BREAK-----------\n\n')


def func(x):
    for y in list_tf_words[0]:
        if x.name.find(y) != -1:
            x = x*0.5
        else:
            pass
    return x

tf1_bigram = tf1_bigram.apply(func, axis = 0) 

print(tf1_bigram)

Output

   a, en  a, ha  a, padam  aa, aala  aaa, aaanu, accountinte  aaaa,adhamanaya
0      1      0         0         1                        0                0
1      0      1         0         0                        1                0
2      0      0         1         0                        0                1


-------------BREAK-----------


   a, en  a, ha  a, padam  aa, aala  aaa, aaanu, accountinte  aaaa,adhamanaya
0      1      0         0       0.5                   0.0000            0.000
1      0      1         0       0.0                   0.0625            0.000
2      0      0         1       0.0                   0.0000            0.125
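The repeated halving in func amounts to multiplying each column by 0.5**k, where k is the number of words found in that column's label. A rough sketch of the same arithmetic without `apply`, using the same substring test as above (the names `hits`, `factors` and `scaled` are mine):

```python
import pandas as pd

tf1_bigram = pd.DataFrame({
    "a, en": [1, 0, 0],
    "aa, aala": [1, 0, 0],
    "aaa, aaanu, accountinte": [0, 1, 0],
    "aaaa,adhamanaya": [0, 0, 1],
})
words = ["aa", "aaa", "aaaa", "aaan", "aaanu", "aada", "aadhyam"]

# k = number of words that occur as a substring of each column label
hits = pd.Series({col: sum(w in col for w in words) for col in tf1_bigram.columns})

# One factor per column: 0.5 ** k (k = 0 leaves the column unchanged)
factors = 0.5 ** hits

# Multiply every column by its own factor, aligned on the column labels
scaled = tf1_bigram.mul(factors, axis="columns")
print(scaled)
```

For "aaa, aaanu, accountinte" four words match ('aa', 'aaa', 'aaan', 'aaanu'), giving 0.5**4 = 0.0625, which agrees with the output above.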

Try this if you need to compare the individual words in each column label against tf_words

import pandas as pd
table = {
    'a, en':[1,0,0],
    'a, ha':[0,1,0],
    'a, padam':[0,0,1],
    'aa, aala' :[1,0,0],
    'aaa, accountinte':[0,1,0],
    'aaaa,adhamanaya':[0,0,1],
    'aaab,adhamanaya':[0,0,1]
           }
tf1_bigram = pd.DataFrame(table)

table = {0:['a'], 1:['en'], 2:['aaaa'], 3:['aaan'], 4:['aaanu'], 5:['aada'], 6:['aadhyam']}
tf_words  = pd.DataFrame(table)

list_tf_words = tf_words.values.tolist()

print(tf1_bigram)

print(f'\n\n-------------BREAK-----------\n\n')


def func(x):
    # Split the column label into its two words
    temp = [part.strip() for part in x.name.split(',')]
    # Use "or" instead of "and" if only one of the two words needs to match tf_words
    if temp[0] in list_tf_words[0] and temp[1] in list_tf_words[0]:
        return x*0.5
    return x

tf1_bigram = tf1_bigram.apply(func, axis = 0) 

print(tf1_bigram)

Output

   a, en  a, ha  a, padam  ...  aaa, accountinte  aaaa,adhamanaya  aaab,adhamanaya
0      1      0         0  ...                 0                0                0
1      0      1         0  ...                 1                0                0
2      0      0         1  ...                 0                1                1

[3 rows x 7 columns]


-------------BREAK-----------


   a, en  a, ha  a, padam  ...  aaa, accountinte  aaaa,adhamanaya  aaab,adhamanaya
0    0.5      0         0  ...                 0                0                0
1    0.0      1         0  ...                 1                0                0
2    0.0      0         1  ...                 0                1                1

[3 rows x 7 columns]

Solution for tuple column labels:

import pandas as pd
table = {
    ('a', 'en'):(1,0,0),
    ('a', 'ha'):[0,1,0],
    ('a', 'padam'):[0,0,1],
    ('aa', 'aala') :[1,0,0],
    ('aaa', 'accountinte'):[0,1,0],
    ('aaaa','adhamanaya'):[0,0,1],
    ('aaab','adhamanaya'):[0,0,1]
           }
tf1_bigram = pd.DataFrame(table)

table = {0:['a'], 1:['en'], 2:['aaaa'], 3:['aaan'], 4:['aaanu'], 5:['aada'], 6:['aadhyam']}
tf_words  = pd.DataFrame(table)

list_tf_words = tf_words.values.tolist()

print(tf1_bigram)

print(f'\n\n-------------BREAK-----------\n\n')


def func(x):
    temp = x.name  # the column label is already a (word1, word2) tuple
    # Use "or" instead of "and" if only one of the two words needs to match tf_words
    if temp[0].strip() in list_tf_words[0] and temp[1].strip() in list_tf_words[0]:
        return x*0.5
    else:
        return x
tf1_bigram = tf1_bigram.apply(func, axis = 0) 

print(tf1_bigram)

Output

   a            aa         aaa       aaaa       aaab
  en ha padam aala accountinte adhamanaya adhamanaya
0  1  0     0    1           0          0          0
1  0  1     0    0           1          0          0
2  0  0     1    0           0          1          1


-------------BREAK-----------


     a            aa         aaa       aaaa       aaab
    en ha padam aala accountinte adhamanaya adhamanaya
0  0.5  0     0    1           0          0          0
1  0.0  1     0    0           1          0          0
2  0.0  0     1    0           0          1          1
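With 34319 columns and 2235 words, looking each word up in a list for every column may get slow. A possible alternative, sketched here on toy data, is to put the words in a set, build one scaling factor per column, and multiply the whole frame once. The names `mask`, `factors` and `halved` are mine, and the "both words must match" rule mirrors the `and` condition above:

```python
import numpy as np
import pandas as pd

tf1_bigram = pd.DataFrame({
    ("a", "en"): [1, 0, 0],
    ("a", "ha"): [0, 1, 0],
    ("aa", "aala"): [1, 0, 0],
    ("aaaa", "adhamanaya"): [0, 0, 1],
})
tf_words = pd.DataFrame([["a", "en", "aaaa", "aaan"]])

# Set membership is O(1) on average, unlike scanning a list
words = set(tf_words.iloc[0])

# True when both words of the bigram are in the set; swap all() for any()
# if a single matching word should be enough
mask = [all(w in words for w in bigram) for bigram in tf1_bigram.columns]

# 0.5 for matching columns, 1.0 otherwise; the array lines up with the columns by position
factors = np.where(mask, 0.5, 1.0)
halved = tf1_bigram * factors
print(halved)
```

Only the ('a', 'en') column is halved here, the same column func changes in the output above.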
