How to shuffle a 2D binary matrix while preserving the marginal distributions

Question

Suppose I have an (n*m) binary matrix df, similar to the following:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.binomial(1, .3, size=(6,8)))

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   0   1   1   0
1 | 0   1   0   0   0   0   0   0
2 | 0   0   0   0   1   0   0   0
3 | 0   0   0   0   0   1   0   1
4 | 0   1   1   0   1   0   0   0
5 | 1   0   1   1   1   0   0   1

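I want to shuffle the values in the matrix to create a new_df of the same shape, such that both marginal distributions are unchanged, like the following: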

    0   1   2   3   4   5   6   7
   ------------------------------
0 | 0   0   0   0   1   0   0   1
1 | 0   0   0   0   1   0   0   0
2 | 0   0   0   0   0   0   0   1
3 | 0   1   1   0   0   0   0   0
4 | 1   0   0   0   1   1   0   0
5 | 0   1   1   1   0   1   1   0

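In the new matrix, the sum of each row equals the sum of the corresponding row in the original matrix, and likewise each column in the new matrix has the same sum as the corresponding column in the original matrix.

The solution is easy to check: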

# rows have the same marginal distribution
assert(all(df.sum(axis=1) == new_df.sum(axis=1)))  

# columns have the same marginal distribution
assert(all(df.sum(axis=0) == new_df.sum(axis=0)))
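
The two assertions above can be wrapped in a small helper so the same check works on DataFrames and on plain NumPy arrays. This is only a minimal sketch; `same_marginals` is a hypothetical name chosen for illustration, not a pandas or NumPy function.

```python
import numpy as np

def same_marginals(a, b):
    """Return True if a and b have identical row sums and column sums.

    `same_marginals` is a hypothetical helper name used for illustration.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    if a.shape != b.shape:
        return False
    rows_match = np.array_equal(a.sum(axis=1), b.sum(axis=1))
    cols_match = np.array_equal(a.sum(axis=0), b.sum(axis=0))
    return rows_match and cols_match
```

With this in place, `assert same_marginals(df, new_df)` covers both checks at once.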

If n*m is small, I can use a brute-force approach to the shuffle:

def shuffle_2d(df):
    """Shuffles a multidimensional binary array, preserving marginal distributions"""
    # get a list of indices where the df is 1
    rowlist = []
    collist = []
    for i_row, row in df.iterrows():
        for i_col, val in row.items():  # Series.iteritems() was removed in pandas 2.0
            if val == 1:
                rowlist.append(i_row)
                collist.append(i_col)

    # create an empty df of the same shape
    new_df = pd.DataFrame(index=df.index, columns=df.columns, data=0)

    # shuffle until you get no repeat coordinates 
    # this is so you don't increment the same cell in the matrix twice
    repeats = 999
    while repeats > 1:
        pairs = list(zip(np.random.permutation(rowlist), np.random.permutation(collist)))
        repeats = pd.Series(pairs).value_counts().max()  # top-level pd.value_counts is deprecated

    # populate new data frame at indicated points
    for i_row, i_col in pairs:
        new_df.at[i_row, i_col] += 1

    return new_df

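The problem is that the brute-force approach scales very poorly. (As the line from Indiana Jones and the Last Crusade goes: https://youtu.be/Ubw5N8iVDHI?t=3 )

As a quick demo, for an n*n matrix, the number of attempts needed to get an acceptable shuffle looks like this (from a single run):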

n   attempts
2   1
3   2
4   4
5   1
6   1
7   11
8   9
9   22
10  4416
11  800
12  66
13  234
14  5329
15  26501
16  27555
17  5932
18  668902
...
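
A table like the one above could be produced along the following lines. This is a sketch rather than the script actually used for the question: `count_attempts` and `max_attempts` are illustrative names, and the density of 0.3 is assumed from the example matrix at the top.

```python
import numpy as np

def count_attempts(m, rng, max_attempts=1_000_000):
    """Count brute-force draws until no (row, col) pair repeats.

    Returns the number of attempts, or None if max_attempts is exhausted.
    `count_attempts` and `max_attempts` are illustrative names.
    """
    rows, cols = np.nonzero(np.asarray(m))  # coordinates of every 1
    for attempt in range(1, max_attempts + 1):
        pairs = zip(rng.permutation(rows).tolist(),
                    rng.permutation(cols).tolist())
        if len(set(pairs)) == len(rows):  # all coordinates are distinct
            return attempt
    return None

rng = np.random.default_rng(0)
for n in range(2, 9):
    m = rng.binomial(1, 0.3, size=(n, n))  # density 0.3 is an assumption
    print(n, count_attempts(m, rng))
```

The counts vary a lot from run to run, as the single-run table already suggests, so averaging over several matrices per n would give a steadier picture.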

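Is there a simple solution that preserves the exact marginal distributions (or tells you when no other pattern exists that preserves them)?

As a fallback, I could also use an approximation algorithm that minimizes the sum of squared errors for each row.

Thanks! =)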

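EDIT: For some reason I didn't find the existing answers before writing this question, but after posting it they all showed up in the sidebar:

Is it possible to shuffle a 2D matrix while preserving row AND column frequencies?

Randomize matrix in perl, keeping row and column totals the same

Sometimes all you need to do is ask...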

Answer 1

Thanks mainly to https://stackoverflow.com/a/2137012/6361632 for the inspiration, here is a solution that seems to work:


def flip1(m):
    """
    Chooses a single (i0, j0) location in the matrix to 'flip'
    Then randomly selects a different (i, j) location that creates
    a quad [(i0, j0), (i0, j), (i, j0), (i, j)] in which flipping every
    element leaves the marginal distributions unaltered.  
    Changes those elements, and returns 1.

    If such a quad cannot be completed from the original position, 
    does nothing and returns 0.
    """
    i0 = np.random.randint(m.shape[0])
    j0 = np.random.randint(m.shape[1])

    level = m[i0, j0]
    flip = 0 if level == 1 else 1  # the opposite value

    for i in np.random.permutation(range(m.shape[0])):  # try in random order
        if (i != i0 and  # don't swap with self
            m[i, j0] != level):  # maybe swap with a cell that holds opposite value
            for j in np.random.permutation(range(m.shape[1])):
                if (j != j0 and  # don't swap with self
                    m[i, j] == level and  # check that other swaps work
                    m[i0, j] != level):
                    # make the swaps
                    m[i0, j0] = flip
                    m[i0, j] = level
                    m[i, j0] = level
                    m[i, j] = flip
                    return 1

    return 0

def shuffle(m1, n=100):
    m2 = m1.copy()
    f_success = np.mean([flip1(m2) for _ in range(n)])

    # f_success is the fraction of flip attempts that succeed, for diagnostics
    #print(f_success)

    # check the answer
    assert(all(m1.sum(axis=1) == m2.sum(axis=1)))
    assert(all(m1.sum(axis=0) == m2.sum(axis=0)))

    return m2

We can call it like this:

m1 = np.random.binomial(1, .3, size=(6,8))

array([[0, 0, 0, 1, 1, 0, 0, 1],
       [1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 1, 0, 1, 0, 1],
       [1, 1, 0, 0, 0, 1, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 1, 0, 1, 0, 0, 0]])

m2 = shuffle(m1)

array([[0, 0, 0, 0, 1, 1, 0, 1],
       [1, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1],
       [1, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 1, 0, 0, 0, 1]])

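How many iterations do we need to reach the steady-state distribution? I've set a default of 100 here, which is sufficient for these small matrices.

Below I plot the correlation between the original matrix and the shuffled matrix (over 500 runs) for different numbers of iterations.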

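import matplotlib.pyplot as plt

# illustrative iteration counts; the values used for the original plots are not given
for n_iters in (5, 25, 100):
    corrs = []
    for _ in range(500):
        m1 = np.random.binomial(1, .3, size=(9,9)) # create starting matrix
        m2 = shuffle(m1, n_iters)
        corrs.append(np.corrcoef(m1.flatten(), m2.flatten())[1,0])
    plt.hist(corrs, bins=40, alpha=.4, label=n_iters)
plt.legend()
plt.show()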

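For a 9x9 matrix, we see improvement up to about 25 iterations, beyond which we are in a steady state.

For an 18x18 matrix, we see small gains going from 100 to 250 iterations, but not much beyond that.

Note that for larger matrices the correlation between the starting and ending distributions is lower, but it takes longer to get there.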
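
If a single number per setting is easier to compare than overlaid histograms, the same experiment can be summarized as a mean correlation. The sketch below is an assumption-laden illustration: `mean_correlation` and its parameters are hypothetical names, and it accepts any function with the signature `shuffle_fn(matrix, n_iters)`, such as the `shuffle` defined in this answer.

```python
import numpy as np

def mean_correlation(shuffle_fn, shape, n_iters, reps=200, p=0.3, seed=0):
    """Average correlation between random matrices and their shuffles.

    shuffle_fn(matrix, n_iters) must return a shuffled copy. All names and
    defaults here are illustrative, not taken from the answer.
    """
    rng = np.random.default_rng(seed)
    corrs = []
    for _ in range(reps):
        m1 = rng.binomial(1, p, size=shape)  # random starting matrix
        m2 = shuffle_fn(m1, n_iters)
        # corrcoef returns NaN if a matrix is constant (all 0s or all 1s)
        corrs.append(np.corrcoef(m1.ravel(), m2.ravel())[0, 1])
    return float(np.mean(corrs))
```

Calling, for example, `mean_correlation(shuffle, (9, 9), n)` for increasing `n` and watching where the value stops dropping gives a rough indication of when further iterations no longer help.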

Answer 2

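You have to look for two rows and two columns whose intersections give a 2x2 matrix with 1 0 on top and 0 1 on the bottom (or the opposite). These values can be switched (to 0 1 and 1 0).

There is even an algorithm for sampling from all possible matrices with the same marginals, developed by Verhelst (2008, link to the article page) and implemented in the R package RaschSampler.

A more recent algorithm by Wang (2020, link), which is more efficient in some cases, is also available.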
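
The swap described above can be sketched in NumPy as follows. This is an illustration, not the RaschSampler implementation, and `has_checkerboard` and `swap_shuffle` are hypothetical names. The first function also addresses the "no other pattern" part of the question: by Ryser's interchange theorem, any two 0/1 matrices with the same marginals are connected by a sequence of such swaps, so if no swappable 2x2 block exists, the matrix is the only one with those marginals.

```python
import numpy as np

def has_checkerboard(m):
    """True if some 2x2 submatrix is [[1, 0], [0, 1]] or [[0, 1], [1, 0]]."""
    m = np.asarray(m).astype(int)
    # a[i, k] = number of columns where row i holds 1 and row k holds 0
    a = m @ (1 - m).T
    return bool(np.any((a > 0) & (a.T > 0)))

def swap_shuffle(m, n_trials, rng):
    """Attempt n_trials random 2x2 checkerboard swaps on a copy of m."""
    out = np.array(m, dtype=int)  # np.array copies by default
    n_rows, n_cols = out.shape
    if n_rows < 2 or n_cols < 2:
        return out
    for _ in range(n_trials):
        rows = rng.choice(n_rows, size=2, replace=False)
        cols = rng.choice(n_cols, size=2, replace=False)
        block = np.ix_(rows, cols)
        sub = out[block]
        is_checkerboard = (sub[0, 0] == sub[1, 1]
                           and sub[0, 1] == sub[1, 0]
                           and sub[0, 0] != sub[0, 1])
        if is_checkerboard:
            out[block] = 1 - sub  # exchange the diagonal and the anti-diagonal
        # a failed trial leaves the matrix unchanged but still counts as a step
    return out
```

A call might look like `swap_shuffle(m1, 1000, np.random.default_rng())`. Counting rejected trials as steps keeps the proposal symmetric, which is the usual argument for such a chain settling on a uniform distribution over the matrices with these marginals; how many trials are enough for a given matrix size is not settled by this sketch.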

Answer 1: 1, accepted, 2020-06-04 20:05:13

Answer 2: 0, 2021-11-13 02:16:52
