简体   繁体   中英

How to randomize two String collections in the same fashion in distributed system?

I have several collections of strings (~100k elements). They are initialized from the same network source:

List<String> source = Arrays.asList("cat Tom", "dog Woof", "bird Carry", "fish Silent", "mice Dr Black");
List<String> list1 = new ArrayList<>(source);
List<String> list2 = new ArrayList<>(source); 
List<String> list3 = new ArrayList<>(source);

During their life they could be modified independently:

list1.add("raccoon George");
list2.set(2, "bird Frank"); // Different bird
list3.remove(2); // No birds!

These collections live in different JVMs, but have the same (shared) seed:

String seed = UUID.randomUUID().toString();

I need some sort/shuffle method (or Comparator ) to randomize these collections:

magicSort(list1, seed); // JVM #1
magicSort(list2, seed); // JVM #2
magicSort(list3, seed); // JVM #3

To receive consistent/repeatable output like this:

[mice Dr Black, bird Carry, raccoon George, cat Tom, dog Woof, fish Silent]
[mice Dr Black, bird Frank, cat Tom, dog Woof, fish Silent]
[mice Dr Black, cat Tom, dog Woof, fish Silent]

This doesn't work:

  • Collections.shuffle() from this answer - because it does not provide affinity for items with the same prefix;
  • keep randomized order in Map<String, UUID> - because of (a) collection's dynamic nature and (b) distributed setup.

Any help will be appreciated

Edit #1

This doesn't work:

  • sha1 , md5 and similar hash algorithms - because they do not provide affinity for items with the same prefix - bird Carry and bird Frank could be placed far from each other.

Edit #2

Additional condition: items with the same prefix should be grouped together (affinity), eg all birds are placed between mices and raccoons.

Many thanks for hashing hint in comments!

I've ended up with re-mapping characters order through pseudo-random hashing (similar to Fisher–Yates shuffle). The rest code was copy-pasted from String.compareTo :

private void magicSort(List<String> list, String seed) {
    list.sort(new ShuffleComparator(seed));
}

public class ShuffleComparator implements Comparator<String> {
    private static final long MAGIC = 0x5DEECE66DL;
    private final String seed;

    public ShuffleComparator(String seed) {
        this.seed = seed;
    }

    @Override
    public int compare(String s1, String s2) {
        int len1 = s1.length();
        int len2 = s2.length();
        int lim = Math.min(len1, len2);

        int seedLen = seed.length();
        long random = seed.hashCode();

        for (int k = 0; k < lim; k++) {
            random = random * MAGIC + seed.charAt(k % seedLen);
            char c1 = s1.charAt(k);
            char c2 = s2.charAt(k);
            if (c1 != c2) {
                return ((random % (c1 + 0xFFFF)) & 0xFFFF) -
                    ((random % (c2 + 0xFFFF)) & 0xFFFF) > 0 ? 1 : -1;
            }
        }
        return (random > 0 ? 1 : -1) * (len1 - len2);
    }
}

Result:

  • You can keep collections shuffle-sorted at all times
  • Items with identical prefix are grouped together (affinity)
  • The only resource (string seed) is shared between JVMs
  • Negligible impact on performance
  • Zero GC

Examples:

seed: b85a9119-3a98-4491-b503-fc2cfdc344f3
[raccoon George, bird Carry, mice Dr Black, fish Silent, cat Tom, dog Woof]
[bird Frank, mice Dr Black, fish Silent, cat Tom, dog Woof]
[mice Dr Black, fish Silent, cat Tom, dog Woof]

seed: d5378421-d0ea-4b17-80e9-1de6a163bf38
[bird Carry, dog Woof, fish Silent, raccoon George, mice Dr Black, cat Tom]
[bird Frank, dog Woof, fish Silent, mice Dr Black, cat Tom]
[dog Woof, fish Silent, mice Dr Black, cat Tom]

seed: a5f94f60-1207-4f83-9723-bbab5e3b8075
[cat Tom, raccoon George, mice Dr Black, bird Carry, dog Woof, fish Silent]
[cat Tom, mice Dr Black, bird Frank, dog Woof, fish Silent]
[cat Tom, mice Dr Black, dog Woof, fish Silent]

seed: 4b77a61f-c639-42fc-96b2-e1add9f359e9
[bird Carry, raccoon George, cat Tom, mice Dr Black, dog Woof, fish Silent]
[bird Frank, cat Tom, mice Dr Black, dog Woof, fish Silent]
[cat Tom, mice Dr Black, dog Woof, fish Silent]

seed: 2dd78e21-d4ad-4e8a-b139-5d35fe2b0481
[dog Woof, fish Silent, raccoon George, mice Dr Black, bird Carry, cat Tom]
[dog Woof, fish Silent, mice Dr Black, bird Frank, cat Tom]
[dog Woof, fish Silent, mice Dr Black, cat Tom]

Edit #1

For tests I've also implemented hash-based comparator, so would publish it here. Just in case if you don't need affinity (like in my case), but need pure shuffle:

private void magicSort(List<String> list, String seed) {
    list.sort(new HashComparator(seed, "SHA-1"));
}

public class HashComparator implements Comparator<String> {
    private static final long MAGIC = 0x5DEECE66DL;
    private final ThreadLocal<MessageDigest> messageDigest;
    private final byte[] seed;
    private final int seedHash;

    public HashComparator(String seed, String algorithm) {
        this.seed = seed.getBytes(ISO_8859_1);
        this.seedHash = seed.hashCode();
        this.messageDigest = ThreadLocal.withInitial(() -> {
            try {
                return MessageDigest.getInstance(algorithm);
            } catch (NoSuchAlgorithmException e) {
                throw new RuntimeException(e.getMessage(), e);
            }
        });
    }

    @Override
    public int compare(String s1, String s2) {
        if (s1.equals(s2)) {
            return 0;
        }
        int diff = getSignature(s1).compareTo(getSignature(s2));
        if (diff != 0) {
            return diff;
        }
        long random = seedHash;
        random = random * MAGIC + s1.hashCode();
        random = random * MAGIC + s2.hashCode();
        return ((random & 1) == 0 ? 1 : -1) * s1.compareTo(s2);

    }

    private ByteBuffer getSignature(String s) {
        MessageDigest digest = messageDigest.get();
        digest.reset();
        digest.update(seed);
        for (int i = 0, size = s.length(); i < size; i++) {
            char c = s.charAt(i);
            digest.update((byte) (c >> 8));
            digest.update((byte) c);
        }
        return ByteBuffer.wrap(digest.digest());
    }
}

Result:

  • You can keep collections shuffle-sorted at all times
  • Pure shuffle (no affinity for items with identical prefix)
  • The only resource (string seed) is shared between JVMs

Examples:

8d465fde-9310-4b6c-9709-5cc395deb45e
[raccoon George, mice Dr Black, dog Woof, fish Silent, cat Tom, bird Carry]
[bird Frank, mice Dr Black, dog Woof, fish Silent, cat Tom]
[mice Dr Black, dog Woof, fish Silent, cat Tom]

6daa90dd-f7e7-4615-a45c-fefb0de10c20
[bird Carry, mice Dr Black, fish Silent, cat Tom, dog Woof, raccoon George]
[mice Dr Black, bird Frank, fish Silent, cat Tom, dog Woof]
[mice Dr Black, fish Silent, cat Tom, dog Woof]

9f30b259-c8d1-41f5-8221-40b17cb04260
[cat Tom, mice Dr Black, raccoon George, dog Woof, fish Silent, bird Carry]
[bird Frank, cat Tom, mice Dr Black, dog Woof, fish Silent]
[cat Tom, mice Dr Black, dog Woof, fish Silent]

94b6710f-3232-453d-8009-6d81658b2cca
[dog Woof, fish Silent, bird Carry, cat Tom, mice Dr Black, raccoon George]
[dog Woof, fish Silent, cat Tom, mice Dr Black, bird Frank]
[dog Woof, fish Silent, cat Tom, mice Dr Black]

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM