
PySpark MapReduce - how to get the number of occurrences in a list of tuples

I have a list like:

A 2022-08-13
B 2022-08-14
B 2022-08-13
A 2022-05-04
B 2022-05-04
C 2022-08-14
...

and I applied the following map function to pair each row with a count of 1:

map(lambda x: ((x.split(',')[0], x.split(',')[1]), 1))

To get this:

[
    (('A', '2022-08-13'), 1), 
    (('B', '2022-08-14'), 1), 
    (('B', '2022-08-13'), 1), 
    (('A', '2022-05-04'), 1),
    (('B', '2022-05-04'), 1),  
    (('C', '2022-08-14'), 1),
    ...
]

My end goal is to count, for each pair of persons (denoted by the letters), the number of dates they share, to output something like this for the example above:

[
    ('A', 'B', 2),
    ('B', 'C', 1),
    ...
]

This is my code so far, but the reduceByKey is not working as expected:

shifts_mapped = worker_shifts.map(lambda x: (x.split(',')[1], 1))
shifts_mapped = worker_shifts.map(lambda x: ((x.split(',')[0], x.split(',')[1]), 1))
count = shifts_mapped.reduceByKey(lambda x, y: x[0][1] + y[0][1])
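
As an aside, reduceByKey passes only the values for a matching key to the reduce function, so x and y above are the integer counts, not the full ((person, date), count) tuples, and x[0][1] tries to index into an integer. A conventional counting reducer simply adds the values:

# x and y are the per-key counts (plain ints), so just sum them
count = shifts_mapped.reduceByKey(lambda x, y: x + y)

This yields per-(person, date) counts, but the pairwise result still needs the extra steps shown in the answers below.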
One approach using pandas:

  1. Create a DataFrame df from the input, where the first column is named 'ID' and the second 'DATE'.
  2. Then join the DataFrame with itself on the 'DATE' column, producing renamed ID columns 'ID1' and 'ID2'.
  3. Then remove extraneous rows, i.e. rows where column 'ID1' equals or is greater than column 'ID2'.
  4. Then create a dictionary whose keys are the unique ('ID1', 'ID2') pairs and whose value is the set of all dates for that pair. A set is used so duplicate input entries are not counted.
  5. Finally, for each key in the dictionary, the length of its value is the number of shared dates.
import pandas as pd

lines = """A 2022-08-13
B 2022-08-14
B 2022-08-13
A 2022-05-04
B 2022-05-04
C 2022-08-14"""

lines = [line.split(' ') for line in lines.split('\n')]
df = pd.DataFrame(lines, columns=['ID', 'DATE'])

# Join the dataframe with itself on DATE; the overlapping 'ID' column
# gets suffixes '1' and '2', giving columns ID1, DATE, ID2:
df_join = df.join(df.set_index('DATE'), on='DATE', lsuffix='1', rsuffix='2')

# Get rid of rows where the two id values are equal or out of order.
# In other words, if we have ('A', 'A'), ('A', 'B') and ('B', 'A'),
# we keep only ('A', 'B')
df_join = df_join[df_join['ID1'] < df_join['ID2']].reset_index(drop=True)
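# df_join now has one row per co-occurrence: (ID1, DATE, ID2) with ID1 < ID2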

d = {}
for idx, row in df_join.iterrows():
    key = (row['ID1'], row['ID2'])
    # Use a set in case of duplicate entries
    if key not in d:
        d[key] = set()
    d[key].add(row['DATE'])
results = [(k[0], k[1], len(v)) for k, v in d.items()]
print(results)

Prints:

[('A', 'B', 2), ('B', 'C', 1)]

Here is another attempt using the RDD API, as requested in the edited question. The logic is documented in the comments before each transformation.

# create a sample RDD from the extended dataset
data = [["A","2022-08-13"],["E","2022-08-13"],["D","2022-08-13"],["B","2022-08-14"],["B","2022-08-13"],["D","2022-05-04"],["E","2022-05-04"],["A","2022-05-04"],["B","2022-05-04"],["C","2022-08-14"]]
rdd = spark.sparkContext.parallelize(data)

# group by ("person", "date") and count
rdd = rdd.map(lambda x: ((x[0], x[1]), 1)).groupByKey().mapValues(len).map(lambda x: (x[0][0], x[0][1], x[1]))
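# -> tuples like ('A', '2022-08-13', 1)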

# group by ("date", "count") and collect "person" as list
rdd = rdd.map(lambda x: ((x[1], x[2]), x[0])).groupByKey().mapValues(list).map(lambda x: (x[0][0], x[0][1], x[1]))
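# -> tuples like ('2022-08-13', 1, ['A', 'E', 'D', 'B'])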

# create pair combinations
import itertools
rdd = rdd.map(lambda x: ((x[0], x[1]), list(itertools.combinations(sorted(x[2]), 2)))).flatMapValues(lambda x: x)
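# -> pairs like (('2022-08-13', 1), ('A', 'B'))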

# group by pair, count, and split pair to individual person columns
rdd = rdd.map(lambda x: ((x[1][0], x[1][1]), 1)).groupByKey().mapValues(len).map(lambda x: (x[0][0], x[0][1], x[1]))
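# -> tuples like ('A', 'B', 2)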

print(rdd.collect())

Output:

[
    ('A', 'B', 2), 
    ('A', 'D', 2), 
    ('A', 'E', 2), 
    ('B', 'D', 2), 
    ('B', 'E', 2), 
    ('D', 'E', 2), 
    ('B', 'C', 1)
]

Group by multiple times: first by "person" and "date", then by "date" and "count", collecting the persons that share the same date and count.

Then generate the pair combinations, explode them, and split each pair into separate person columns.

I extended your sample dataset to include persons "D" and "E" with the same dates as "A" and "B", to generate more combinations.

from pyspark.sql import functions as F

df = spark.createDataFrame(data=[["A","2022-08-13"],["E","2022-08-13"],["D","2022-08-13"],["B","2022-08-14"],["B","2022-08-13"],["D","2022-05-04"],["E","2022-05-04"],["A","2022-05-04"],["B","2022-05-04"],["C","2022-08-14"]], schema=["person", "date"])

df = df.groupBy("person", "date").count()

df = df.groupBy("date", "count") \
       .agg(F.collect_list("person").alias("persons"))

@F.udf(returnType="array<struct<col1:string, col2:string>>")
def combinations(arr): 
  import itertools
  return list(itertools.combinations(sorted(arr), 2))

df = df.withColumn("persons", combinations("persons"))

df = df.withColumn("persons", F.explode("persons"))

df = df.withColumn("person_1", F.col("persons").getField("col1")) \
       .withColumn("person_2", F.col("persons").getField("col2"))

df = df.groupBy("person_1", "person_2").count()

Output:

+--------+--------+-----+
|person_1|person_2|count|
+--------+--------+-----+
|B       |C       |1    |
|D       |E       |2    |
|A       |E       |2    |
|A       |D       |2    |
|B       |D       |2    |
|A       |B       |2    |
|B       |E       |2    |
+--------+--------+-----+
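
If you need the result back as a plain list of tuples, as in the question, you can collect the final DataFrame (a small sketch; df here is the grouped DataFrame from the last step above):

# Row objects are tuple subclasses, so each collected row converts directly
result = [tuple(row) for row in df.collect()]
print(result)  # e.g. [('B', 'C', 1), ('D', 'E', 2), ...]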
