
Find count of distinct values between two same values in a csv file using pyspark

I'm working with PySpark to process big CSV files (more than 50 GB). I need to find the number of distinct values between two occurrences of the same value. For example:

input dataframe:
+----+
|col1|
+----+
|   a|
|   b|
|   c|
|   c| 
|   a|   
|   b|
|   a|     
+----+


output dataframe:
+----+-----+
|col1|col2 |
+----+-----+
|   a| null|
|   b| null|
|   c| null|
|   c|    0| 
|   a|    2|
|   b|    2|   
|   a|    1| 
+----+-----+

In the expected output, the first occurrence of each value gets null; for a repeated value, col2 is the number of distinct values between it and its previous occurrence (for the last 'a', only 'b' appears in between, so col2 is 1).

I have been struggling with this for the past week. I tried window functions and many other things in Spark, but couldn't get anywhere. It would be a great help if someone knows how to solve this. Thank you.

Please comment if you need any clarification on the question.

I am providing a solution, with some assumptions.

Assuming the previous reference can be found within at most the previous 'n' rows: if 'n' is reasonably small, I think this is a good solution.

I assumed you can find the previous reference within the previous 5 rows.

from pyspark.sql.functions import udf, col, array, lag, monotonically_increasing_id
from pyspark.sql.types import IntegerType
from pyspark.sql.window import Window

def get_distincts(values, current_value):
    # walk through the lagged values (most recent first) until we reach the
    # previous occurrence of current_value, collecting the values in between
    seen = set()
    for v in values:
        if v == current_value:
            return len(seen)
        seen.add(v)
    return None  # no previous occurrence within the lookback window

get_distincts_udf = udf(get_distincts, IntegerType())

df = spark.createDataFrame([["a"], ["b"], ["c"], ["c"], ["a"], ["b"], ["a"]]).toDF("col1")
# You can replace this if you already have a unique, ordered id column
df = df.withColumn("seq_id", monotonically_increasing_id())

window = Window.orderBy("seq_id")
# collect the previous 5 values of col1 into an array column
df = df.withColumn("list", array([lag(col("col1"), i, None).over(window) for i in range(1, 6)]))

df = df.withColumn("col2", get_distincts_udf(col("list"), col("col1"))).drop("seq_id", "list")
df.show()

which results in:

+----+----+
|col1|col2|
+----+----+
|   a|null|
|   b|null|
|   c|null|
|   c|   0|
|   a|   2|
|   b|   2|
|   a|   1|
+----+----+
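
If the previous occurrence can be farther back than 5 rows, the only change needed is the lookback range in the line that builds the lag array. For example, with an assumed variable n holding the maximum lookback you want to allow, that line becomes:

n = 20  # assumed maximum number of rows back to the previous occurrence
df = df.withColumn("list", array([lag(col("col1"), i, None).over(window) for i in range(1, n + 1)]))

A larger n means a wider array column and a more expensive window computation. Also note that Window.orderBy without partitionBy moves all rows into a single partition, which can be a bottleneck on a 50 GB input.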

You can try the following approach:

  1. add an id column using monotonically_increasing_id to keep track of the order of the rows
  2. find prev_id for each col1 and save the result to a new dataframe
  3. for the new DF (alias 'd1'), make a LEFT JOIN to the DF itself (alias 'd2') with the condition (d2.id > d1.prev_id) & (d2.id < d1.id)
  4. then groupBy('d1.col1', 'd1.id') and aggregate with countDistinct('d2.col1')

The code based on the above logic and your sample data is shown below:

from pyspark.sql import functions as F, Window

df1 = spark.createDataFrame([ (i,) for i in list("abccaba")], ["col1"])

# create a WinSpec partitioned by col1 so that we can find the prev_id
win = Window.partitionBy('col1').orderBy('id')

# set up id and prev_id
df11 = df1.withColumn('id', F.monotonically_increasing_id())\
          .withColumn('prev_id', F.lag('id').over(win))

# check the newly added columns
df11.sort('id').show()
# +----+---+-------+
# |col1| id|prev_id|
# +----+---+-------+
# |   a|  0|   null|
# |   b|  1|   null|
# |   c|  2|   null|
# |   c|  3|      2|
# |   a|  4|      0|
# |   b|  5|      1|
# |   a|  6|      4|
# +----+---+-------+

# let's cache the new dataframe
df11.persist()

# do a self-join on id and prev_id and then do the aggregation
df12 = df11.alias('d1') \
           .join(df11.alias('d2')
               , (F.col('d2.id') > F.col('d1.prev_id')) & (F.col('d2.id') < F.col('d1.id')), how='left') \
           .select('d1.col1', 'd1.id', F.col('d2.col1').alias('ids')) \
           .groupBy('col1','id') \
           .agg(F.countDistinct('ids').alias('distinct_values'))

# display the result
df12.sort('id').show()
# +----+---+---------------+
# |col1| id|distinct_values|
# +----+---+---------------+
# |   a|  0|              0|
# |   b|  1|              0|
# |   c|  2|              0|
# |   c|  3|              0|
# |   a|  4|              2|
# |   b|  5|              2| 
# |   a|  6|              1|
# +----+---+---------------+

# release the cached df11
df11.unpersist()

Note that you need to keep this id column to sort the rows; otherwise the resulting rows will come back in an arbitrary order each time you collect them.
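
If you also want the result to match the exact shape of the expected output in the question (null instead of 0 for the first occurrence of each value), one option is to keep prev_id in the grouping and null out the count when there is no previous occurrence. A sketch, assuming df11 from above is still available (df13 is just a name I chose here):

df13 = df11.alias('d1') \
           .join(df11.alias('d2'),
                 (F.col('d2.id') > F.col('d1.prev_id')) & (F.col('d2.id') < F.col('d1.id')),
                 how='left') \
           .groupBy('d1.col1', 'd1.id', 'd1.prev_id') \
           .agg(F.countDistinct('d2.col1').alias('cnt')) \
           .withColumn('col2', F.when(F.col('prev_id').isNotNull(), F.col('cnt'))) \
           .select('col1', 'id', 'col2')

df13.sort('id').show()

This should reproduce the col2 column from the question (null, null, null, 0, 2, 2, 1), since F.when without an otherwise clause returns null for the rows whose prev_id is null.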

reuse_distance = []

block_dict = {}
counter_reuse = 0
counter_stack = 0
stack_list = []

Here a "block" is simply one value (character) read from the csv column, in order.

for block in blocks:  # blocks: the sequence of values read from the csv column
    stack_dist = -1
    reuse_dist = -1
    if block in block_dict:
        # reuse distance: number of accesses since the previous access to this block
        reuse_dist = counter_reuse - block_dict[block] - 1
        block_dict[block] = counter_reuse
        counter_reuse += 1

        # stack distance: number of distinct blocks accessed since the previous access
        stack_dist_ind = stack_list.index(block)
        stack_dist = counter_stack - stack_dist_ind - 1

        # move the block to the top of the LRU-ordered stack
        del stack_list[stack_dist_ind]
        stack_list.append(block)

    else:
        block_dict[block] = counter_reuse
        counter_reuse += 1
        counter_stack += 1
        stack_list.append(block)

    reuse_distance.append([block, stack_dist, reuse_dist])
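
A minimal usage sketch of the above, using the sample sequence from the question (in practice you would read these values from your csv column in order):

blocks = list("abccaba")  # sample data; in practice, the col1 values read from the csv, in order

# run the declarations and loop above with this `blocks`, then inspect the result:
for block, stack_dist, reuse_dist in reuse_distance:
    print(block, stack_dist, reuse_dist)

For this sequence, stack_dist is the count asked for in the question (the number of distinct values since the previous occurrence), reported as -1 instead of null for first occurrences: -1, -1, -1, 0, 2, 2, 1. Note that this is a single-process, pure-Python pass over the data, so unlike the Spark answers above it only helps if the column can be streamed through one machine.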
