
PySpark: group and split a dataframe

I am trying to filter and then split a dataset into two separate files.

Dataset: test.txt (Schema: uid, prod, score)

1   XYZ 2.0
2   ABC 0.5
1   PQR 1.0
2   XYZ 2.1
3   PQR 0.5
1   ABC 0.5

First, I want to filter out any uid that has one product or fewer. I have already done that with the following code.

from pyspark.sql.types import *
from pyspark.sql.functions import *

rdd = sc.textFile('test.txt').map(lambda row: row.split('\t'))
schema = StructType([
           StructField('uid', IntegerType(), True),
           StructField('prod', StringType(), True),
           StructField('score', FloatType(), True)])
df = rdd.toDF([f.name for f in schema.fields])
filtered = df.groupby('uid').count().withColumnRenamed("count", "n").filter("n >= 2")
all_data = df.join(filtered, df.uid == filtered.uid , 'inner').drop(filtered.uid).drop(filtered.n)
all_data.show()

This produces the following output:

+----+-----+---+
|prod|score|uid|
+----+-----+---+
| XYZ|  2.0|  1|
| PQR|  1.0|  1|
| ABC|  0.5|  1|
| ABC|  0.5|  2|
| XYZ|  2.1|  2|
+----+-----+---+

I now need to create two files from the above dataframe. What is the best way to take one row for each product (any row will do) and write it to one file (val.txt), with the remaining rows going to another file (train.txt)?

Expected output (train.txt)

1    XYZ    2.0
1    PQR    1.0
2    ABC    0.5

Expected output (val.txt)

1    ABC    0.5
2    XYZ    2.1
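
For reference, the two expected files above are what you get by holding out the last row of each uid group, taking the rows in the order printed by all_data.show(). Below is a minimal plain-Python sketch of that rule (no Spark involved), meant only to pin down the desired result. The names hold_out_last and key_index are hypothetical and used purely for illustration.

```python
# Rows of the filtered dataframe, in the order printed by all_data.show().
rows = [
    (1, "XYZ", 2.0),
    (1, "PQR", 1.0),
    (1, "ABC", 0.5),
    (2, "ABC", 0.5),
    (2, "XYZ", 2.1),
]


def hold_out_last(rows, key_index):
    """Split rows into (train, val); val gets the last row of each key group."""
    last_pos = {}
    for pos, row in enumerate(rows):
        # Later rows overwrite earlier ones, so this ends up as the
        # position of the last row seen for each key.
        last_pos[row[key_index]] = pos
    held_out = set(last_pos.values())
    train = [r for i, r in enumerate(rows) if i not in held_out]
    val = [r for i, r in enumerate(rows) if i in held_out]
    return train, val


# key_index=0 groups by uid; key_index=1 would group by prod instead.
train, val = hold_out_last(rows, key_index=0)
print(train)
print(val)
```

Grouping by prod instead (key_index=1) puts one row per product into val, which gives a different split on this sample.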

Thanks in advance!

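I think the key issue here is that you don't have a primary key for your data, so there is no way to tell rows apart once some of them have been picked. Adding a unique id column fixes that: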

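from pyspark.sql.functions import col, monotonically_increasing_id

# Tag every row with a unique id so individual rows can be told apart.
all_data = all_data.withColumn(
    'id',
    monotonically_increasing_id()
)

# Keep one arbitrary row per product.
train = all_data.dropDuplicates(['prod'])

# could OOM if your dataset is too big
# consider BloomFilter if so
all_id = {row['id'] for row in all_data.select('id').collect()}
train_id = {row['id'] for row in train.select('id').collect()}
val_id = all_id - train_id

# Every row whose id was not kept above.
val = all_data.where(col('id').isin(list(val_id)))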
