Given the following DataFrame, we need to pivot the my_column values into separate columns and, within each new column, sort the (some_id, int_column) pairs by int_column in ascending order. The example:
+---------+-----------+------------+
| some_id | my_column | int_column |
+---------+-----------+------------+
| xx1     | id_1      | 3          |
| xx1     | id_2      | 4          |
| xx1     | id_3      | 5          |
| xx2     | id_1      | 6          |
| xx2     | id_2      | 1          |
| xx2     | id_3      | 3          |
| xx3     | id_1      | 4          |
| xx3     | id_2      | 8          |
| xx3     | id_3      | 9          |
| xx4     | id_1      | 1          |
+---------+-----------+------------+
Expected output:
+----------+----------+----------+
| id_1     | id_2     | id_3     |
+----------+----------+----------+
| [xx4, 1] | [xx2, 1] | [xx2, 3] |
| [xx1, 3] | [xx1, 4] | [xx1, 5] |
| [xx3, 4] | [xx3, 8] | [xx3, 9] |
| [xx2, 6] | null     | null     |
+----------+----------+----------+
As you can see, for id_1 the lowest number in int_column is 1, in the last row of the DataFrame, and it belongs to xx4 in the some_id column. The next values are 3, 4, and 6, belonging to xx1, xx3, and xx2 respectively.
Any pointers on how to approach this problem? Either PySpark or Pandas can be used.
Code to reproduce the input dataframe:
import pandas as pd
data = {'some_id': ['xx1', 'xx1', 'xx1', 'xx2', 'xx2', 'xx2', 'xx3', 'xx3', 'xx3', 'xx4'],
        'my_column': ['id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1', 'id_2', 'id_3', 'id_1'],
        'int_column': [3, 4, 5, 6, 1, 3, 4, 8, 9, 1]}
df = pd.DataFrame.from_dict(data)
We need a helper key, created with cumcount, and then we use groupby + apply. This step works like pivot; pivot_table or crosstab could be used instead. Without sorting first, the rows come out in the original row order:
df=df.assign(key=df.groupby('my_column').cumcount())
df.groupby(['key','my_column']).apply(lambda x : list(zip(x['some_id'],x['int_column']))[0]).unstack()
Out[378]:
my_column id_1 id_2 id_3
key
0 (xx1, 3) (xx1, 4) (xx1, 5)
1 (xx2, 6) (xx2, 1) (xx2, 3)
2 (xx3, 4) (xx3, 8) (xx3, 9)
3 (xx4, 1) None None
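To see what the helper key contains, here is a small sketch on the example data. The variable names key and sorted_key are illustrative and not part of the answer. cumcount numbers the rows of each my_column group in order of appearance, so the key only follows int_column if the frame is sorted first.

```python
import pandas as pd

df = pd.DataFrame({
    "some_id": ["xx1", "xx1", "xx1", "xx2", "xx2", "xx2", "xx3", "xx3", "xx3", "xx4"],
    "my_column": ["id_1", "id_2", "id_3", "id_1", "id_2", "id_3", "id_1", "id_2", "id_3", "id_1"],
    "int_column": [3, 4, 5, 6, 1, 3, 4, 8, 9, 1],
})

# On the unsorted frame, the key reflects row order within each group.
key = df.groupby("my_column").cumcount()
print(key.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]

# On a frame sorted by int_column, the key becomes the rank within each group.
# mergesort is a stable sort, so ties keep their original relative order.
sorted_key = (
    df.sort_values("int_column", kind="mergesort")
      .groupby("my_column")
      .cumcount()
)
# sort_index() puts the ranks back in the original row order for comparison.
print(sorted_key.sort_index().tolist())  # [1, 1, 1, 3, 0, 0, 2, 2, 2, 0]
```

The second list shows, for example, that the last row (xx4, id_1, 1) gets key 0 once the frame is sorted, which is what places it in the first output row.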
To get the sorted result, use sort_values + pivot:
# the lambda receives the sorted frame, so cumcount follows int_column order
df=df.sort_values('int_column').assign(key=lambda d: d.groupby('my_column').cumcount())
df['Value']=list(zip(df['some_id'],df['int_column']))
s=df.pivot(index='key',columns='my_column',values='Value')
s
Out[397]:
my_column id_1 id_2 id_3
key
0 (xx4, 1) (xx2, 1) (xx2, 3)
1 (xx1, 3) (xx1, 4) (xx1, 5)
2 (xx3, 4) (xx3, 8) (xx3, 9)
3 (xx2, 6) None None
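The sort, key, and pivot steps can also be wrapped in one reusable function. This is a minimal sketch, not the answerer's code: the function name pivot_by_rank, its parameters, and the intermediate pair column are hypothetical names chosen for illustration.

```python
import pandas as pd


def pivot_by_rank(frame, id_col="some_id", group_col="my_column", value_col="int_column"):
    """Pivot group_col into columns, ordering each column by value_col ascending."""
    # mergesort is stable, so rows with equal value_col keep their original order.
    ordered = frame.sort_values(value_col, kind="mergesort")
    ordered = ordered.assign(
        # Rank of each row within its group, computed on the sorted frame.
        key=ordered.groupby(group_col).cumcount(),
        # One (id, value) tuple per row, used as the cell content after pivoting.
        pair=list(zip(ordered[id_col], ordered[value_col])),
    )
    return ordered.pivot(index="key", columns=group_col, values="pair")


df = pd.DataFrame({
    "some_id": ["xx1", "xx1", "xx1", "xx2", "xx2", "xx2", "xx3", "xx3", "xx3", "xx4"],
    "my_column": ["id_1", "id_2", "id_3", "id_1", "id_2", "id_3", "id_1", "id_2", "id_3", "id_1"],
    "int_column": [3, 4, 5, 6, 1, 3, 4, 8, 9, 1],
})

result = pivot_by_rank(df)
print(result)
```

Groups with fewer rows than the largest group are padded with missing values, as in the last row of the output above.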
Here's a solution in PySpark, assuming df is a Spark DataFrame with the same columns. First define a Window that partitions by my_column and orders by int_column. We then define an ordering using pyspark.sql.functions.row_number() over this window.
from pyspark.sql import Window
import pyspark.sql.functions as f
w = Window.partitionBy("my_column").orderBy("int_column")
df.withColumn("order", f.row_number().over(w)).sort("order").show()
#+-------+---------+----------+-----+
#|some_id|my_column|int_column|order|
#+-------+---------+----------+-----+
#| xx4| id_1| 1| 1|
#| xx2| id_2| 1| 1|
#| xx2| id_3| 3| 1|
#| xx1| id_2| 4| 2|
#| xx1| id_1| 3| 2|
#| xx1| id_3| 5| 2|
#| xx3| id_2| 8| 3|
#| xx3| id_3| 9| 3|
#| xx3| id_1| 4| 3|
#| xx2| id_1| 6| 4|
#+-------+---------+----------+-----+
Notice that (xx4, 1) is in the first row after sorting by order, as you explained.
Now you can group by order and pivot the dataframe on my_column. This requires an aggregate function, so I will use pyspark.sql.functions.first() because I am assuming there is only one (some_id, int_column) pair per order within each my_column value. Then simply sort by order and drop that column to get the desired output:
df.withColumn("order", f.row_number().over(w))\
.groupBy("order")\
.pivot("my_column")\
.agg(f.first(f.array([f.col("some_id"), f.col("int_column")])))\
.sort("order")\
.drop("order")\
.show(truncate=False)
#+--------+--------+--------+
#|id_1 |id_2 |id_3 |
#+--------+--------+--------+
#|[xx4, 1]|[xx2, 1]|[xx2, 3]|
#|[xx1, 3]|[xx1, 4]|[xx1, 5]|
#|[xx3, 4]|[xx3, 8]|[xx3, 9]|
#|[xx2, 6]|null |null |
#+--------+--------+--------+
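For readers comparing the two approaches, the per-partition numbering that row_number() produces can be approximated in pandas with groupby + rank(method="first"). This is a sketch under assumptions: pdf is a hypothetical name for a pandas copy of the example data, and ties are broken by order of appearance here, whereas Spark does not guarantee a particular order for rows that tie on int_column.

```python
import pandas as pd

pdf = pd.DataFrame({
    "some_id": ["xx1", "xx1", "xx1", "xx2", "xx2", "xx2", "xx3", "xx3", "xx3", "xx4"],
    "my_column": ["id_1", "id_2", "id_3", "id_1", "id_2", "id_3", "id_1", "id_2", "id_3", "id_1"],
    "int_column": [3, 4, 5, 6, 1, 3, 4, 8, 9, 1],
})

# rank(method="first") numbers rows 1, 2, 3, ... within each my_column group
# by ascending int_column; the result is float, so cast it to int.
pdf["order"] = (
    pdf.groupby("my_column")["int_column"].rank(method="first").astype(int)
)

print(pdf.sort_values(["order", "my_column"]).to_string(index=False))
```

On this data the order column matches the one shown in the PySpark output, for example order 1 for (xx4, id_1) and order 4 for (xx2, id_1).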