Column combinations using PySpark or Pandas based on conditions
I want to generate combinations of the following 3 lists such that beds >= baths >= cars, with the combinations then projected into 3 dataframe columns. How can I achieve this?
beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]
Desired result:
nbed nbath ncar
0 1.0 1.0 0
1 1.0 1.0 1
2 2.0 1.0 0
3 2.0 2.0 0
4 2.0 1.0 1
5 2.0 2.0 1
6 2.0 1.0 2
7 2.0 2.0 2
8 3.0 1.0 0
9 3.0 2.0 0
10 3.0 3.0 0
11 3.0 1.0 1
12 3.0 2.0 1
13 3.0 3.0 1
14 3.0 1.0 2
15 3.0 2.0 2
16 3.0 3.0 2
17 3.0 1.0 3
18 3.0 2.0 3
19 3.0 3.0 3
You can do a simple cross join and then query:
import pandas as pd

df1, df2, df3 = [pd.DataFrame({name: lst}) for lst, name in zip([beds, baths, cars], ['bed', 'bath', 'beyond'])]
df_123 = df1.join(df2, how='cross').join(df3, how='cross')
df_123.query("bed >= bath >= beyond").reset_index(drop=True)
# bed bath beyond
#0 1.0 1.0 0
#1 1.0 1.0 1
#2 2.0 1.0 0
#3 2.0 1.0 1
#4 2.0 2.0 0
#.. ... ... ...
#151 8.0 8.0 4
#152 8.0 8.0 5
#153 8.0 8.0 6
#154 8.0 8.0 7
#155 8.0 8.0 8
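As a quick sanity check (my addition, not part of the answer above), the 156-row count can be reproduced with a plain itertools triple product over the same lists:

```python
import itertools

beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]

# Keep only the triples satisfying beds >= baths >= cars.
kept = [t for t in itertools.product(beds, baths, cars) if t[0] >= t[1] >= t[2]]
print(len(kept))  # 156, matching the query result above
```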
One option is conditional_join from pyjanitor, which is usually more efficient than a cartesian join for such inequality (non-equi) joins:
# pip install pyjanitor
import janitor
import pandas as pd
# convert them to integer type,
# to make all data types uniform
beds = pd.DataFrame({'beds': beds}, dtype = int)
cars = pd.Series(cars, name = 'cars', dtype = int)
baths = pd.Series(baths, name = 'baths', dtype = int)
(beds
.conditional_join(
baths,
("beds", "baths", ">="))
.conditional_join(
cars,
("baths", "cars", ">="))
)
beds baths cars
0 1 1 0
1 1 1 1
2 2 1 0
3 2 1 1
4 2 2 0
.. ... ... ...
151 8 8 4
152 8 8 5
153 8 8 6
154 8 8 7
155 8 8 8
[156 rows x 3 columns]
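If pyjanitor is not available, a NumPy broadcasting sketch (my addition, under the same integer ranges as above) builds the full grid and filters it with a boolean mask:

```python
import numpy as np

# Full 8 x 8 x 9 grid of (beds, baths, cars) values.
b, ba, c = np.meshgrid(np.arange(1, 9), np.arange(1, 9),
                       np.arange(0, 9), indexing='ij')
mask = (b >= ba) & (ba >= c)  # beds >= baths >= cars
out = np.column_stack([b[mask], ba[mask], c[mask]])
print(out.shape)  # (156, 3)
```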
PySpark. It seems easy to create combinations using Python's itertools.product. You can provide the result to spark.createDataFrame and then use your filter.
Input:
import itertools
beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]
Script:
df = spark.createDataFrame(
[e for e in itertools.product(*[beds, baths, cars])],
['nbed', 'nbath', 'ncar']
)
df = df.filter("nbed >= nbath and nbath >= ncar")
df.show()
# +----+-----+----+
# |nbed|nbath|ncar|
# +----+-----+----+
# | 1.0| 1.0| 0|
# | 1.0| 1.0| 1|
# | 2.0| 1.0| 0|
# | 2.0| 1.0| 1|
# | 2.0| 2.0| 0|
# | 2.0| 2.0| 1|
# | 2.0| 2.0| 2|
# | 3.0| 1.0| 0|
# | 3.0| 1.0| 1|
# | 3.0| 2.0| 0|
# | 3.0| 2.0| 1|
# | 3.0| 2.0| 2|
# | 3.0| 3.0| 0|
# | 3.0| 3.0| 1|
# | 3.0| 3.0| 2|
# | 3.0| 3.0| 3|
# | 4.0| 1.0| 0|
# | 4.0| 1.0| 1|
# | 4.0| 2.0| 0|
# | 4.0| 2.0| 1|
# +----+-----+----+
# only showing top 20 rows
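Note that the question's desired listing is ordered by nbed, then ncar, then nbath; in PySpark an orderBy('nbed', 'ncar', 'nbath') on the filtered DataFrame should reproduce it. Without a Spark session handy, the same product-filter-sort logic can be sketched in plain Python:

```python
import itertools

beds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
baths = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
cars = [0, 1, 2, 3, 4, 5, 6, 7, 8]

rows = [r for r in itertools.product(beds, baths, cars) if r[0] >= r[1] >= r[2]]
# Reorder to match the question's listing: by nbed, then ncar, then nbath.
rows.sort(key=lambda r: (r[0], r[2], r[1]))
print(rows[:4])  # [(1.0, 1.0, 0), (1.0, 1.0, 1), (2.0, 1.0, 0), (2.0, 2.0, 0)]
```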