PySpark: Split a column of lists into multiple columns
This question is similar to one already asked for Pandas here. I am running functions on a Google Cloud DataProc cluster, so I cannot convert to pandas.
I want to transform the following:
+----+----------------------------------+-----+---------+------+--------------------+-------------+
| key| value|topic|partition|offset| timestamp|timestampType|
+----+----------------------------------+-----+---------+------+--------------------+-------------+
|null|["sepal_length","sepal_width",...]| iris| 0| 289|2021-04-11 22:32:...| 0|
|null|["5.0","3.5","1.3","0.3","setosa"]| iris| 0| 290|2021-04-11 22:32:...| 0|
|null|["4.5","2.3","1.3","0.3","setosa"]| iris| 0| 291|2021-04-11 22:32:...| 0|
|null|["4.4","3.2","1.3","0.2","setosa"]| iris| 0| 292|2021-04-11 22:32:...| 0|
|null|["5.0","3.5","1.6","0.6","setosa"]| iris| 0| 293|2021-04-11 22:32:...| 0|
|null|["5.1","3.8","1.9","0.4","setosa"]| iris| 0| 294|2021-04-11 22:32:...| 0|
|null|["4.8","3.0","1.4","0.3","setosa"]| iris| 0| 295|2021-04-11 22:32:...| 0|
+----+----------------------------------+-----+---------+------+--------------------+-------------+
into this:
+--------------+-------------+--------------+-------------+-------+
| sepal_length | sepal_width | petal_length | petal_width | class |
+--------------+-------------+--------------+-------------+-------+
| 5.0 | 3.5 | 1.3 | 0.3 | setosa|
| 4.5 | 2.3 | 1.3 | 0.3 | setosa|
| 4.4 | 3.2 | 1.3 | 0.2 | setosa|
| 5.0 | 3.5 | 1.6 | 0.6 | setosa|
| 5.1 | 3.8 | 1.9 | 0.4 | setosa|
| 4.8 | 3.0 | 1.4 | 0.3 | setosa|
+--------------+-------------+--------------+-------------+-------+
How can I do this? Any help would be greatly appreciated!
I took the long way round, since I am relatively new to PySpark. Happy to hear if there is a shorter way (see the sketch after the output below).
1: Recreated your dataframe in pandas
import pandas as pd

df = pd.DataFrame({"value": [
    '["sepal_length","sepal_width","petal_length","petal_width","class"]',
    '["5.0","3.5","1.3","0.3","setosa"]',
    '["4.5","2.3","1.3","0.3","setosa"]',
    '["4.4","3.2","1.3","0.2","setosa"]',
]})
2: Converted the pandas dataframe to a Spark dataframe
sdf = spark.createDataFrame(df)
3: Stripped the square brackets and double quotes
from pyspark.sql.functions import col, regexp_replace

sdf = sdf.withColumn('value', regexp_replace(col('value'), '[\\[\\"\\]]', ""))
sdf.show(truncate=False)
4: Split the dataframe on ","
from pyspark.sql import functions as f

df_split = sdf.select(f.split(sdf.value, ",")) \
    .rdd.flatMap(lambda x: x) \
    .toDF(schema=["sepal_length", "sepal_width", "petal_length", "petal_width", "class"])
5: Filtered out the non-numeric header row
df_split = df_split.filter(df_split.sepal_length != "sepal_length")
df_split.show()
+------------+-----------+------------+-----------+------+
|sepal_length|sepal_width|petal_length|petal_width| class|
+------------+-----------+------------+-----------+------+
| 5.0| 3.5| 1.3| 0.3|setosa|
| 4.5| 2.3| 1.3| 0.3|setosa|
| 4.4| 3.2| 1.3| 0.2|setosa|
+------------+-----------+------------+-----------+------+
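For reference, here is a shorter variant of the same idea that stays in the DataFrame API. It is a minimal sketch (assuming the same sdf and header row as above) that splits once and then indexes into the array column with getItem:

from pyspark.sql import functions as f

cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
# strip the brackets and quotes, split on ",", then index into the array column
parts = f.split(f.regexp_replace(f.col("value"), '[\\[\\"\\]]', ""), ",")
df_split = sdf.select([parts.getItem(i).alias(c) for i, c in enumerate(cols)])
# drop the header row, as in step 5 above
df_split = df_split.filter(df_split.sepal_length != "sepal_length")

This avoids the DataFrame-to-RDD-and-back round-trip, which also keeps everything in Catalyst-optimizable operations.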
After a lot of searching, I finally wrote code that solves it the "dataproc" way. The code is below:
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import split, explode, col, regexp_replace, udf
from pyspark.sql import functions as f
spark = SparkSession \
    .builder \
    .appName("appName") \
    .getOrCreate()
spark.sparkContext.setLogLevel("WARN")
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "ip:port") \
    .option("subscribe", "topic-name") \
    .load()
data = df.select([c for c in df.columns if c in ["value", "offset"]])
def convertType(val):
    # Kafka delivers the payload as raw bytes: decode, then split on ","
    arr = val.decode("utf-8").split(",")
    print(arr[0], arr[1], arr[2], arr[3])
    print("=" * 50)
    # each numeric field arrives wrapped in quote characters (e.g. '"2.56"),
    # so slice them off before casting to float
    arr[0], arr[1], arr[2], arr[3] = float(arr[0][2:-1]), float(arr[1][2:-1]), float(arr[2][2:-1]), float(arr[3][2:-1])
    # drop the trailing character from the class label
    arr[4] = arr[4][:-1]
    return arr
# one accessor per element of the converted array
def get_sepal_length(arr):
    return arr[0]

def get_sepal_width(arr):
    return arr[1]

def get_petal_length(arr):
    return arr[2]

def get_petal_width(arr):
    return arr[3]

def get_classes(arr):
    return arr[4][2:-1]
convertUDF = udf(lambda z: convertType(z))
getSL = udf(lambda z: get_sepal_length(z))
getSW = udf(lambda z: get_sepal_width(z))
getPL = udf(lambda z: get_petal_length(z))
getPW = udf(lambda z: get_petal_width(z))
getC = udf(lambda z: get_classes(z))
df_new = data.select(col("offset"), \
                     convertUDF(col("value")).alias("value"))
df_new = df_new.withColumn("sepal_length", getSL(col("value")).cast("float"))
df_new = df_new.withColumn("sepal_width", getSW(col("value")).cast("float"))
df_new = df_new.withColumn("petal_length", getPL(col("value")).cast("float"))
df_new = df_new.withColumn("petal_width", getPW(col("value")).cast("float"))
df_new = df_new.withColumn("classes", getC(col("value")))
query = df_new \
    .writeStream \
    .format("console") \
    .start()
query.awaitTermination()
Note that the arr[i][2:-1], ... slicing is due to the format of the data in df.value; in my case it was '"2.56". udfs come with significant limitations, and this verbose udf approach was the best I could find :).
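As an aside, the conversion can also be sketched without udfs, using only built-in column functions. This is a sketch assuming the same payload layout described above (each field wrapped as '"2.56"); it casts the Kafka bytes to string, strips the wrapping characters, splits, and casts:

from pyspark.sql import functions as f

# cast bytes to string, strip [, ], " and ' characters, then split on ","
clean = f.regexp_replace(f.col("value").cast("string"), "[\\[\\]\"']", "")
parts = f.split(clean, ",")
df_new = data.select(
    f.col("offset"),
    parts.getItem(0).cast("float").alias("sepal_length"),
    parts.getItem(1).cast("float").alias("sepal_width"),
    parts.getItem(2).cast("float").alias("petal_length"),
    parts.getItem(3).cast("float").alias("petal_width"),
    parts.getItem(4).alias("classes"),
)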