Pyspark Split Dataframe string column into multiple columns
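I am running a Spark Structured Streaming example on Spark 3.0.0, using Twitter data. I have pushed the Twitter data into Kafka, and a single record looks like this:
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
Here every field is separated by '|'. The fields are:
Date Time
User ID
Tweet
Location
Now, when I read this message in Spark, I get a DataFrame like this: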
key | value
-----+-------------------------
| 2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India
Following this answer, I added the following block of code to my application.
split_col = pyspark.sql.functions.split(df['value'], '|')
df = df.withColumn("Tweet Time", split_col.getItem(0))
df = df.withColumn("User ID", split_col.getItem(1))
df = df.withColumn("Tweet Text", split_col.getItem(2))
df = df.withColumn("Location", split_col.getItem(3))
df = df.drop("key")
But it gives me output like this:
A | B | C | D | E |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+---------+--------+-----+
2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2 | 0 | 2 | 0 |
But I want output like this:
Tweet Time | User ID | Tweet text | Location |
-----------------------+-------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+-------------------+
2020-07-21 10:48:19 | 1265200268284588034 | RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,… | Hyderabad, India |
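This happens because split takes a pattern: a string representing a regular expression, and that string must be a Java regular expression. In a regular expression an unescaped | is the alternation operator, so it does not match a literal pipe character.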
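The effect of the unescaped pipe can be seen without Spark. The following is a minimal sketch that uses Python's standard `re` module (Python 3.7 or later) as a stand-in for the Java regex engine that Spark uses; the two engines differ in how they treat empty matches, so the exact pieces produced by the unescaped pattern will not be identical to Spark's, but the cause is the same. The variable names are illustrative only.

```python
import re

# Sample record from the question, fields separated by '|'.
record = (
    "2020-07-21 10:48:19|1265200268284588034|"
    "RT @narendramodi: Had an extensive interaction with CEO of @IBM, "
    "Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|"
    "Hyderabad, India"
)

# Unescaped: "|" means "empty pattern OR empty pattern", which matches the
# empty string at every position, so the record is cut between characters.
unescaped = re.split("|", record)

# Escaped or wrapped in a character class: matches a literal pipe.
escaped = re.split(r"\|", record)
char_class = re.split("[|]", record)

print(len(unescaped))  # far more than 4 pieces
print(escaped)         # the four expected fields
```

Both `\|` and `[|]` yield the same four fields, which is why either form works as the split pattern.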
Use '\\|' or '[|]' to split on the pipe:
from pyspark.sql.functions import split

split_col = split(df.value, '\\|')
df = df.withColumn("Tweet Time", split_col.getItem(0))\
.withColumn("User ID", split_col.getItem(1))\
.withColumn("Tweet Text", split_col.getItem(2))\
.withColumn("Location", split_col.getItem(3))\
.drop("key")
Output:
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|value |Tweet Time |User ID |Tweet Text |Location |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|2020-07-21 10:48:19|1265200268284588034|RT @narendramodi: Had an extensive interaction with CEO of @IBM, Mr. @ArvindKrishna. We discussed several subjects relating to technology,…|Hyderabad, India|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+-------------------+-------------------------------------------------------------------------------------------------------------------------------------------+----------------+