
How to split a spark dataframe column string?

I have a dataframe that looks like this:

|-------------------------------------------|---------|
| path                                      | content |
|-------------------------------------------|---------|
| /root/path/main_folder1/folder1/path1.txt | Val 1   |
|-------------------------------------------|---------|
| /root/path/main_folder1/folder2/path2.txt | Val 1   |
|-------------------------------------------|---------|
| /root/path/main_folder1/folder2/path3.txt | Val 1   |
|-------------------------------------------|---------|

I want to split the values in the path column on "/" and keep only the part up to /root/path/main_folder1. The output I want is:

|-------------------------------------------|---------|-------------------------|
| path                                      | content | root_path               |
|-------------------------------------------|---------|-------------------------|
| /root/path/main_folder1/folder1/path1.txt | Val 1   | /root/path/main_folder1 |
|-------------------------------------------|---------|-------------------------|
| /root/path/main_folder1/folder2/path2.txt | Val 1   | /root/path/main_folder1 |
|-------------------------------------------|---------|-------------------------|
| /root/path/main_folder1/folder2/path3.txt | Val 1   | /root/path/main_folder1 |
|-------------------------------------------|---------|-------------------------|

I know I have to use withColumn, split, and regexp_extract, but I don't quite understand how to limit the output of regexp_extract.

What do I have to do to get the desired output?

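You can use a regular expression to extract the first three directory levels. The pattern matches "/" followed by word characters three times, anchored at the start of the string, and capture group 1 returns that whole prefix.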

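from pyspark.sql import functions as F
# Capture group 1 holds the first three "/"-separated levels of the path.
df.withColumn("root_path", F.regexp_extract(F.col("path"), r"^((/\w*){3})", 1))\
    .show(truncate=False)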

Output:

+-----------------------------------------+-------+-----------------------+
|path                                     |content|root_path              |
+-----------------------------------------+-------+-----------------------+
|/root/path/main_folder1/folder1/path1.txt|val 1  |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path2.txt|val 2  |/root/path/main_folder1|
|/root/path/main_folder1/folder2/path3.txt|val 3  |/root/path/main_folder1|
+-----------------------------------------+-------+-----------------------+
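
To see what the pattern captures without starting a Spark session, here is a minimal sketch that uses Python's built-in re module as a stand-in for the Java regex engine Spark uses (for a pattern this simple the two should behave the same, but that is an assumption). The helper names extract_root and split_root are hypothetical and only for illustration; split_root shows the split-based idea the question mentions.

```python
import re

# Same pattern as in the answer: "/" plus word characters, repeated three times.
ROOT_PATTERN = re.compile(r"^((/\w*){3})")

def extract_root(path: str) -> str:
    """Return capture group 1, or "" when the pattern does not match
    (regexp_extract also yields an empty string on no match)."""
    match = ROOT_PATTERN.match(path)
    return match.group(1) if match else ""

def split_root(path: str, levels: int = 3) -> str:
    """Split-based alternative: keep the leading empty piece plus `levels` parts."""
    return "/".join(path.split("/")[: levels + 1])

paths = [
    "/root/path/main_folder1/folder1/path1.txt",
    "/root/path/main_folder1/folder2/path2.txt",
    "/root/path/main_folder1/folder2/path3.txt",
]

for p in paths:
    print(extract_root(p), split_root(p))
```

Note that \w matches only letters, digits, and underscores, so a directory name containing a hyphen or a dot within the first three levels would make the regex version return an empty string; a character class such as [^/]* is one way to loosen that. In Spark, the split-based variant would presumably combine split, slice, and concat_ws from pyspark.sql.functions.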

