Load CSVs into a DataFrame with the filename as an additional column in PySpark
I'm trying to create a DataFrame from a directory full of CSV files, but I want to keep each file's name as an additional column on the DataFrame. Is that possible in PySpark without using pandas? I also want to remove the path from the filename.
from pyspark.sql.functions import input_file_name

df = spark.read.option("delimiter", "\t").csv(mount_point_input)
df = df.withColumn("filename", input_file_name())
I tried using input_file_name(), but all rows in the DataFrame get the same filename.
Input:
False 2021-06-05T14:45:09 Server True
True 2021-06-02T21:32:42 Server True
Output:
+-----+-------------------+------+----+--------------------------------+
|False|2021-06-05T14:45:09|Server|True|/2021-06-02-general/c32d3f47.csv|
|False|2021-06-02T21:32:42|Server|True|/2021-06-02-general/c32d3f47.csv|
+-----+-------------------+------+----+--------------------------------+
Expected output:
+-----+-------------------+------+----+------------+
|False|2021-06-05T14:45:09|Server|True|c32d3f47.csv|
|False|2021-06-02T21:32:42|Server|True|c32d3f48.csv|
+-----+-------------------+------+----+------------+
You can use os.path.basename in a UDF -
>>> from pyspark.sql.functions import input_file_name,udf
>>> from pyspark.sql.types import StringType
>>> from os.path import basename
>>>
>>> data = [("/home/user/test/File1.txt",10),
... ("/home/user/test/File2.txt",20),
... ("/home/user/test/File3.txt",30),
... ("/home/user/test/File4.txt",40),
... ("/2021-06-02-general/c32d3f47.csv",50),
... ("/2021-06-02-general/c32d3f47.csv",50)
... ]
>>>
>>>
>>> cols = ["file_path","dummy_value"]
>>> testDF = spark.createDataFrame(data=data, schema = cols)
>>>
>>> testDF.show(truncate=False)
+--------------------------------+-----------+
|file_path |dummy_value|
+--------------------------------+-----------+
|/home/user/test/File1.txt |10 |
|/home/user/test/File2.txt |20 |
|/home/user/test/File3.txt |30 |
|/home/user/test/File4.txt |40 |
|/2021-06-02-general/c32d3f47.csv|50 |
|/2021-06-02-general/c32d3f47.csv|50 |
+--------------------------------+-----------+
>>>
>>>
>>> @udf(StringType())
... def return_filename(inp):
... if inp:
... return basename(inp)
... else:
... return None
...
>>> testDF = testDF.withColumn("file_name", return_filename('file_path'))
>>> testDF.show(truncate=False)
+--------------------------------+-----------+------------+
|file_path |dummy_value|file_name |
+--------------------------------+-----------+------------+
|/home/user/test/File1.txt |10 |File1.txt |
|/home/user/test/File2.txt |20 |File2.txt |
|/home/user/test/File3.txt |30 |File3.txt |
|/home/user/test/File4.txt |40 |File4.txt |
|/2021-06-02-general/c32d3f47.csv|50 |c32d3f47.csv|
|/2021-06-02-general/c32d3f47.csv|50 |c32d3f47.csv|
+--------------------------------+-----------+------------+
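If you'd rather avoid the serialization overhead of a Python UDF, a sketch of an alternative is to strip the directory with Spark's built-in `regexp_extract`, applied directly to `input_file_name()`: `regexp_extract(input_file_name(), r"[^/]+$", 0)`. The snippet below demonstrates the same extraction in plain Python so you can see what the pattern matches; `strip_path` is a hypothetical helper name, not part of any API.

```python
import re

# Pattern matching the final path component: the run of
# non-slash characters anchored at the end of the string.
FILENAME_RE = r"[^/]+$"

def strip_path(path):
    """Return everything after the last '/', or '' if the path ends in '/'."""
    m = re.search(FILENAME_RE, path)
    return m.group(0) if m else ""

print(strip_path("/2021-06-02-general/c32d3f47.csv"))  # c32d3f47.csv
print(strip_path("/home/user/test/File1.txt"))         # File1.txt
```

In Spark this becomes a single column expression, e.g. `df.withColumn("filename", regexp_extract(input_file_name(), r"[^/]+$", 0))`, which stays entirely in the JVM instead of round-tripping each row through Python.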