
CSV load into Dataframe with filename as additional column in pyspark

I'm trying to create a dataframe from a directory full of CSV files, but I want to keep the filename of each file on the dataframe as an additional column. Is that possible in pyspark without using pandas? I also want to remove the path from the filename.

from pyspark.sql.functions import input_file_name

df = spark.read.option("delimiter", "\t").csv(mount_point_input)
df = df.withColumn("filename", input_file_name())

I tried using input_file_name(), but all rows in the dataframe end up with the same filename.

Input:

False    2021-06-05T14:45:09     Server       True
True     2021-06-02T21:32:42     Server       True

Output:

+-----+-------------------+------+----+--------------------------------+
False  2021-06-05T14:45:09  Server  True  /2021-06-02-general/c32d3f47.csv
False  2021-06-02T21:32:42  Server  True  /2021-06-02-general/c32d3f47.csv
+-----+-------------------+------+----+--------------------------------+

Expected output:

+-----+-------------------+------+----+------------+
False  2021-06-05T14:45:09  Server  True  c32d3f47.csv
False  2021-06-02T21:32:42  Server  True  c32d3f48.csv
+-----+-------------------+------+----+------------+

You can use os.path.basename in a UDF:

>>> from pyspark.sql.functions import input_file_name,udf
>>> from pyspark.sql.types import StringType
>>> from os.path import basename
>>> 
>>> data = [("/home/user/test/File1.txt",10), 
...         ("/home/user/test/File2.txt",20), 
...         ("/home/user/test/File3.txt",30), 
...         ("/home/user/test/File4.txt",40),
...         ("/2021-06-02-general/c32d3f47.csv",50),
...         ("/2021-06-02-general/c32d3f47.csv",50)
...         ]
>>> 
>>> 
>>> cols = ["file_path","dummy_value"]
>>> testDF = spark.createDataFrame(data=data, schema = cols)
>>> 
>>> testDF.show(truncate=False)
+--------------------------------+-----------+
|file_path                       |dummy_value|
+--------------------------------+-----------+
|/home/user/test/File1.txt       |10         |
|/home/user/test/File2.txt       |20         |
|/home/user/test/File3.txt       |30         |
|/home/user/test/File4.txt       |40         |
|/2021-06-02-general/c32d3f47.csv|50         |
|/2021-06-02-general/c32d3f47.csv|50         |
+--------------------------------+-----------+

>>> 
>>> 
>>> @udf(StringType())
... def return_filename(inp):
...     if inp:
...       return basename(inp)
...     else:
...       return None
... 
>>> testDF = testDF.withColumn("file_name", return_filename('file_path'))
>>> testDF.show(truncate=False)
+--------------------------------+-----------+------------+
|file_path                       |dummy_value|file_name   |
+--------------------------------+-----------+------------+
|/home/user/test/File1.txt       |10         |File1.txt   |
|/home/user/test/File2.txt       |20         |File2.txt   |
|/home/user/test/File3.txt       |30         |File3.txt   |
|/home/user/test/File4.txt       |40         |File4.txt   |
|/2021-06-02-general/c32d3f47.csv|50         |c32d3f47.csv|
|/2021-06-02-general/c32d3f47.csv|50         |c32d3f47.csv|
+--------------------------------+-----------+------------+
