pyspark split a column to multiple columns without pandas
My question is how to split a column into multiple columns. I don't know why df.toPandas() does not work.

For example, I would like to change 'df_test' into 'df_test2'. I saw many examples using the pandas module. Is there another way? Thank you in advance.
df_test = sqlContext.createDataFrame([
    (1, '14-Jul-15'),
    (2, '14-Jun-15'),
    (3, '11-Oct-15'),
], ('id', 'date'))
df_test2

id day month year
 1  14   Jul   15
 2  14   Jun   15
 3  11   Oct   15
Spark >= 2.2

You can skip unix_timestamp and cast, and use to_date or to_timestamp instead:
from pyspark.sql.functions import to_date, to_timestamp
df_test.withColumn("date", to_date("date", "dd-MMM-yy")).show()
## +---+----------+
## | id| date|
## +---+----------+
## | 1|2015-07-14|
## | 2|2015-06-14|
## | 3|2015-10-11|
## +---+----------+
df_test.withColumn("date", to_timestamp("date", "dd-MMM-yy")).show()
## +---+-------------------+
## | id| date|
## +---+-------------------+
## | 1|2015-07-14 00:00:00|
## | 2|2015-06-14 00:00:00|
## | 3|2015-10-11 00:00:00|
## +---+-------------------+
and then apply the other datetime functions shown below.
Spark < 2.2
It is not possible to derive multiple top-level columns in a single access. You can use struct or collection types with a UDF like this:
from pyspark.sql.types import StringType, StructType, StructField
from pyspark.sql import Row
from pyspark.sql.functions import udf, col

schema = StructType([
    StructField("day", StringType(), True),
    StructField("month", StringType(), True),
    StructField("year", StringType(), True)
])

def split_date_(s):
    try:
        d, m, y = s.split("-")
        return d, m, y
    except (AttributeError, ValueError):
        # not a string, or not three "-"-separated parts
        return None

split_date = udf(split_date_, schema)
transformed = df_test.withColumn("date", split_date(col("date")))
transformed.printSchema()
## root
## |-- id: long (nullable = true)
## |-- date: struct (nullable = true)
## | |-- day: string (nullable = true)
## | |-- month: string (nullable = true)
## | |-- year: string (nullable = true)
but this is not only quite verbose in PySpark, it is also expensive.
For date-based transformations you can simply use built-in functions:
from pyspark.sql.functions import unix_timestamp, dayofmonth, year, date_format

transformed = (df_test
    .withColumn("ts",
        unix_timestamp(col("date"), "dd-MMM-yy").cast("timestamp"))
    .withColumn("day", dayofmonth(col("ts")).cast("string"))
    .withColumn("month", date_format(col("ts"), "MMM"))
    .withColumn("year", year(col("ts")).cast("string"))
    .drop("ts"))
Similarly, you could use regexp_extract to split the date string.
See also Derive multiple columns from a single column in a Spark DataFrame.
Note: If you use a version not patched against SPARK-11724, this will require a correction after unix_timestamp(...) and before cast("timestamp").
The solution here is to use the pyspark.sql.functions.split() function.
import pyspark.sql.functions

df = sqlContext.createDataFrame([
    (1, '14-Jul-15'),
    (2, '14-Jun-15'),
    (3, '11-Oct-15'),
], ('id', 'date'))

split_col = pyspark.sql.functions.split(df['date'], '-')
df = df.withColumn('day', split_col.getItem(0))
df = df.withColumn('month', split_col.getItem(1))
df = df.withColumn('year', split_col.getItem(2))
df = df.drop("date")