
Convert strings of tags to binary vector pyspark

I have data that looks like this:

| Id | ----Tags---- | some_text |
| 0  | <a><b>       | ex1       |
| 1  | <a><c>       | ex2       |
| 2  | <b><c>       | ex3       |

I would like it to end up looking like this:

| Id | a | b | c | some_text |
| 0  | 1 | 1 | 0 | ex1       |
| 1  | 1 | 0 | 1 | ex2       |
| 2  | 0 | 1 | 1 | ex3       |

I would like to use pyspark for the solution. Any ideas on how to approach this?

If you don't already know the expected categorical values, you can use pyspark.sql.functions.udf to split the tag string into an array of values and pyspark.sql.functions.explode to turn each tag into its own row. You can then pivot the exploded values into columns:

# required imports
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StringType
import re

# regex pattern to split 'tagged values'
pat = re.compile('<(.*?)>')

# udf to extract the tag names into an array of strings
split_f = F.udf(lambda s: pat.findall(s), ArrayType(StringType()))
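A note on the regex: with a capturing group, pat.split keeps the (possibly empty) text around each match, which would surface as an empty-named pivot column, while pat.findall returns only the captured tag names. A quick check with plain re:

```python
import re

pat = re.compile('<(.*?)>')

# split with a capturing group keeps the empty strings between matches
print(pat.split('<a><b>'))    # ['', 'a', '', 'b', '']

# findall returns only the captured groups
print(pat.findall('<a><b>'))  # ['a', 'b']
```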

# sample data
df = spark.createDataFrame([(0,'<a><b>','ex1'),(1,'<a><c>','ex2')], ['Id', '---Tags---', 'some_text'])

+---+----------+---------+
| Id|---Tags---|some_text|
+---+----------+---------+
|  0|    <a><b>|      ex1|
|  1|    <a><c>|      ex2|
+---+----------+---------+

(df.withColumn('exploded', F.explode(split_f(F.col('---Tags---'))))
   .groupby('Id').pivot('exploded').count().na.fill(0).show())

+---+---+---+---+
| Id|  a|  b|  c|
+---+---+---+---+
|  0|  1|  1|  0|
|  1|  1|  0|  1|
+---+---+---+---+
