Convert Data Frame to string in pyspark
I would like to convert a Pandas DataFrame to a string so that I can use it in a regex.

Input data:
SRAVAN
KUMAR
RAKESH
SOHAN
import re
import pandas as pd
file = spark.read.text("hdfs://test.txt")
pands = file.toPandas()
# schema: pyspark.sql.dataframe.DataFrame
result = re.sub(r"\n", "", pands, 0, re.MULTILINE)
print(result)
Expected output:
SRAVANKUMAR
RAKESHSOHAN
You don't need Pandas for this. Spark has its own regex replace function, regexp_replace, which will replace \n in every row with an empty string.
Note that by default, spark.read.text reads each line of the file into its own dataframe row, so a row value can never contain a multi-line string anyway.
from pyspark.sql.functions import col, regexp_replace
df = spark.read.text("hdfs://test.txt")
df = df.select(regexp_replace(col('value'), '\n', '').alias('value'))
df.show()
To get the dataframe into a single joined string, collect the dataframe. This should be avoided for large datasets, since collect brings every row back to the driver.
s = '\n'.join(d['value'] for d in df.collect())
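If the goal is the pairwise output shown in the question (SRAVANKUMAR, RAKESHSOHAN), one way is to concatenate consecutive lines after collecting. This is just a sketch: it assumes the collected rows come back in file order, which typically holds for a small single-partition text file but is not guaranteed by Spark in general.

```python
# Stand-in for the collected column values, i.e. [d['value'] for d in df.collect()].
lines = ["SRAVAN", "KUMAR", "RAKESH", "SOHAN"]

# Join each pair of consecutive lines into one string.
pairs = [''.join(lines[i:i + 2]) for i in range(0, len(lines), 2)]

print('\n'.join(pairs))
# SRAVANKUMAR
# RAKESHSOHAN
```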