
Convert Data Frame to string in pyspark

I would like to convert a Pandas DataFrame to a string so that I can use it with a regex.

Input Data:

SRAVAN
KUMAR
RAKESH
SOHAN

import re
import pandas as pd

file = spark.read.text("hdfs://test.txt")
pands = file.toPandas()

# type(file): pyspark.sql.dataframe.DataFrame
result = re.sub(r"\n", "", pands, 0, re.MULTILINE)
print(result)

Expected output:

SRAVANKUMAR
RAKESHSOHAN

You don't need Pandas for this. Spark has its own regex replace function.

This will replace \n in every row with an empty string.

By default, spark.read.text reads each line of the file into its own dataframe row, so a row's value can never contain a multi-line string anyway.
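To see why, here is a minimal pure-Python sketch (no Spark required) of the same behaviour: splitting text into lines already removes the newlines, just as spark.read.text leaves no '\n' inside each row's value.

```python
# Line-based reading strips the newline from each line, mirroring how
# spark.read.text yields one row per line with no '\n' left in the value.
raw = "SRAVAN\nKUMAR\nRAKESH\nSOHAN\n"
lines = raw.splitlines()
assert all("\n" not in line for line in lines)
print(lines)  # ['SRAVAN', 'KUMAR', 'RAKESH', 'SOHAN']
```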

from pyspark.sql.functions import col, regexp_replace

df = spark.read.text("hdfs://test.txt")
# Alias the result back to 'value' so the column keeps a usable name
df = df.select(regexp_replace(col('value'), '\n', '').alias('value'))
df.show()

To join the dataframe's rows into a single string, collect the dataframe. Avoid this for large datasets, since collect pulls every row back to the driver.

s = '\n'.join(d['value'] for d in df.collect())
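If you do want to stay on the Pandas route the question attempted, note that re.sub expects a string, not a DataFrame, so join the column into one string first. A minimal sketch, using a hand-built DataFrame as a stand-in for file.toPandas() (the 'value' column name matches what spark.read.text produces):

```python
import re
import pandas as pd

# Stand-in for file.toPandas(); spark.read.text names its single column 'value'
pands = pd.DataFrame({"value": ["SRAVAN", "KUMAR", "RAKESH", "SOHAN"]})

# re.sub works on strings, not DataFrames: build one string, then substitute
text = "\n".join(pands["value"])
result = re.sub(r"\n", "", text)
print(result)  # SRAVANKUMARRAKESHSOHAN
```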
