
Convert Data Frame to string in pyspark

I would like to convert a Pandas DataFrame to a string so that I can use it with a regex.

Input Data:

SRAVAN
KUMAR
RAKESH
SOHAN

import re
import pandas as pd

file = spark.read.text("hdfs://test.txt")
pands = file.toPandas()

# type(file): pyspark.sql.dataframe.DataFrame
result = re.sub(r"\n", "", pands, 0, re.MULTILINE)
print(result)

Expected output:

SRAVANKUMAR
RAKESHSOHAN

You don't need Pandas for this. Spark has its own regex replace function.

This will replace \n in every row with an empty string.

By default, spark.read.text reads each line of the file into its own dataframe row, so a row's value can never contain a multi-line string anyway.
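To see why, here is a minimal pure-Python sketch (no Spark required) of the same behaviour: splitting text into lines already removes the newlines, just as spark.read.text leaves no '\n' inside each row's value.

```python
# Line-based reading strips the newline from each line, mirroring how
# spark.read.text yields one row per line with no '\n' left in the value.
raw = "SRAVAN\nKUMAR\nRAKESH\nSOHAN\n"
lines = raw.splitlines()
assert all("\n" not in line for line in lines)
print(lines)  # ['SRAVAN', 'KUMAR', 'RAKESH', 'SOHAN']
```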

from pyspark.sql.functions import col, regexp_replace

df = spark.read.text("hdfs://test.txt")
# Alias the result back to 'value' so the column keeps a usable name
df = df.select(regexp_replace(col('value'), '\n', '').alias('value'))
df.show()

To join the dataframe's rows into a single string, collect the dataframe. Avoid this for large datasets, since collect pulls every row back to the driver.

s = '\n'.join(d['value'] for d in df.collect())
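If you do want to stay on the Pandas route the question attempted, note that re.sub expects a string, not a DataFrame, so join the column into one string first. A minimal sketch, using a hand-built DataFrame as a stand-in for file.toPandas() (the 'value' column name matches what spark.read.text produces):

```python
import re
import pandas as pd

# Stand-in for file.toPandas(); spark.read.text names its single column 'value'
pands = pd.DataFrame({"value": ["SRAVAN", "KUMAR", "RAKESH", "SOHAN"]})

# re.sub works on strings, not DataFrames: build one string, then substitute
text = "\n".join(pands["value"])
result = re.sub(r"\n", "", text)
print(result)  # SRAVANKUMARRAKESHSOHAN
```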
