
Convert Data Frame to string in pyspark

I would like to convert a pandas DataFrame to a string so that I can use it in a regex.

Input Data:

SRAVAN
KUMAR
RAKESH
SOHAN

import re
import pandas as pd

# file is a pyspark.sql.dataframe.DataFrame; toPandas() converts it to a pandas DataFrame
file = spark.read.text("hdfs://test.txt")
pands = file.toPandas()

# re.sub expects a string, so passing the DataFrame here raises a TypeError
result = re.sub(r"\n", "", pands, 0, re.MULTILINE)
print(result)

Expected output:

SRAVANKUMAR
RAKESHSOHAN
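
Since toPandas() returns a DataFrame rather than a string, one way to make this approach work is to join the pandas column into a single string before applying the regex. A minimal sketch, assuming the default 'value' column name that spark.read.text produces:

text = "\n".join(pands["value"])  # join the rows of the 'value' column into one newline-separated string
result = re.sub(r"\n", "", text)  # then strip every newline
print(result)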

You don't need Pandas for this. Spark has its own regex replace function.

This will replace \n in every row with an empty string.

By default, spark.read.text will read each line of the file into one dataframe row, so you cannot have a multi-line string value, anyway...

from pyspark.sql.functions import col, regexp_replace

# each line of the file becomes one row in a column named 'value'
df = spark.read.text("hdfs://test.txt")
# replace newlines with an empty string; keep the column name 'value' for later access
df = df.select(regexp_replace(col('value'), '\n', '').alias('value'))
df.show()

To get the dataframe into a joined string, collect the dataframe. But this should be avoided for large datasets.

s = '\n'.join(d['value'] for d in df.collect())
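
For the small sample file above, and assuming rows come back in file order (typical for a small single-file read), s would contain the four names separated by newlines:

print(s)
# SRAVAN
# KUMAR
# RAKESH
# SOHAN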
