
Convert Data Frame to string in pyspark

I would like to convert a pandas DataFrame to a string so that I can use it in a regex.

Input Data:

SRAVAN
KUMAR
RAKESH
SOHAN

import re
import pandas as pd

# file is a pyspark.sql.dataframe.DataFrame; toPandas() converts it to a pandas DataFrame
file = spark.read.text("hdfs://test.txt")
pands = file.toPandas()

# re.sub expects a string, so passing the DataFrame here raises a TypeError
result = re.sub(r"\n", "", pands, 0, re.MULTILINE)
print(result)

Expected output:

SRAVANKUMAR
RAKESHSOHAN
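
Since toPandas() returns a DataFrame rather than a string, one way to make this approach work is to join the pandas column into a single string before applying the regex. A minimal sketch, assuming the default 'value' column name that spark.read.text produces:

text = "\n".join(pands["value"])  # join the rows of the 'value' column into one newline-separated string
result = re.sub(r"\n", "", text)  # then strip every newline
print(result)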

You don't need Pandas for this. Spark has its own regex replace function.

This will replace \n in every row with an empty string.

By default, spark.read.text will read each line of the file into one dataframe row, so you cannot have a multi-line string value, anyway...

from pyspark.sql.functions import col, regexp_replace

# each line of the file becomes one row in a column named 'value'
df = spark.read.text("hdfs://test.txt")
# replace newlines with an empty string; keep the column name 'value' for later access
df = df.select(regexp_replace(col('value'), '\n', '').alias('value'))
df.show()

To get the dataframe into a joined string, collect the dataframe. But this should be avoided for large datasets.

s = '\n'.join(d['value'] for d in df.collect())
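
For the small sample file above, and assuming rows come back in file order (typical for a small single-file read), s would contain the four names separated by newlines:

print(s)
# SRAVAN
# KUMAR
# RAKESH
# SOHAN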
