How to remove specific characters from a string in pyspark?

Question

I am trying to remove a specific character from a string but have not been able to find a proper solution. Could you please help me with how to do this?

I am loading the data into a dataframe using pyspark. One of the columns has an extra character that I want to remove.

Example:

|"\""warfarin was discontinued 3 days ago and xarelto was started when the INR was 2.7, and now the INR is 5.8, should Xarelto be continued or stopped?"|

But in the result I want only:

|"warfarin was discontinued 3 days ago and xarelto was started when the INR was 2.7, and now the INR is 5.8, should Xarelto be continued or stopped?"|

I am using the code below to write the dataframe to a file:

df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').save(output_path, escape='\"', sep='|',header='True',nullValue=None)

Answer 1

Try the following to remove punctuation from the start of your string:

from string import punctuation
mystring = """\""warfarin was discontinued 3 days ago and xarelto was started when the INR was 2.7, and now the INR is 5.8, should Xarelto be continued or stopped?"""
print(mystring.lstrip(punctuation))

Output:

warfarin was discontinued 3 days ago and xarelto was started when the INR was 2.7, and now the INR is 5.8, should Xarelto be continued or stopped?
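Note that lstrip(punctuation) removes every leading punctuation character, not only quotes. If only the stray quote and backslash characters should go, a narrower regular expression may be safer. The sketch below is plain Python; the helper name strip_leading_quotes and the shortened sample string are made up for illustration.

```python
import re

# Matches any run of double quotes and backslashes at the start of the value.
LEADING_JUNK = re.compile(r'^[\\"]+')

def strip_leading_quotes(value):
    """Remove leading quote/backslash characters; leave the rest untouched."""
    if value is None:
        return None
    return LEADING_JUNK.sub("", value)

# Shortened sample: backslash, two double quotes, then the text.
raw = '\\""warfarin was discontinued 3 days ago, should Xarelto be continued or stopped?'
cleaned = strip_leading_quotes(raw)
print(cleaned)

# In PySpark the same pattern could probably be applied to a column with
# pyspark.sql.functions.regexp_replace, e.g. (the column name "text" is an assumption):
#   df = df.withColumn("text", F.regexp_replace("text", r'^[\\"]+', ""))
```

Because the pattern is anchored with ^, quotes that appear later in the text are left alone.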

Answer 2

You can use a different escape character instead of '\\'; it can be changed to anything else. If you have the option to save the file in another format, prefer parquet (or orc) over csv.
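To see how the choice of escape character changes what ends up in the file, here is a small sketch using Python's standard csv module as a stand-in for Spark's CSV writer. That substitution is an assumption made only for illustration: Spark's writer has its own options and may differ in detail. The helper write_row and the sample value are hypothetical.

```python
import csv
import io

def write_row(value, **dialect):
    """Write a single pipe-delimited, fully quoted row and return it as text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="|", quoting=csv.QUOTE_ALL,
                        lineterminator="\n", **dialect)
    writer.writerow([value])
    return buf.getvalue()

# Sample value that already contains a leading double quote.
value = '"warfarin was discontinued'

# Backslash as the escape character: the embedded quote is written as \"
backslash_escaped = write_row(value, escapechar="\\", doublequote=False)

# No separate escape character: the embedded quote is doubled instead
doubled = write_row(value)

print(backslash_escaped, end="")
print(doubled, end="")
```

Either way, a quote that is already part of the data gets escaped on output, so cleaning the column before writing is usually the more direct fix.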
