How to replace special character using regex in pyspark
I am trying to replace `}{` with `},{` in a text file, but I get this error:
return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer
I am writing a Spark job in Python (PySpark).

Code:
from pyspark.sql import SparkSession
import re
import sys

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: PythonBLEDataParser.py <file>", file=sys.stderr)
        exit(-1)

    spark = SparkSession\
        .builder\
        .appName("PythonBLEDataParser")\
        .getOrCreate()

    toJson = spark.sparkContext.textFile("/root/vasi/spark-2.2.0-bin-hadoop2.7/vas_files/BLE_data_Sample.txt")
    toJson1 = re.sub("}{", "},{", toJson)  # I want to replace }{ with },{
    print(toJson1)
Sample data:
{"EdgeMac":"E4956E4E4015","BeaconMac":"247189F24DDB","RSSI":-59,"MPow":-76,"Timestamp":"1486889542495633","AdData":"0201060303AAFE1716AAFE00DD61687109E602F514C96D00000001F05C0000"}
{"EdgeMac":"E4956E4E4016","BeaconMac":"247189F24DDC","RSSI":-59,"MPow":-76,"Timestamp":"1486889542495633","AdData":"0201060303AAFE1716AAFE00DD61687109E602F514C96D00000001F05C0000"}
{"EdgeMac":"E4956E4E4017","BeaconMac":"247189F24DDD","RSSI":-59,"MPow":-76,"Timestamp":"1486889542495633","AdData":"0201060303AAFE1716AAFE00DD61687109E602F514C96D00000001F05C0000"}
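The `TypeError` happens because `spark.sparkContext.textFile` returns an RDD of line strings, not a string, and `re.sub` only accepts a string (or buffer). To stay with the RDD approach, the substitution has to run per line inside a transformation such as `map`. A minimal plain-Python sketch of the same substitution (the sample line below is made up):

```python
import re

# re.sub operates on one string at a time; textFile() yields an RDD
# of such strings, so re.sub must be applied to each line, not to
# the RDD object itself.
line = '{"EdgeMac":"E4956E4E4015","RSSI":-59}{"EdgeMac":"E4956E4E4016","RSSI":-60}'
fixed = re.sub(r"\}\{", "},{", line)
print(fixed)

# Inside Spark, the same call would go into a transformation
# (untested sketch using the question's variable names):
# toJson1 = toJson.map(lambda l: re.sub(r"\}\{", "},{", l))
```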
Tried using a dataframe instead of an RDD and it works. Just had to put an escape character before each brace (Java's regex engine, which `regexp_replace` uses, rejects an unescaped `{`):

from pyspark.sql.functions import regexp_replace

df_sample = spark.read.text('path/to/sample.txt')
df_sample.withColumn('value', regexp_replace(df_sample['value'], '\\}\\{', '},{')).collect()[0]
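For context on why this particular replacement helps: once every `}{` boundary becomes `},{`, the objects are comma-separated, and wrapping the result in `[...]` yields a parseable JSON array. A small hypothetical check with made-up values:

```python
import json
import re

# Made-up two-record input mimicking the }{ boundary between objects.
raw = '{"RSSI":-59,"MPow":-76}{"RSSI":-60,"MPow":-75}'

# After the substitution, wrapping in [...] produces valid JSON.
joined = "[" + re.sub(r"\}\{", "},{", raw) + "]"
records = json.loads(joined)
print(len(records))  # 2
```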