How to format CSV data by removing quotes and double-quotes around fields
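I am working with a dataset in which, apparently, every row is wrapped in double quotes. I did not notice this at first because the file opens in Excel by default when I open it from the browser.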
The raw dataset looks like this:
"age;"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y""  ----header
58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"  --row
I am using the following code:
val bank = spark.read.format("com.databricks.spark.csv").
  option("header", true).
  option("ignoreLeadingWhiteSpace", true).
  option("inferSchema", true).
  option("quote", "").
  option("delimiter", ";").
  load("bank_dataset.csv")
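But what I get is: data with quotes at both ends, and string values wrapped in doubled double quotes.
What I want is: age as int, and single quotes around the string values.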
If you still have this raw data and want to clean it up, you can use regexp_replace to remove every double quote ("):
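import org.apache.spark.sql.functions.{col, regexp_replace}
// df is the DataFrame read from the CSV (bank in the question).
// Strip " from every value and from every column name.
val expr = df.columns
  .map(c => regexp_replace(col(c), "\"", "").as(c.replaceAll("\"", "")))
df.select(expr: _*).show(false)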
Output:
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|age|job       |marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|y  |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
|58 |management|married|tertiary |no     |2143   |yes    |no  |unknown|5  |may  |261     |1       |-1   |0       |unknown |no |
+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+---+
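The question also asks for age as an int. regexp_replace returns a string column, so after the cleanup every column is a string and the numeric ones need an explicit cast. Below is a minimal sketch of that step, not a definitive solution. It assumes a local SparkSession, and the one-row DataFrame raw and the list intCols are made up for illustration; they are not from the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session for illustration; in spark-shell this reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("quote-cleanup-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for the badly parsed file: stray quotes in names and values.
val raw = Seq(("\"58", "\"management\"", "2143")).toDF("\"age", "\"job\"", "balance")

// Same cleanup as above: drop every double quote from values and column names.
val cleaned = raw.select(raw.columns.map { c =>
  regexp_replace(col(c), "\"", "").as(c.replaceAll("\"", ""))
}: _*)

// regexp_replace yields strings, so cast the numeric columns explicitly.
val intCols = Seq("age", "balance") // illustrative subset of the numeric columns
val typed = intCols.foldLeft(cleaned)((acc, c) => acc.withColumn(c, col(c).cast("int")))

typed.printSchema()
typed.show(false)
```

For the full dataset, intCols would presumably list every numeric column from the header (age, balance, day, duration, campaign, pdays, previous). A value that cannot be parsed as an integer becomes null after the cast under Spark's default (non-ANSI) settings, so it is worth checking for unexpected nulls.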