
How to count the number of rows with special characters in all columns of a PySpark DataFrame?

Assume that I have a PySpark DataFrame. Some of the cells contain only special characters.

Sample dataset:

import pandas as pd 

data = {'ID': [1, 2, 3, 4, 5, 6],
        'col_1': ['A', '?', '<', ' ?', None, 'A?'],
        'col_2': ['B', ' ', '', '?>', 'B', '\B']
       }

pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)


I want to count, for each column, the number of rows that contain only special characters (excluding blank cells). Values like 'A?' and '\B', as well as blank cells, are not counted.

The expected output will be:

{'ID': 0, 'col_1': 3, 'col_2': 1}

Is there any way to do that?

Taking your sample dataset, this should do it:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.getOrCreate()

data = {'ID': [1, 2, 3, 4, 5, 6],
        'col_1': ['A', '?', '<', ' ?', None, 'A?'],
        'col_2': ['B', ' ', '', '?>', 'B', '\B']
       }

pdf = pd.DataFrame(data)
df = spark.createDataFrame(pdf)

res = {}
for col_name in df.columns:
    # A cell counts only if it contains at least one non-alphanumeric,
    # non-whitespace character and no alphanumeric character at all.
    df = df.withColumn(
        'matched',
        when(col(col_name).rlike(r'[^A-Za-z0-9\s]') & ~col(col_name).rlike('[A-Za-z0-9]'), True)
        .otherwise(False)
    )
    res[col_name] = df.select('ID').where(df.matched).count()

print(res)

The trick is to combine two regular-expression conditions: the cell must contain at least one special (non-alphanumeric, non-whitespace) character, and it must not contain any alphanumeric character. That excludes mixed values like 'A?' and '\B' while also skipping blank cells.
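
On a large DataFrame, overwriting the 'matched' column and calling count() once per column triggers one Spark job per column. Below is a minimal sketch, assuming the same column names and the same regex logic, that computes all the counts in a single aggregation pass; the helper name special_only is just for illustration:

from pyspark.sql.functions import col, when, sum as spark_sum

def special_only(c):
    # True when the cell has at least one character that is neither
    # alphanumeric nor whitespace, and no alphanumeric character at all
    return col(c).rlike(r'[^A-Za-z0-9\s]') & ~col(c).rlike('[A-Za-z0-9]')

# Sum a 0/1 flag for every column in one pass over the data
agg_exprs = [spark_sum(when(special_only(c), 1).otherwise(0)).alias(c) for c in df.columns]
res = df.agg(*agg_exprs).first().asDict()
print(res)  # expected: {'ID': 0, 'col_1': 3, 'col_2': 1}

Null cells are handled automatically: rlike returns null for them, so otherwise(0) applies and they are not counted.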
