
Counting number of commas in every row of a file

I have a file which looks like -

Col1,Col2,Col3,Col4
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24
..
..

I have loaded the file into a PySpark DataFrame (I cannot use plain Python because the dataset is huge)

w1 = spark.read.format('csv').options(header='false', inferschema='false').load('./part1')

I want to check whether every row has the same number of commas. Is there a way to output the rows whose comma count is not equal to 3?

Since you also want to know which lines are erroneous, the only way is to loop over the file:

In [18]: erroneous_lines = []

In [19]: with open(r'C:\Users\abaskaran\Desktop\mycsv.txt') as fd:
    ...:     for line_num, line in enumerate(fd,1):
    ...:         if len(line.split(',')) != 4:
    ...:             erroneous_lines.append((line_num, line))


In [20]: erroneous_lines
Out[20]:
[(5, 'Value21,Value22,Value23,Value24Value11,Value12,Value13,Value14\n'),
 (6, 'Value21,Value22,Value23\n')]

The erroneous_lines list will contain tuples of the line number and the actual content of each line that doesn't have all the values.
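The same check can also be written by counting commas directly instead of splitting. A minimal sketch; the function name and path are placeholders, not part of the answer above:

```python
def find_bad_lines(path, expected_commas=3):
    """Return (line_number, line) pairs whose comma count differs
    from expected_commas."""
    bad = []
    with open(path) as fd:
        for line_num, line in enumerate(fd, 1):
            # Count commas directly; a well-formed row has exactly
            # expected_commas of them.
            if line.count(',') != expected_commas:
                bad.append((line_num, line))
    return bad
```

This avoids building a throwaway list for every line, which matters a little on very large files.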

I modified the CSV content as below just for testing:

Col1,Col2,Col3,Col4
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24Value11,Value12,Value13,Value14
Value21,Value22,Value23
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24

Read the CSV file as text, split each line on , and count the elements. Note that a row with 3 commas splits into 4 elements, so valid rows have a count of 4.

df = spark.read.text('test.csv')
df.show(10, False)

+-------------------------------+
|value                          |
+-------------------------------+
|Col1,Col2,Col3,Col4            |
|Value11,Value12,Value13,Value14|
|Value21,Value22,Value23,Value24|
+-------------------------------+

import pyspark.sql.functions as F

df2 = df.withColumn('count', F.size(F.split('value', ',')))
df2.show(10, False)

+-------------------------------+-----+
|value                          |count|
+-------------------------------+-----+
|Col1,Col2,Col3,Col4            |4    |
|Value11,Value12,Value13,Value14|4    |
|Value21,Value22,Value23,Value24|4    |
+-------------------------------+-----+

df2.groupBy().agg(F.min('count'), F.max('count')).show(10, False)

+----------+----------+
|min(count)|max(count)|
+----------+----------+
|4         |4         |
+----------+----------+
