
Counting number of commas in every row of a file

I have a file which looks like -

Col1,Col2,Col3,Col4
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24
..
..

I have loaded the file into a PySpark DataFrame (I cannot use plain Python because the dataset is huge)

w1 = spark.read.format('csv').options(header='false', inferschema='false').load('./part1')

I want to check whether every row has the same number of commas. Is there a way to output the rows whose comma count is not equal to 3?

Since you also want to know which lines are erroneous, the only way is to loop over the file:

In [18]: erroneous_lines = []

In [19]: with open(r'C:\Users\abaskaran\Desktop\mycsv.txt') as fd:
    ...:     for line_num, line in enumerate(fd,1):
    ...:         if len(line.split(',')) != 4:
    ...:             erroneous_lines.append((line_num, line))


In [20]: erroneous_lines
Out[20]:
[(5, 'Value21,Value22,Value23,Value24Value11,Value12,Value13,Value14\n'),
 (6, 'Value21,Value22,Value23\n')]

The erroneous_lines list will contain tuples of the line number and the actual content of each line that doesn't have all the values.
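The same check can also be written by counting commas directly instead of splitting. A minimal sketch; the function name and path are placeholders, not part of the answer above:

```python
def find_bad_lines(path, expected_commas=3):
    """Return (line_number, line) pairs whose comma count differs
    from expected_commas."""
    bad = []
    with open(path) as fd:
        for line_num, line in enumerate(fd, 1):
            # Count commas directly; a well-formed row has exactly
            # expected_commas of them.
            if line.count(',') != expected_commas:
                bad.append((line_num, line))
    return bad
```

This avoids building a throwaway list for every line, which matters a little on very large files.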

I modified the CSV content as below just for testing:

Col1,Col2,Col3,Col4
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24Value11,Value12,Value13,Value14
Value21,Value22,Value23
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24
Value11,Value12,Value13,Value14
Value21,Value22,Value23,Value24

Read the CSV file as text, split each line on , and count the elements. Note that a row with 3 commas splits into 4 elements, so valid rows have a count of 4.

df = spark.read.text('test.csv')
df.show(10, False)

+-------------------------------+
|value                          |
+-------------------------------+
|Col1,Col2,Col3,Col4            |
|Value11,Value12,Value13,Value14|
|Value21,Value22,Value23,Value24|
+-------------------------------+

import pyspark.sql.functions as F

df2 = df.withColumn('count', F.size(F.split('value', ',')))
df2.show(10, False)

+-------------------------------+-----+
|value                          |count|
+-------------------------------+-----+
|Col1,Col2,Col3,Col4            |4    |
|Value11,Value12,Value13,Value14|4    |
|Value21,Value22,Value23,Value24|4    |
+-------------------------------+-----+

df2.groupBy().agg(F.min('count'), F.max('count')).show(10, False)

+----------+----------+
|min(count)|max(count)|
+----------+----------+
|4         |4         |
+----------+----------+
