
Splitting/reading CSV file by distinct row

I have a CSV file with 3 columns.

Key,Branch,Account 
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905

I would like to iterate through each row, find the distinct values in the Key column (a, b and c), and split the rows into 3 separate PySpark DataFrames, one per key.

   a,213,234567
   a,454,457900
   a,562,340094
   a,200,456704


   b,400,850988
   b,590,344433


   c,565,678635
   c,300,453432
   c,555,563546
   c,001,660905

Something like this?

csv_string = """Key,Branch,Account 
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905"""

import csv
import io

#
# 1. Parse csv_string into a list of row dicts (OrderedDicts before Python 3.8)
#

def parse_csv(string):
    # if you are reading from a file you don't need to do this
    # StringIO nonsense -- just pass the file to csv.DictReader()
    string_file = io.StringIO(string)
    reader = csv.DictReader(string_file)
    return list(reader)

csv_table = parse_csv(csv_string)

#
# 2. Loop through each line of the table and get the key
#  - If we have seen the key before, put the line in the list
#    with other lines that had the same key
#  - If not, start a new list for that key
#

result = {}

for line in csv_table:
    key = line["Key"].strip()
    print(key, ":", line)
    if key in result:
        result[key].append(line)
    else:
        result[key] = [line]

#
# 3. Finally, print the result.
# The lines will probably be easier to deal with if you keep them 
# in their parsed form, but for readability we can join the values
# of the line back into a string with commas
#

print(result)
print("")

for key_list in result.values():
    for line in key_list:
        print(",".join(line.values()))
    print("")
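The grouping loop above can also be written more compactly with collections.defaultdict, which creates the empty list the first time a key is seen. This is only a sketch of the same idea, not a different method: the helper name group_rows_by_key and the variable csv_text are made up for illustration, and the sample data is the table from the question with a clean header row.

```python
import csv
import io
from collections import defaultdict

# Sample data copied from the question (header written without trailing spaces).
csv_text = """Key,Branch,Account
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905
"""


def group_rows_by_key(text, key_column="Key"):
    """Return {key: [row dicts]} in order of first appearance of each key.

    Hypothetical helper for illustration; it is not part of any library.
    """
    reader = csv.DictReader(io.StringIO(text))
    groups = defaultdict(list)  # a missing key starts out as an empty list
    for row in reader:
        groups[row[key_column].strip()].append(row)
    return dict(groups)


groups = group_rows_by_key(csv_text)

# Print each group as comma-separated lines, with a blank line between groups.
for rows in groups.values():
    for row in rows:
        print(",".join(row.values()))
    print("")
```

Because the csv module reads every field as a string, the branch value 001 keeps its leading zeros here.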

You can do the same with the pandas library, which also lets you perform further operations with minimal code. See the pandas documentation for details.

Here is the code to get the desired output. The data is stored in a dictionary, so you can get the rows for a key with groups_by_key[key], e.g. groups_by_key["a"].

import pandas

df = pandas.read_csv("data.csv", delimiter=",")

keys = df["Key"].unique()  # All unique keys from the CSV, in order of first appearance

grouped = df.groupby("Key")  # Group rows by the value of the Key column

groups_by_key = {}  # Maps each key to its rows (named to avoid shadowing the built-in dict)
for key in keys:
    groups_by_key[key] = grouped.get_group(key).values.tolist()

for key in keys:
    print("{} : {}".format(key, groups_by_key[key]))

Output :

a : [['a', 213, 234567], ['a', 454, 457900], ['a', 562, 340094], ['a', 200, 456704]]

b : [['b', 400, 850988], ['b', 590, 344433]]

c : [['c', 565, 678635], ['c', 300, 453432], ['c', 555, 563546], ['c', 1, 660905]]
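Note that in the output above the branch 001 is printed as 1, because pandas infers an integer type for that column. If the leading zeros matter, one option is to read every column as text with dtype=str. The following is a minimal sketch under that assumption; it reads from an in-memory string (csv_text, a name chosen here for illustration) instead of data.csv so that it is self-contained, and it keeps each group as its own DataFrame rather than a list of lists.

```python
import io

import pandas as pd

# Sample data copied from the question (header written without trailing spaces).
csv_text = """Key,Branch,Account
a,213,234567
a,454,457900
a,562,340094
a,200,456704
b,400,850988
b,590,344433
c,565,678635
c,300,453432
c,555,563546
c,001,660905
"""

# dtype=str keeps every column as text, so "001" is not parsed as the integer 1.
df = pd.read_csv(io.StringIO(csv_text), dtype=str)

# Iterating over a groupby yields (key, sub-DataFrame) pairs, sorted by key.
frames = {key: group.reset_index(drop=True) for key, group in df.groupby("Key")}

for key, frame in frames.items():
    print(key)
    print(frame.to_string(index=False))
    print("")
```

With a real file, the io.StringIO(csv_text) argument would be replaced by the file path, as in the answer above.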
