
Group by to count occurrences of unique ID/itemName across multiple CSV files and sum up value in another column using pandas/python

I'm currently trying to do the following:

  1. Check the number of times "id"/"itemName" occurs across the different CSV files in mypath (all files are formatted the same [see sampleCSVFile below]; id and itemName are unique within each file, but may or may not repeat across files)
  2. Sum value for the "Entries" column for a given id/itemName across all the files.
  3. Get the highest/lowest number in "Entries" column across all the files for a given id/itemName

sampleCSVFile in my mypath:

id    itemName    NonImportantEntries    Entries    SomeOtherEntries
1      item1              27              111             163
2      item2              16               22              98

My Code so far:

import pandas as pd

from os import listdir
from os.path import isfile, join

#readMyDirectoryForFiles
mypath= <myDirectoryPath>
   
list_of_files = [f for f in listdir(mypath) if isfile(join(mypath, f))]

#GenerateAllFilesInDirList
list_of_files = [mypath+x for x in list_of_files]

#LoadAllFilesToPandasDFs
dataframes = [pd.read_csv(fi) for fi in list_of_files]
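As an aside, the listdir/isfile/join steps above can be collapsed into one call with the standard-library `glob` module, which also filters to `*.csv` so non-CSV files in the directory are skipped. A minimal, self-contained sketch (it builds two throwaway sample files in a temp directory to stand in for the real ones in mypath):

```python
import glob
import os
import tempfile

import pandas as pd

# Stand-in for mypath: a temp directory with two small sample CSVs
mypath = tempfile.mkdtemp()
pd.DataFrame({"id": [1, 2], "itemName": ["item1", "item2"],
              "Entries": [111, 22]}).to_csv(os.path.join(mypath, "a.csv"), index=False)
pd.DataFrame({"id": [1, 3], "itemName": ["item1", "item3"],
              "Entries": [50, 7]}).to_csv(os.path.join(mypath, "b.csv"), index=False)

# glob handles the path join and the *.csv filter in one step
list_of_files = glob.glob(os.path.join(mypath, "*.csv"))
dataframes = [pd.read_csv(fi) for fi in list_of_files]
```

This also avoids the pitfall in `mypath+x`, which silently produces wrong paths if `mypath` does not end with a separator.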

#Count id/itemName occurrences across all files and sum the "Entries" column, but also find the highest/lowest value in "Entries" and record it for the output file -> this step is really hard for me.

#Tried the below but received -> AttributeError: 'list' object has no attribute 'groupby'. Also not sure where to start, or how to capture the highest/lowest "Entries" value across all the files.
count_occurences = dataframes.groupby(['id', 'itemName']).sum()['entries']



...
#save file in the following format
df.to_csv('finalFile.csv', index = False)

Expected finalFile.csv format:

    id  itemName    noOfOccurencessInFiles  TotalEntriesFromAllFiles    HighestEntriesNumber    LowestEntriesNumber
    1     item1            11                       3500                       1200                      0

Use `concat` to combine the list of DataFrames into one, and pass `as_index=False` so `id` and `itemName` stay as regular columns (note the column in your files is `Entries`, capitalized):

final = pd.concat(dataframes).groupby(['id', 'itemName'], as_index=False)['Entries'].sum()

final.to_csv('finalFile.csv', index = False)
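The snippet above only produces the sum. To get all four output columns from the expected format (occurrence count, total, highest, lowest) in one pass, named aggregation with `.agg` works; it requires pandas >= 0.25. A sketch using two stand-in DataFrames in place of the ones read from disk:

```python
import pandas as pd

# Stand-ins for the per-file DataFrames loaded with pd.read_csv
dataframes = [
    pd.DataFrame({"id": [1, 2], "itemName": ["item1", "item2"], "Entries": [111, 22]}),
    pd.DataFrame({"id": [1], "itemName": ["item1"], "Entries": [50]}),
]

final = (
    pd.concat(dataframes)
    .groupby(["id", "itemName"], as_index=False)
    .agg(
        noOfOccurencessInFiles=("Entries", "count"),  # files the pair appears in
        TotalEntriesFromAllFiles=("Entries", "sum"),
        HighestEntriesNumber=("Entries", "max"),
        LowestEntriesNumber=("Entries", "min"),
    )
)
final.to_csv("finalFile.csv", index=False)
```

The `count` aggregation matches "number of files the pair occurs in" only because id/itemName is unique within each file, which the question guarantees.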
