简体   繁体   中英

How to find/filter/combine based on common prefix in rows and columns with use of python/pandas?

I'm new to coding and having a hard time expressing/searching for the correct terms to help me along with this task. In my work I get some pretty large excel-files from people out in the field monitoring birds. The results need to be prepared for databases, reports, tables and more. I was hoping to use Python to automate some tasks for this.

How can I use Python (pandas?) to find certain rows/columns based on a common name/ID but with a unique suffix, and aggregate/sum the results that belongs together under that common name? As an example in the table provided I need get all the results from sub-localities eg AA3_f, AA3_lf and AA3_s expressed as the sum (total of gulls for each species) of the subs in a new row for the main Locality AA3.

表/数据框示例

Can someone please provide some code for this task, or help me in some other way? I have searched and watched many tutorials on python, numpy, pandas and also matplotlib.. still clueless on how to set this up

any help appreciated Thanks!

Update:

@Harsh Nagouda, thanks for your reply. I tried your example using groupby function, but I having trouble dividing into correct groups. The "Locality" column has only unique values/ID because they all have a suffix (they are sub categories).

I tried to solve this by slicing the strings:

eng.Locality.str.slice(0,4,1)

i managed to slice off the suffices so that the remainders = AA3_, AA4_ and so on. Then i tried to do this slicing in the groupby function. That failed. Then I tried to slice using pandas.Dataframe.apply(). That failed as well.

eng["Locality"].apply(eng.Locality.str.slice(0,4,1))
sum = eng.groupby(["Locality"].str.slice(0,4,1)).sum()

Any more help out there? As you can see above - I need it:-)

In your case, the pd.groupby option seems to be a good fit for the problem. The groupby function does exactly what it means, it groups parts of the dataframe you like it to.

Since you mentioned a case based on grouping by localities and finding the sum of those values, this snippet should help you out:

sum = eng.groupby(["Locality"]).sum()

Additional commands and sorting styles can be found here:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

I finally figured out a way to get it done. Maybe not the smoothest way, but at least I get the end result I need:

  1. Edited the Locality-ID to remove suffix: eng["Locality"]=eng["Locality].str.slice(0,4,1)

  2. Used the groupby function: sum = eng.groupby(["Locality"]).sum()

  3. End result: Table

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM