
How to conditionally partition a pandas dataframe

I am working on a program that partitions a dataset with pandas. A related question does not solve my problem. The program uses segmentation by natural partitioning. The goal is to:

  1. calculate the 5th percentile
  2. calculate the 95th percentile
  3. sort the data
  4. partition the dataset so that only the values from position floor(n * 0.05) to floor(n * 0.95) remain.
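
The four steps above can be sketched as follows. This is only a minimal sketch under stated assumptions: the function name `trim_by_rank` and its parameters are illustrative, the data is the A2 column from the sample further down, and "from floor(n * 0.05) to floor(n * 0.95)" is read as a half-open range of positions in the sorted data.

```python
import math

import pandas as pd


def trim_by_rank(df, column, lower=0.05, upper=0.95):
    """Sort by `column` and keep sorted positions floor(n*lower) up to floor(n*upper)."""
    # Hypothetical helper for illustration; not part of pandas.
    ordered = df.sort_values(column, kind="stable").reset_index(drop=True)
    n = len(ordered)
    start = math.floor(n * lower)
    stop = math.floor(n * upper)  # treated as exclusive (an assumption)
    return ordered.iloc[start:stop]


sample = pd.DataFrame({"A2": [0.4631338, 0.7460648, 0.264391038, 0.4406713,
                              0.410438159, 0.302901816, 0.275869396, 0.084782428]})
trimmed = trim_by_rank(sample, "A2")
print(trimmed)
```

With only eight rows, floor(8 * 0.05) is 0 and floor(8 * 0.95) is 7, so this positional reading drops just the largest value. Note that this selects by position in the sorted data, which is different from comparing each value against the percentile values themselves.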

I've written a method that processes the data. Previously, I was using:

def segmentation_by_natural_partitioning(attribute):
    print(attribute.head())
    a = np.array(attribute)

    # calculate 5th and 95th percentiles.
    fith_percentile = np.percentile(a, 5)
    nienty_fith_percentile = np.percentile(a, 95) 

    # sort the data.
    sorted_data = np.sort(a)
    n = a.size
    # keep the values from floor(n*0.05) to floor(n*0.95)
    new_a = split(a, (a > np.math.floor(n*fith_percentile)) & (a < np.math.floor(n*nienty_fith_percentile)))
    return attribute

I'd like to replace

    new_a = split(a, (a > np.math.floor(n*fith_percentile)) & (a < np.math.floor(n*nienty_fith_percentile)))

with

s = s[(s['A2'] > np.math.floor(n*fith_percentile)) and (s['A2'] <= np.math.floor(n*nienty_fith_percentile))]

The full program is written like so

from numpy.core.defchararray import count
import pandas as pd
import numpy as np


def print_full(x):
    pd.set_option('display.max_rows', len(x))
    print(x)
    pd.reset_option('display.max_rows')

def main():
    s = pd.read_csv('A1-dm.csv')
    # entropy_discretization(df['A1'])
    segmentation_by_natural_partitioning(s)

# This method discretizes attribute A1.
# If the information gain is 0, i.e. the number of distinct classes is 1, or
# if min f / max f < 0.5 and the number of distinct values is floor(n/2),
# then that partition stops splitting.
def entropy_discretization(s):
    # pick a threshold
    threshold = 6
    print(segmentation_by_natural_partitioning(s))
    print(s.head())


def segmentation_by_natural_partitioning(s):
    a = np.array(s)

    # calculate 5th and 95th percentiles.
    fith_percentile = np.percentile(a, 5)
    nienty_fith_percentile = np.percentile(a, 95) 

    # sort the data.
    sorted_data = np.sort(a)
    n = a.size
    # keep the values from floor(n*0.05) to floor(n*0.95)
    s = s[(s['A2'] > np.math.floor(n*fith_percentile)) and (s['A2'] <= np.math.floor(n*nienty_fith_percentile))]

    return s


main()

A sample of the dataset is provided here

A1,A2,A3,Class
2,0.4631338,1.5,3
8,0.7460648,3.0,3
6,0.264391038,2.5,2
5,0.4406713,2.3,1
2,0.410438159,1.5,3
2,0.302901816,1.5,2
6,0.275869396,2.5,3
8,0.084782428,3.0,3

When I try to run my code I get the following error:

f"The truth value of a {type(self).__name__} is ambiguous. "
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I am specifically looking for a way to partition the dataset via pandas. Any help would be greatly appreciated.
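
The error itself comes from the `and` operator: Python evaluates `x and y` by asking for a single truth value of `x`, and a Series with several elements does not have one. A minimal reproduction, using a made-up Series and made-up thresholds in place of the real data, shows the failure next to the element-wise `&` operator:

```python
import pandas as pd

values = pd.Series([0.46, 0.75, 0.26, 0.44, 0.08], name="A2")  # illustrative data
low, high = 0.1, 0.7  # illustrative thresholds, not real percentiles

# `and` needs one True/False from its left operand, which a Series cannot give.
try:
    mask = (values > low) and (values <= high)
except ValueError as exc:
    error_message = str(exc)
    print("ValueError:", error_message)

# `&` combines the two comparisons element by element. The parentheses are
# required because `&` binds more tightly than `>` and `<=`.
mask = (values > low) & (values <= high)
print(values[mask])
```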

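The answer was simple. Python's `and` cannot combine two boolean Series, so I applied the two conditions as separate filters, comparing A2 directly against the percentile values, one after the other: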

 s = s[s['A2'] > fith_percentile]
 s = s[s['A2'] < nienty_fith_percentile]
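
The same filter can also be written as a single expression with `&`. The sketch below is one possible way to package it, not a definitive implementation: the function name `keep_between_percentiles` and its parameters are assumptions, the CSV text is the sample from the question read from a string instead of `A1-dm.csv`, and the percentiles are computed on the filtered column only rather than on the whole frame.

```python
import io

import numpy as np
import pandas as pd

# Sample rows from the question, inlined so the example runs without a file.
CSV_SAMPLE = """A1,A2,A3,Class
2,0.4631338,1.5,3
8,0.7460648,3.0,3
6,0.264391038,2.5,2
5,0.4406713,2.3,1
2,0.410438159,1.5,3
2,0.302901816,1.5,2
6,0.275869396,2.5,3
8,0.084782428,3.0,3
"""


def keep_between_percentiles(df, column, lower=5, upper=95):
    """Keep rows whose `column` value lies strictly between two percentiles."""
    # Hypothetical helper for illustration; not part of pandas.
    low = np.percentile(df[column], lower)
    high = np.percentile(df[column], upper)
    mask = (df[column] > low) & (df[column] < high)
    return df[mask]


s = pd.read_csv(io.StringIO(CSV_SAMPLE))
result = keep_between_percentiles(s, "A2")
print(result)
```

On this eight-row sample both percentiles fall between neighbouring values, so the smallest and the largest A2 rows are removed and the other columns are carried along unchanged.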

