
Store each row of a pandas dataframe into a temporary dataframe

I am using scipy to compare different distance functions on data contained in pandas dataframes. For reference, I am checking the distance between different parts my company manufactures.

(This is obviously a toy example for this question. Sorry if something is incomplete; I am trying to make a Minimal, Reproducible Example.)

I have the test dataframe, x, which looks like this:

| part_number | make_buy_M | make_buy_B | alternate_Y | alternate_N | value |
|:-----------:|:----------:|:----------:|:-----------:|:-----------:|:-----:|
|      A      |      1     |      0     |      0      |      1      |  1065 |

I then have a large dataframe, data, which has exactly the same columns but contains many parts:

| part_number | make_buy_M | make_buy_B | alternate_Y | alternate_N | value |
|:-----------:|:----------:|:----------:|:-----------:|:-----------:|:-----:|
|      B      |      1     |      0     |      0      |      1      |  982  |
|      C      |      0     |      1     |      0      |      1      |   87  |
|      D      |      1     |      0     |      0      |      1      |  2342 |
|      E      |      0     |      1     |      1      |      0      | 56233 |

I have a function that loops through scipy distance metrics. What I would like to do is compare the x row to each row of the large dataframe and store those results in a dict.

import pandas as pd
import numpy as np
import scipy
from math import sqrt
from scipy import spatial  # makes scipy.spatial.distance available

# Resources:
#   - https://dataconomy.com/2015/04/implementing-the-five-most-popular-similarity-measures-in-python/

def euclidean_distance(x, y):
    return sqrt(sum(pow(a-b,2) for a, b in zip(x, y)))

def cosine_similarity(x,y):
    def square_rooted(x):
        return round(sqrt(sum([a*a for a in x])),3)

    numerator = sum(a*b for a,b in zip(x,y))
    denominator = square_rooted(x)*square_rooted(y)

    return round(numerator/float(denominator),3)

# Read in CSV
x = pd.read_csv('Test_Part_Directory')
y = pd.read_csv('Other_Parts_Directory')
metrics = ['cosine', 'euclidean']
euclidean_dict = {}
cosine_dict = {}


# How to loop through the y for this?

# for x in y.rows():
    # current_row = y[x]
    # Then do the below codes

for m in metrics:
    try:
        curr = scipy.spatial.distance.cdist(x.iloc[:,:], y.iloc[:,:], metric=m)
        print("Metric: {} | Score: {} ".format(m, curr))

        # Currently commented out:
        # if m == 'cosine':
        #     cosine_dict[part_number from dict] = curr
        # else:
        #     euclidean_dict[part_number from dict] = curr

    except Exception as exc:
        # Report the actual exception instead of silently hiding it
        print("Error calculating {}: {}".format(m, exc))

Ultimately, I am looking for two dicts that contain key-value pairs of part_number: metric_score, so something like:

euclidean_dict = {'B': 0.954, 'C': 0.233, 'D': 0.003, 'E': 0.012}

I have written the code above, which gets me to the current point, but have not gotten further.

I have looked at this question, but its answers advise against looping.
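For reference, the no-loop route can be sketched as follows. This is only a minimal sketch under stated assumptions: the two tables above are typed in by hand instead of being read from CSV, `feature_cols` and `results` are names made up for the illustration, and the identifier column is assumed to be called `part_number` as in the tables. Note that scipy's `'cosine'` metric is a cosine distance (1 minus the similarity), not the similarity itself.

```python
import pandas as pd
from scipy.spatial.distance import cdist

# Hypothetical helper list: every column except the identifier.
feature_cols = ["make_buy_M", "make_buy_B", "alternate_Y", "alternate_N", "value"]

# The tables from the question, typed in by hand (assumption: no CSV files here).
x = pd.DataFrame([["A", 1, 0, 0, 1, 1065]], columns=["part_number"] + feature_cols)
data = pd.DataFrame(
    [["B", 1, 0, 0, 1, 982],
     ["C", 0, 1, 0, 1, 87],
     ["D", 1, 0, 0, 1, 2342],
     ["E", 0, 1, 1, 0, 56233]],
    columns=["part_number"] + feature_cols,
)

results = {}
for m in ["cosine", "euclidean"]:
    # cdist returns a (1, len(data)) array: one distance per row of `data`.
    scores = cdist(
        x[feature_cols].to_numpy(dtype=float),
        data[feature_cols].to_numpy(dtype=float),
        metric=m,
    )[0]
    # Pair each part number with its score for this metric.
    results[m] = dict(zip(data["part_number"], scores))

cosine_dict = results["cosine"]
euclidean_dict = results["euclidean"]
```

Only the numeric columns are passed to `cdist`, so the part numbers never enter the distance computation; they are used solely as dict keys. With these toy rows, `euclidean_dict["B"]` is 83.0, because A and B differ only in `value` (1065 vs. 982).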

UPDATE - I tried the following:

for index, row in data.iterrows():
    part_number = data['PART_NO'].iloc[0]
    y = row.drop('PART_NO', axis=1)
    for m in metrics:
        try:
            curr = scipy.spatial.distance.cdist(x.iloc[:,:], y.iloc[:,:], metric=m)
            print("Part Number: {} | Metric: {} | Score: {} ".format(part_number, m, curr))
        except:
            print("Error calculating {}".format(m))

But received:

Traceback (most recent call last):
  File "distance_function.py", line 95, in <module>
    y = row.drop('PART_NO', axis=1)
  File "C:\Python367-64\lib\site-packages\pandas\core\series.py", line 4139, in drop
    errors=errors,
  File "C:\Python367-64\lib\site-packages\pandas\core\generic.py", line 3923, in drop
    axis_name = self._get_axis_name(axis)
  File "C:\Python367-64\lib\site-packages\pandas\core\generic.py", line 420, in _get_axis_name
    raise ValueError(f"No axis named {axis} for object type {cls}")
ValueError: No axis named 1 for object type <class 'pandas.core.series.Series'>

UPDATE 2 - I also tried the following:

part_number = data['PART_NO'].iloc[0]
temp = row.to_frame()
y = temp.drop('PART_NO', axis=1)

Yet I receive the same error.

It seems like your column names are now indices: each row yielded by iterrows() is a Series, which only has axis 0. So, if you want to drop 'PART_NO', you should do it with axis=0 :)
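A minimal sketch of that fix, assuming a small hand-made stand-in for `data` (the rows and the `features_by_part` dict are invented for illustration). It also reads the part number from the current `row` rather than from `data['PART_NO'].iloc[0]`, which would always return the first part:

```python
import pandas as pd

# Hypothetical stand-in for the `data` frame from the question.
data = pd.DataFrame({
    "PART_NO": ["B", "C"],
    "make_buy_M": [1, 0],
    "make_buy_B": [0, 1],
    "value": [982, 87],
})

features_by_part = {}
for _, row in data.iterrows():
    # `row` is a Series whose index holds the column names, so it only has axis 0.
    part_number = row["PART_NO"]
    features = row.drop("PART_NO")  # equivalent to row.drop("PART_NO", axis=0)
    # Turn the Series back into a one-row DataFrame, the 2-D shape cdist expects.
    features_by_part[part_number] = features.to_frame().T.astype(float)
```

Each value in `features_by_part` is a one-row frame of floats that could then be passed to `cdist` alongside the numeric columns of x.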
