
How to create a dictionary dynamically based on number of attributes?

I have a CSV file with 6 attributes and 1 class which I read with Pandas.

import pandas as pd

CsvFile = "/path/to/file.csv"
df = pd.read_csv(CsvFile)

First 5 rows of my CSV:

x,y,x1,y1,x2,y2,class
92,115,120,94,84,102,3
84,102,106,79,84,102,3
84,102,102,83,80,102,3
80,102,102,79,84,94,3
84,94,102,79,80,94,3

Since I have 6 attributes, I want to create a dictionary in Python that holds the centroids for k-means: one key per cluster (5 keys), each mapping to a list of 6 values, one per attribute.

import random

numberOfClusters = 5
centroids = {
    i+1: [random.uniform(0.0, 255.0), random.uniform(0.0, 255.0),
          random.uniform(0.0, 255.0), random.uniform(0.0, 255.0),
          random.uniform(0.0, 255.0), random.uniform(0.0, 255.0)]
    for i in range(numberOfClusters)
}

Question 1: Copy-pasting random.uniform(0.0, 255.0) once for every attribute in my CSV file is tedious and error-prone. How can I generate the right number of random values without copy-paste?

In a similar fashion, in the following code I calculate the Euclidean distance.

for i in centroids.keys():
    df['distance_from_{}'.format(i)] = (
        np.sqrt(
            (df['x'] - centroids[i][0]) ** 2
            + (df['y'] - centroids[i][1]) ** 2
            + (df['x1'] - centroids[i][2]) ** 2
            + (df['y1'] - centroids[i][3]) ** 2
            + (df['x2'] - centroids[i][4]) ** 2
            + (df['y2'] - centroids[i][5]) ** 2
        )
    )

Question 2: If I have more attributes, I have to add more (df['x'] - centroids[i][0]) ** 2 terms, and if I have fewer, I have to delete some. How can I automate this?

I am not using scikit-learn's KMeans because I want to calculate weights per cluster.

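If the number of keys is the problem, you can read it from the header line of the CSV:

n = 0
with open('filename.csv', 'r') as f:
    header = f.readline().strip()
    n = len(header.split(','))

Here n holds the number of columns in the header. For the file above that is 7, because the class column is counted too, so the number of attributes is n - 1.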
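Since the question already loads the file with pandas, the same count can also be taken from the DataFrame. The following is a minimal sketch, assuming the last column is always the class label; the inline csv_text is a hypothetical stand-in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file from the question.
csv_text = "x,y,x1,y1,x2,y2,class\n92,115,120,94,84,102,3\n"
df = pd.read_csv(io.StringIO(csv_text))

# Assumption: every column except the last one ("class") is an attribute.
number_of_attributes = len(df.columns) - 1
print(number_of_attributes)
```

This avoids opening the file a second time just to read its header.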

First question: replace your list with a list comprehension:

[random.uniform(0.0, 255.0) for _ in range(6)]
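To avoid hard-coding the 6 as well, the length of each centroid can follow the DataFrame's columns. This is a sketch under the assumption that every column except the last ("class") is an attribute; the small df and the name feature_columns are illustrative and not part of the original code.

```python
import random

import pandas as pd

# Hypothetical stand-in for the CSV from the question (first two rows only).
df = pd.DataFrame(
    [[92, 115, 120, 94, 84, 102, 3], [84, 102, 106, 79, 84, 102, 3]],
    columns=["x", "y", "x1", "y1", "x2", "y2", "class"],
)

# Assumption: the last column is the class label, the rest are attributes.
feature_columns = list(df.columns[:-1])
numberOfClusters = 5

# One random coordinate per attribute, for each cluster key 1..numberOfClusters.
centroids = {
    i + 1: [random.uniform(0.0, 255.0) for _ in feature_columns]
    for i in range(numberOfClusters)
}
```

If the CSV gains or loses attribute columns, the centroids change length with it and nothing else needs editing.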

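Second question:

np.sqrt(((df[df.columns[:6]] - centroids[i]) ** 2).sum(axis=1))

should work. Here df.columns[:6] selects the six attribute columns, and axis=1 sums the squared differences within each row so that you get one distance per row.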
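Put into the loop from the question, the idea might look like the sketch below. It assumes the attribute columns are everything except the last column and uses two fixed, hypothetical centroids instead of random ones so the result is easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV from the question (first two rows only).
df = pd.DataFrame(
    [[92, 115, 120, 94, 84, 102, 3], [84, 102, 106, 79, 84, 102, 3]],
    columns=["x", "y", "x1", "y1", "x2", "y2", "class"],
)

# Take the attribute names once, before any distance columns are added.
feature_columns = list(df.columns[:-1])

# Hypothetical fixed centroids; in the question these are random.
centroids = {
    1: [92.0, 115.0, 120.0, 94.0, 84.0, 102.0],
    2: [0.0] * len(feature_columns),
}

features = df[feature_columns].to_numpy(dtype=float)
for i, centroid in centroids.items():
    # Broadcasting subtracts the centroid from every row at once.
    diff = features - np.asarray(centroid)
    df["distance_from_{}".format(i)] = np.sqrt((diff ** 2).sum(axis=1))
```

Because feature_columns is computed before the loop, the new distance_from_ columns never leak into the distance calculation.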
