
How to assign unique identifier to DataFrame row

I have a .csv file that is created from an nd.array after the input data is processed by sklearn.cluster.DBSCAN(). I would like to be able to "tie" every point in a cluster to a unique identifier given by a column in my input file.

This is how I'm reading my input_data:

# Generate sample data
col_1 = "RL15_LONGITUDE"
col_2 = "RL15_LATITUDE"
data = pd.read_csv("2004_Charley_data.csv")
data = data[[col_1, col_2]].dropna()          # drop incomplete rows before converting
coords = data.to_numpy()                      # .as_matrix() was removed in pandas 1.0
data = coords.astype('float16', copy=False)   # note: float16 loses coordinate precision

And this is what it looks like:

RecordID                  Storm         RL15_LATITUDE   RL15_LONGITUDE
2004_Charley95104-257448  2004_Charley  25.81774        -80.25079
2004_Charley93724-254950  2004_Charley  26.116338       -81.74986
2004_Charley93724-254949  2004_Charley  26.116338       -81.74986
2004_Charley75496-215198  2004_Charley  26.11817        -81.75756

With some help I was able to take the output of DBSCAN and save it as a .csv file like this:

clusters = (pd.concat([pd.DataFrame(c, columns=[col_2, col_1]).assign(cluster=i)
                       for i, c in enumerate(clusters)])
              .reset_index()
              .rename(columns={'index': 'point'})
              .set_index(['cluster', 'point'])
           )
clusters.to_csv('output.csv')

My output is now multi-indexed, but is there a way I could change the point column to RecordID instead of just a number?

cluster point   RL15_LATITUDE   RL15_LONGITUDE
0       0   -81.0625    29.234375
0       1   -81.0625    29.171875
0       2   -81.0625    29.359375
1       0   -81.0625    29.25
1       1   -81.0625    29.21875
1       2   -81.0625    29.25
1       3   -81.0625    29.21875
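The identifier is lost because the data round-trips through a bare ndarray. Since DBSCAN returns one label per input row, in the same order, the labels can be attached to the original frame directly and the identifier kept as an index level. A minimal sketch with made-up stand-in rows (column names follow the question):

```python
import pandas as pd

# stand-in for the real CSV rows from the question
data = pd.DataFrame({'RecordID': ['2004_Charley95104-257448',
                                  '2004_Charley93724-254950'],
                     'RL15_LATITUDE': [25.81774, 26.116338],
                     'RL15_LONGITUDE': [-80.25079, -81.74986]})

labels = [0, 1]  # stand-in for DBSCAN(...).fit(...).labels_, one per row
clusters = data.assign(cluster=labels).set_index(['cluster', 'RecordID'])
```

Here `set_index(['cluster', 'RecordID'])` builds the same two-level index as before, with RecordID taking the place of the integer point level.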

UPDATE:

Code:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

fn = r'D:\temp\.data\2004_Charley_data.csv'
df = pd.read_csv(fn)

cols = ['RL15_LONGITUDE','RL15_LATITUDE']
eps_=4
min_samples_=13

# eps is given in km; dividing by the Earth's mean radius (~6371 km) converts
# it to radians, the unit the haversine metric works in
db = DBSCAN(eps=eps_ / 6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

df['cluster'] = labels

res = df[df.cluster >= 0]

print('--------------------------------')
print(res)
print('--------------------------------')
print(res.cluster.value_counts())

Output:

--------------------------------
                     RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
5    2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
13   2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
18   2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
19   2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
22   2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
..                        ...           ...            ...             ...      ...
812  2004_Charley94314-256039  2004_Charley      28.287827      -81.353285        1
813  2004_Charley93913-255344  2004_Charley      26.532980      -82.194400        7
814  2004_Charley93913-255346  2004_Charley      27.210467      -81.863720        5
815  2004_Charley93913-255357  2004_Charley      26.935550      -82.054447        4
816  2004_Charley93913-255354  2004_Charley      26.935550      -82.054447        4

[688 rows x 5 columns]
--------------------------------
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64
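From here, the shape the question asked for (cluster plus record identifier as a multi-index) is one `set_index` away, since `res` already carries both columns. A minimal sketch with made-up rows standing in for `res`:

```python
import pandas as pd

# stand-in for res: original columns plus the assigned cluster label
res = pd.DataFrame({'RecordID': ['id-1', 'id-2', 'id-3'],
                    'RL15_LATITUDE': [29.23, 29.44, 28.31],
                    'RL15_LONGITUDE': [-81.03, -81.11, -81.42],
                    'cluster': [0, 0, 1]})

out = res.set_index(['cluster', 'RecordID']).sort_index()
# out now has a (cluster, RecordID) MultiIndex instead of (cluster, point),
# and out.to_csv('output.csv') round-trips it the same way as before
```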

Old answer:

If I understood your code correctly, you can do it this way:

# read CSV (you have provided a space-delimited file with one unnamed column,
# so I have converted it to something similar to the data in your question)
fn = r'D:\temp\.data\2004_Charley_data.csv'

df = pd.read_csv(fn, sep='\s+', index_col=0)
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
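The two index lines above rebuild the full record identifier: the unnamed first column (picked up by `index_col=0`) holds the storm prefix, and concatenating it with the RecordID column restores the original ID string. A toy illustration with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'RecordID': ['95104-257448', '93724-254950'],
                   'Storm': ['2004_Charley', '2004_Charley']},
                  index=['2004_Charley', '2004_Charley'])

# element-wise string concatenation: prefix from the index + RecordID column
df.index = df.index.values + df.RecordID.map(str)
del df['RecordID']
# index is now ['2004_Charley95104-257448', '2004_Charley93724-254950']
```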

first 10 rows:

In [148]: df.head(10)
Out[148]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE
RecordID
2004_Charley67146-196725  2004_Charley      33.807550      -78.701172
2004_Charley73944-211790  2004_Charley      33.618435      -78.993407
2004_Charley73944-211793  2004_Charley      28.609200      -80.818880
2004_Charley73944-211789  2004_Charley      29.383210      -81.160100
2004_Charley73944-211786  2004_Charley      33.691235      -78.895129
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440
2004_Charley73944-211795  2004_Charley      28.357253      -80.701632
2004_Charley73944-211792  2004_Charley      34.204490      -77.924700
2004_Charley66636-195501  2004_Charley      33.436717      -79.132074
2004_Charley66631-195496  2004_Charley      33.646292      -78.977968

clustering:

cols = ['RL15_LONGITUDE','RL15_LATITUDE']

eps_=4
min_samples_=13

db = DBSCAN(eps=eps_ / 6371., min_samples=min_samples_,
            algorithm='ball_tree', metric='haversine').fit(np.radians(df[cols]))

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
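The `eps_ / 6371.` above converts the neighbourhood radius from kilometres to radians (the unit the haversine metric works in) by dividing by the Earth's mean radius. A small stdlib-only sanity check of that conversion, using a plain haversine formula and two coordinate pairs taken from the sample output (a sketch, not part of the answer's code):

```python
import math

EARTH_RADIUS_KM = 6371.0
eps_km = 4
eps_rad = eps_km / EARTH_RADIUS_KM   # the value actually passed to DBSCAN(eps=...)

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat = p2 - p1
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# two records from cluster 0 in the sample output
d = haversine_km(29.22856, -81.03444, 29.27094, -81.09730)
```

One caveat worth knowing: scikit-learn's haversine metric expects input in (latitude, longitude) order, so the column order passed to `fit` matters.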

setting the cluster info on our DF - we can simply assign it, as labels has the same length as our DF:

df['cluster'] = labels

filter: keep only those rows where cluster >= 0 (DBSCAN marks noise points with the label -1):

res = df[df.cluster >= 0]

result:

In [152]: res.head(10)
Out[152]:
                                 Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
RecordID
2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1

stats:

In [151]: res.cluster.value_counts()
Out[151]:
1    217
0    170
2    145
4     94
7     18
6     16
5     14
3     14
Name: cluster, dtype: int64

if you don't want to have RecordID as the index:

In [153]: res = res.reset_index()

In [154]: res.head(10)
Out[154]:
                   RecordID         Storm  RL15_LATITUDE  RL15_LONGITUDE  cluster
0  2004_Charley73944-211787  2004_Charley      29.228560      -81.034440        0
1  2004_Charley72308-208134  2004_Charley      29.442692      -81.109528        0
2  2004_Charley68044-198941  2004_Charley      29.442692      -81.109528        0
3  2004_Charley67753-198272  2004_Charley      29.270940      -81.097300        0
4  2004_Charley64829-191531  2004_Charley      29.313223      -81.101620        0
5  2004_Charley67376-197429  2004_Charley      29.196990      -80.993800        0
6  2004_Charley73720-211013  2004_Charley      29.171450      -81.037170        0
7  2004_Charley73705-210991  2004_Charley      28.308746      -81.424273        1
8  2004_Charley65157-192371  2004_Charley      28.308746      -81.424273        1
9  2004_Charley65126-192326  2004_Charley      28.308746      -81.424273        1
