
Extract integers from strings

I want to create the target list for a clustering problem with numerous classes from the list of class names (labels), one per instance of my dataset.

class_name = ['class_1','class_2','class_3','class_3','class_1','class_2',\
'class_2','class_1','class_1','class_2','class_1','class_3'] 

The target should be an array of the same length as the class_name list, with an integer assigned to each distinct class label. What is the best way to get this?

target = np.array([1, 2, 3, 3, 1, 2, 2, 1, 1, 2, 1, 3])

A class label (e.g. class_1) has the form 'Xx_xx_xxx(A123)' or 'Xx_xx_xxx (A123)'. The text in the parentheses is not fixed. The list elements are of type 'unicode'.

You can use a list comprehension to split each string on the '_' character, take the digit at index [1], then convert it to int:

>>> target = np.array([int(i.split('_')[1]) for i in class_name])
>>> target
array([1, 2, 3, 3, 1, 2, 2, 1, 1, 2, 1, 3])
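
The split on '_' relies on every label looking exactly like class_N. If the integer sits somewhere else in the string, for instance inside the parentheses of a label shaped like 'Xx_xx_xxx (A123)', a regular expression may be more robust. The following is only a minimal sketch: it assumes the first run of digits in each label is the number wanted, and both the helper name `first_int` and the sample labels are made up for illustration.

```python
import re

def first_int(label):
    """Return the first run of digits in label as an int, or None if there is none."""
    match = re.search(r'\d+', label)
    return int(match.group()) if match else None

# Hypothetical labels, not taken from the original dataset
labels = ['class_1', 'class_12', 'Xx_xx_xxx (A123)', 'no_digits']
extracted = [first_int(label) for label in labels]
print(extracted)  # [1, 12, 123, None]
```

Returning None for labels without digits keeps bad inputs visible instead of raising in the middle of a list comprehension; wrap the result in np.array only after checking for None.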

The first thing you should do is get the classes into a standard format. From what you described above, if the class name is inside the parentheses within the string, then you can use a regex to extract just the class name.

import re
X = ['abc(class_1)', 'cde_(class_1)', 'def_(class_2)']
just_classes = [re.findall(r'\((.*)\)', thing)[0] for thing in X]
# ['class_1', 'class_1', 'class_2']
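
The question mentions labels both with and without a space before the parenthesis, so it may help to normalize them before encoding. Here is a rough sketch that splits a label into the part before the parentheses and the text inside them. It assumes each label ends with at most one parenthesized group; `LABEL_RE`, `split_label`, and the sample labels are hypothetical names chosen for this example.

```python
import re

# Name, optional whitespace, then one parenthesized group at the end of the string
LABEL_RE = re.compile(r'^(.*?)\s*\(([^()]*)\)\s*$')

def split_label(label):
    """Return (name, text_in_parens); text_in_parens is None when there are no parentheses."""
    match = LABEL_RE.match(label)
    if match is None:
        return label.strip(), None
    return match.group(1), match.group(2)

# Hypothetical labels in the two shapes mentioned in the question
raw = ['Aa_bb_ccc(A123)', 'Aa_bb_ccc (A123)', 'Dd_ee_fff (B7)']
names = [split_label(label)[0] for label in raw]
print(names)  # ['Aa_bb_ccc', 'Aa_bb_ccc', 'Dd_ee_fff']
```

Whichever part identifies the class (the name or the text in parentheses) can then be passed to the encoding step.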

There are a few different approaches you can use here. If you're doing machine learning with the numpy/scipy stack, I'd suggest learning the sklearn library. It has a lot of useful machine learning and AI tools, including one for encoding class names.

Using sklearn

from sklearn.preprocessing import LabelEncoder
class_names = ['class_1','class_2','class_3','class_3','class_1','class_2',\
        'class_2','class_1','class_1','class_2','class_1','class_3'] 

my_enc = LabelEncoder()
my_enc.fit(class_names)
encoded1 =  my_enc.transform(class_names)

No external library

classes = sorted(set(class_names))  # sort so the mapping is the same on every run
d = {c: i for i, c in enumerate(classes)}
encoded2 = [d[c_name] for c_name in class_names]
print(encoded1)  # approach 1 gives a numpy array
print(encoded2)  # approach 2 gives a standard Python list

Both approaches should work. It's not much code to implement on your own, but in general I'd suggest looking at the sklearn preprocessing tools.
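
Note that LabelEncoder and enumerate both number the classes from 0, whereas the target shown in the question starts at 1. If the codes should start at 1 and follow the order in which labels first appear, something along these lines may do; it is a plain-Python sketch, and the names `codes` and `target` are chosen here for illustration.

```python
class_names = ['class_1', 'class_2', 'class_3', 'class_3', 'class_1', 'class_2',
               'class_2', 'class_1', 'class_1', 'class_2', 'class_1', 'class_3']

codes = {}   # label -> integer code, filled in as new labels are seen
target = []
for name in class_names:
    if name not in codes:
        codes[name] = len(codes) + 1  # next unused code, starting at 1
    target.append(codes[name])

print(target)  # [1, 2, 3, 3, 1, 2, 2, 1, 1, 2, 1, 3]
```

Because codes are handed out in order of first appearance, the result does not depend on how the label strings sort.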


 