
Feature extraction from the training data

I have training data like the sample below, where all the information sits in a single column. The data set has more than 300,000 rows.

id         features                                                     label

1          name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;    1
2          name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;   1
3          name=David;age=12;1:=High School;2:=Cricketer;native=america;    2
4          name=George;age=11;1:=High School;2:=Carpenter;married=yes       2
.
.

300000     name=Kevin;age=16;1:=High School;2:=Driver;Smoker=No             3

Now I need to convert this training data into the form below:

 id   name          age   1               2                Interest      married   Smoker
 1    John Matthew   25   Post Graduate   Football Player   Nan           Nan      Nan
 2    Mark clark     21   Under Graduate  Nan               Video Games   Nan      Nan
 .
 .

Is there an efficient way to do this? I tried the code below, but it took 3 hours to complete.

#Getting the proper features from the features column

    cols = {}
    for choices in set_label:
        collection_list = []
        array = train["features"][train["label"] == choices].values
        for i in range(1,len(array)):
            var_split = array[i].split(";")
            try :
                d = (dict(s.split('=') for s in var_split))
                for x in d.keys():
                    collection_list.append(x)
            except ValueError:
                pass  # skip rows that do not parse as key=value pairs
        count = Counter(collection_list)
        for k , v in count.most_common(5):
            key = k.replace(":","").replace(" ","_").lower()
            cols[key] = v

    columns_add = list(cols.keys())
    train = train.reindex(columns = np.append( train.columns.values, columns_add))
    print (train.columns)
    print (train.shape)

#Adding the values for the newly created problem

    for row in train.itertuples():
        dummy_dict = {}
        new_dict={}
        value = train.loc[row.Index, 'features']
        v_split = value.split(";")
        try :
            dummy_dict = (dict(s.split('=') for s in v_split))
            for k, v in dummy_dict.items():
                new_key = k.replace(":","").replace(" ","_").lower()
                new_dict[new_key] = v
        except ValueError:
            pass  # skip rows that do not parse as key=value pairs
        for k,v in new_dict.items():
            if k in train.columns:
                train.loc[row.Index, k] = v

Is there any function I can apply here to extract the features more efficiently?

Create two DataFrames that match your description. In the first one, every data point has the same features. The second one is a modification of the first, with different features for some data points:

import pandas as pd
import numpy as np
import random
import time
import itertools


# Create a DataFrame where all the keys for each datapoint in the "features" column are the same.
num = 300000


NAMES = ['John', 'Mark', 'David', 'George', 'Kevin']
AGES = [25, 21, 12, 11, 16]
FEATURES1 = ['Post Graduate', 'Under Graduate', 'High School']
FEATURES2 = ['Football Player', 'Cricketer', 'Carpenter', 'Driver']
LABELS = [1, 2, 3]



# Build the "features" column directly; every row has the same four keys.
df = pd.DataFrame({
    "features": ["name={0};age={1};feature1={2};feature2={3}".format(
                     NAMES[np.random.randint(0, len(NAMES))],
                     AGES[np.random.randint(0, len(AGES))],
                     FEATURES1[np.random.randint(0, len(FEATURES1))],
                     FEATURES2[np.random.randint(0, len(FEATURES2))])
                 for i in range(num)]
})

df['label'] = [LABELS[np.random.randint(0, len(LABELS))] for i in range(num)]

print(df.head(20))



# Create a modified sample DataFrame from the previous one, where not all the keys are the same for each data point.


mod_df = df.copy()  # copy, so that df itself keeps the same keys in every row
random_positions1 = random.sample(range(10), 5)
random_positions2 = random.sample(range(11, 20), 5)

INTERESTS = ['Basketball', 'Golf', 'Rugby']
SMOKING = ['Yes', 'No']

mod_df.loc[random_positions1, 'features'] = [
    "name={0};age={1};interest={2}".format(
        NAMES[np.random.randint(0, len(NAMES))],
        AGES[np.random.randint(0, len(AGES))],
        INTERESTS[np.random.randint(0, len(INTERESTS))])
    for i in range(len(random_positions1))]

mod_df.loc[random_positions2, 'features'] = [
    "name={0};age={1};smoking={2}".format(
        NAMES[np.random.randint(0, len(NAMES))],
        AGES[np.random.randint(0, len(AGES))],
        SMOKING[np.random.randint(0, len(SMOKING))])
    for i in range(len(random_positions2))]


print(mod_df.head(20))

Assume that your original data is stored in a DataFrame called df.

Solution 1 (all the features are the same for every data point).

def func2(y):
    # Return the value part of a single "key=value" pair.
    lista = y.split('=')
    value = lista[1]
    return value


def function(x):
    # Return the list of values found in one "features" string.
    lista = x.split(';')
    array = [func2(i) for i in lista]
    return array


# Calculate the execution time
start = time.time()

array = df.features.apply(function).tolist()
new_df = pd.DataFrame.from_records(array, columns=['name', 'age', '1', '2'])

end = time.time()

print(new_df.head())

print('Total time: {0}'.format(end - start))

Total time: 1.80923295021

Edit: The one thing you need to do is adjust the columns list to match your data.
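If you would rather not type the columns list by hand, it can be read off the first row, because Solution 1 already assumes that every row has the same keys in the same order. A minimal sketch of that idea follows; the helper names keys_of and values_of and the two sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row sample in the same "key=value;key=value" layout.
df = pd.DataFrame({'features': [
    'name=John;age=25;feature1=Post Graduate;feature2=Driver',
    'name=Mark;age=21;feature1=High School;feature2=Cricketer',
]})


def keys_of(row):
    """Return the keys of one 'k=v;k=v' string, in order."""
    return [pair.split('=')[0] for pair in row.split(';')]


def values_of(row):
    """Return the values of one 'k=v;k=v' string, in order."""
    return [pair.split('=')[1] for pair in row.split(';')]


# Take the column names from the first row instead of hard-coding them.
columns = keys_of(df.features.iloc[0])
new_df = pd.DataFrame.from_records(df.features.apply(values_of).tolist(),
                                   columns=columns)
print(new_df)
```

This only holds while all rows really share the same keys; otherwise Solution 2 is the safer route.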


Solution 2 (the features might be the same or different for every data point).

import pandas as pd
import numpy as np
import time
import itertools

# The following functions are meant to extract the keys from each row, which are going to be used as columns.
def extract_key(x):
    return x.split('=')[0]

def def_columns(x):
    lista = x.split(';')
    keys = [extract_key(i) for i in lista if i]  # ignore empty chunks
    return keys

df = mod_df
columns = df.features.apply(def_columns).tolist()
flattened_columns = list(itertools.chain(*columns))
flattened_columns = np.unique(np.array(flattened_columns)).tolist()
print(flattened_columns)

# This function turns each row from the original dataframe into a dictionary.
def function(x):
    lista = x.split(';')
    dict_ = {}
    for i in lista:
        if '=' not in i:
            continue  # skip empty chunks, e.g. after a trailing ';'
        key, val = i.split('=')
        dict_[key] = val
    return dict_


# Build the whole DataFrame at once from the list of dictionaries.
arr = df.features.apply(function).tolist()
new_df = pd.DataFrame(arr)
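To see how the dictionary approach behaves on the rows from the question itself, here is a small worked sketch. The rows are copied from the question; the key clean-up in clean_key (dropping ':' and '.', lower-casing) is an assumed convention modelled on the question's own code, and both helper names are hypothetical.

```python
import pandas as pd

# Rows copied from the question, including the trailing ';' and the
# inconsistent '1.=' / '1:=' keys.
features = [
    'name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;',
    'name=Mark clark;age=21;1.=Under Graduate;Interest=Video Games;',
    'name=David;age=12;1:=High School;2:=Cricketer;native=america;',
    'name=George;age=11;1:=High School;2:=Carpenter;married=yes',
]
df = pd.DataFrame({'features': features})


def clean_key(key):
    """Normalise a key so that '1.' and '1:' end up in the same column."""
    key = key.replace(':', '').replace('.', '').strip()
    return key.replace(' ', '_').lower()


def row_to_dict(row):
    """Turn one 'k=v;k=v;' string into a dict, ignoring empty chunks."""
    result = {}
    for chunk in row.split(';'):
        key, sep, value = chunk.partition('=')
        if sep:  # chunks without '=' (such as the empty tail) are skipped
            result[clean_key(key)] = value.strip()
    return result


# Keys missing from a row become NaN in that row.
new_df = pd.DataFrame(df.features.apply(row_to_dict).tolist())
print(new_df)
```

Using partition instead of split keeps the parser from raising on chunks that contain no '=' at all.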

Suppose your data looks like this:

import pandas as pd

features = ["name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;",
            "name=Mark clark;age=21;1:=Under Graduate;2:=Football Player;",
            "name=David;age=12;1:=High School;2:=Cricketer;",
            "name=George;age=11;1:=High School;2:=Carpenter;",
            "name=Kevin;age=16;1:=High School;2:=Driver;"]
df = pd.DataFrame({'features': features})

Starting from this answer, I replace every key prefix (name=, ;age=, ;1:=, ;2:=) with ; and then split on ; using this function:

def replace_feature(x):
    for r in (("name=", ";"), (";age=", ";"), (';1:=', ';'), (';2:=', ";")):
        x = x.replace(*r)
    x = x.split(';')
    return x
df = df.assign(features= df.features.apply(replace_feature))

After applying that function to your df, every value in the features column is a list of feature values that you can access by index. I then use four custom functions to get each attribute: name, age, grade, and job. Note: there may be a better way to do this with a single function.

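# Each helper receives one row; index 0 of the list is the empty string
# left in front of the first ';', so the values start at index 1.
def get_name(row):
    return row['features'][1]
def get_age(row):
    return row['features'][2]
def get_grade(row):
    return row['features'][3]
def get_job(row):
    return row['features'][4]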

And finally, apply those functions to your DataFrame:

df = df.assign(name = df.apply(get_name, axis=1),
         age = df.apply(get_age, axis=1),
         grade = df.apply(get_grade, axis=1),
         job = df.apply(get_job, axis=1))

Hopefully this runs fast enough.
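The note above about doing this with a single function could look like the sketch below. It is one possible shape rather than the only one: split_features is a hypothetical helper, and the column names and the two sample rows are chosen for illustration.

```python
import pandas as pd

features = [
    'name=John Matthew;age=25;1:=Post Graduate;2:=Football Player;',
    'name=David;age=12;1:=High School;2:=Cricketer;',
]
df = pd.DataFrame({'features': features})


def replace_feature(x):
    # Same idea as above: turn every key prefix into a bare ';' and split.
    for r in (('name=', ';'), (';age=', ';'), (';1:=', ';'), (';2:=', ';')):
        x = x.replace(*r)
    return x.split(';')


def split_features(frame):
    """Hypothetical single helper: add name, age, grade and job columns."""
    parts = pd.DataFrame(frame['features'].apply(replace_feature).tolist(),
                         index=frame.index)
    # Column 0 is the empty string before the first ';', column 5 the
    # empty tail after the last one, so the values sit in columns 1 to 4.
    parts = parts.iloc[:, 1:5]
    parts.columns = ['name', 'age', 'grade', 'job']
    return frame.join(parts)


df = split_features(df)
print(df)
```

Building one intermediate DataFrame from the lists avoids the four separate row-wise apply calls.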

As far as I understand your code, the poor performance comes from creating the DataFrame element by element. It's better to create the whole DataFrame at once from a list of dictionaries.
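To make the difference concrete, the rough sketch below times both styles on 2,000 synthetic rows. The rows, the parse helper, and the row count are invented for illustration, and the timings will vary from machine to machine, so no numbers are quoted here.

```python
import time
import pandas as pd

# Synthetic rows in the same "key=value;" layout (made up for this sketch).
rows = ['name=N{0};age={1};'.format(i, 10 + i % 50) for i in range(2000)]


def parse(row):
    """Turn one 'k=v;k=v;' string into a dict, ignoring empty chunks."""
    return dict(pair.split('=') for pair in row.split(';') if pair)


# Element by element: one .loc assignment per cell.
start = time.time()
slow = pd.DataFrame(index=range(len(rows)), columns=['name', 'age'])
for i, row in enumerate(rows):
    for key, value in parse(row).items():
        slow.loc[i, key] = value
slow_seconds = time.time() - start

# All at once: build the list of dicts first, then call the constructor once.
start = time.time()
fast = pd.DataFrame([parse(row) for row in rows])
fast_seconds = time.time() - start

print('cell by cell: {0:.3f}s, list of dicts: {1:.3f}s'.format(
    slow_seconds, fast_seconds))
```

Both variants produce the same values; only the way the DataFrame is filled differs.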

Let's recreate your input DataFrame:

import pandas as pd
from io import StringIO

data = StringIO("""id         features                                                     label

1          name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;    1
2          name=Mark clark;age=21;1.=Under Graduate;2.=Football Player;     1
3          name=David;age=12;1:=High School;2:=Cricketer;                   2
4          name=George;age=11;1:=High School;2:=Carpenter;                  2""")
# Columns are separated by runs of three or more spaces.
df = pd.read_table(data, sep=r'\s{3,}', engine='python')

We can check:

print(df)
   id                                           features  label
0   1  name=John Matthew;age=25;1.=Post Graduate;2.=F...      1
1   2  name=Mark clark;age=21;1.=Under Graduate;2.=Fo...      1
2   3     name=David;age=12;1:=High School;2:=Cricketer;      2
3   4    name=George;age=11;1:=High School;2:=Carpenter;      2

Now we can create the needed list of dictionaries with the following code:

feat=[]
for line in df['features']:
    line=line.replace(':','.')
    lsp=line.split(';')[:-1]
    feat.append(dict([elt.split('=') for elt in lsp]))

And the resulting DataFrame:

print(pd.DataFrame(feat))
               1.               2. age          name
0   Post Graduate  Football Player  25  John Matthew
1  Under Graduate  Football Player  21    Mark clark
2     High School        Cricketer  12         David
3     High School        Carpenter  11        George
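The question also wants id kept next to the extracted columns. One way to get there, sketched under the assumption that the parsed rows keep the same order as df, is to give the new frame the index of df and concatenate side by side. The two-row df below is a cut-down stand-in for the real data.

```python
import pandas as pd

# Cut-down stand-in for the real training data.
df = pd.DataFrame({
    'id': [1, 2],
    'features': ['name=John Matthew;age=25;1.=Post Graduate;2.=Football Player;',
                 'name=David;age=12;1:=High School;2:=Cricketer;'],
    'label': [1, 2],
})

# Same parsing loop as above.
feat = []
for line in df['features']:
    line = line.replace(':', '.')
    lsp = line.split(';')[:-1]
    feat.append(dict(elt.split('=') for elt in lsp))

# Reuse df.index so the rows line up when concatenating column-wise.
result = pd.concat(
    [df[['id']], pd.DataFrame(feat, index=df.index), df[['label']]],
    axis=1)
print(result)
```

Passing index=df.index matters when df does not have a default 0..n-1 index, for example after filtering.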
