
Feature selection from subset of dataframe

I'm working with a DNS traffic dataset in which I have some relevant IPs (8 relevant users) whose traffic I want to filter. I have 100 JSON files, each representing one day (session) of traffic. I want a matrix of occurrence counts built from the values of one column (dns_query), because I'm training an ML algorithm on this data. Let's say I have the following columns:

[image: example of the dataframe's columns]

The only columns relevant to me are dns_query and s_ip, meaning I have the source IP and the requested domain. To this end I've tried different approaches, but I'm stuck.

import os
import pandas as pd
import numpy as np
import json

from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer

user1 = ['10.0.0.44']  # test with 1 user
real_users = ['10.0.0.44','10.0.0.60','10.0.0.33','10.0.0.32','10.0.0.42','10.0.0.31',
              '10.0.0.34','10.0.0.29']  # real users

flag = 0
f = '/content/drive/MyDrive/anon_dns_data'  # folder with 100 files

# trying different feature extraction methods
count_vectorizer = CountVectorizer()
hash_vectorizer = HashingVectorizer()
tfidf_trans = TfidfTransformer()
tfidf_vectorizer = TfidfVectorizer()

try:
    for root, dirs, files in os.walk(f):
        flag += 1  # flag to control the days of traffic
        for filename in files:
            filepath = os.path.join(root, filename)  # don't shadow `files`
            data = pd.read_json(filepath)
            print(filepath)
            columns = data.loc[:, ['s_ip', 'dns_query']]  # keep only the relevant columns
            subset = columns[columns["s_ip"].isin(user1)]  # filter by IP
            print(subset[:50], subset.shape)  # this line prints image 2 (below)
            # the vectorizers expect an iterable of strings, so pass the
            # dns_query column rather than the whole dataframe
            a = count_vectorizer.fit_transform(subset["dns_query"])
            b = hash_vectorizer.fit_transform(subset["dns_query"])
            d = tfidf_vectorizer.fit_transform(subset["dns_query"])
            if flag == 1:
                break
except Exception as e:
    print(e)
#print(a.toarray(), a.shape)
b.toarray()
#print(b[50:])
#print(d.toarray(), d.shape)

The following image (image 2, printed by the marked line above) shows the domains requested by one user:

[image: domains requested by one user]

To be more specific, I want a matrix like the following example from sklearn. Let's say we have a corpus with 4 elements (to me, each element of the list represents one day of traffic that I'm treating as a dataframe):

[image: sklearn CountVectorizer example matrix]

where each row represents one day of traffic for a single user. That is, the first 8 rows of the N columns (n requested domains) represent one day of traffic, so if I try with 10 days I should get a matrix of 8 * 10 = 80 rows by N columns. How can I achieve something like this, and which sklearn feature selection/extraction class fits my problem? Any help/guidance will be appreciated!
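
For illustration, here is a minimal sketch (with hypothetical users, days and domains) of the row layout I'm after:

import numpy as np
import pandas as pd

users = ['10.0.0.44', '10.0.0.60']      # 8 users in the real data
days = ['2022-01-01', '2022-01-02']     # up to 100 days in the real data
domains = ['a.com', 'b.org', 'c.net']   # the N vocabulary columns

# one row per (user, day): len(users) * len(days) rows by N columns
index = pd.MultiIndex.from_product([users, days], names=['s_ip', 'day'])
matrix = pd.DataFrame(np.zeros((len(index), len(domains)), dtype=int),
                      index=index, columns=domains)
print(matrix.shape)  # (4, 3) here; (80, N) for 8 users and 10 days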

Here's one way to use CountVectorizer on dns_query for the groups I think you want.

Python code summary:

  1. import my toy DNS JSON records into df
  2. groupby s_ip and day ( timestamp.date() ) into df_groupby
  3. create new_df with the groups and the join'ed dns_query strings ( " " separator)
  4. import CountVectorizer
  5. specify a vectorizer with a custom tokenizer that just splits on whitespace
  6. do the fit_transform
  7. show the X array result

Some steps can be combined, etc., but I wanted to demonstrate the technique and show some intermediate results; a condensed version of the whole pipeline is sketched just below. You will need to adapt this to your data.
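
For reference, a condensed sketch of the same steps (assuming the same toy dns_jq.json file with s_ip, dns_query and timestamp columns as in the transcript further down):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_json("dns_jq.json", orient="records")

# one document per (s_ip, day): join each group's queries with spaces
docs = (df.groupby([df.s_ip, df.timestamp.dt.date])["dns_query"]
          .apply(" ".join))

# split on whitespace only, so each domain stays intact as one token
vectorizer = CountVectorizer(lowercase=False, tokenizer=lambda s: s.split())
X = vectorizer.fit_transform(docs.values)
print(X.shape)  # (number of (s_ip, day) groups, number of distinct domains)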

NB: If I understand CountVectorizer properly, you will either need to run fit_transform so that all possible dns_query strings are present somewhere in the input (as I've done here), or specify a full vocabulary for CountVectorizer, so that a meaningful matrix can be generated in the end.
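
For example (a sketch with a hypothetical, hand-picked vocabulary; in practice you would collect the domains from all 100 files first):

from sklearn.feature_extraction.text import CountVectorizer

# hypothetical fixed vocabulary, built once up front from all files
all_domains = ["corporateseize.org", "dynamicreal-time.org",
               "internalrepurpose.io"]

vectorizer = CountVectorizer(lowercase=False, tokenizer=lambda s: s.split(),
                             vocabulary=all_domains)

# with a preset vocabulary no fitting is needed, and transform()
# yields the same column order for every batch of documents
day1 = ["dynamicreal-time.org corporateseize.org"]
day2 = ["internalrepurpose.io"]
print(vectorizer.transform(day1).toarray())  # [[1 1 0]]
print(vectorizer.transform(day2).toarray())  # [[0 0 1]]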

$ ipython
Python 3.10.4 (main, Mar 25 2022, 00:00:00) [GCC 11.2.1 20220127 (Red Hat 11.2.1-9)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: df = pd.read_json("dns_jq.json", orient="records")

In [3]: df
Out[3]: 
               s_ip                dns_query                 timestamp
0    93.247.220.198     dynamicreal-time.org 2022-01-02 07:28:47+00:00
1    89.121.211.207   nationalintegrate.name 2022-01-02 22:01:08+00:00
2        94.6.90.22     productstrategic.org 2022-01-04 20:07:59+00:00
3   154.147.200.177  districtuser-centric.io 2022-01-02 08:21:11+00:00
4     50.104.137.53    dynamice-commerce.biz 2022-01-02 13:10:44+00:00
..              ...                      ...                       ...
95    77.236.52.126  districtinterfaces.info 2022-01-05 19:14:12+00:00
96   93.247.220.198   internalimplement.name 2022-01-04 02:18:44+00:00
97   89.121.211.207     globalsyndicate.name 2022-01-03 05:20:20+00:00
98       94.6.90.22     internalrepurpose.io 2022-01-04 01:05:23+00:00
99  154.147.200.177     dynamicreal-time.org 2022-01-01 17:21:45+00:00

[100 rows x 3 columns]

In [4]: df.s_ip.unique()
Out[4]: 
array(['93.247.220.198', '89.121.211.207', '94.6.90.22',
       '154.147.200.177', '50.104.137.53', '64.0.100.231',
       '55.209.226.216', '77.236.52.126'], dtype=object)

In [5]: df.dns_query.unique()
Out[5]: 
array(['dynamicreal-time.org', 'nationalintegrate.name',
       'productstrategic.org', 'districtuser-centric.io',
       'dynamice-commerce.biz', 'forwardintuitive.io',
       'corporateseize.org', 'districtinterfaces.info',
       'internalimplement.name', 'globalsyndicate.name',
       'internalrepurpose.io'], dtype=object)

In [6]: df_groupby = df.groupby(lambda k: (df.iloc[k].s_ip, df.iloc[k].timestamp.date()))

In [7]: df_groupby
Out[7]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f6eb79fea10>

In [8]: df_groupby.groups
Out[8]: {('154.147.200.177', 2022-01-01): [99], ('154.147.200.177', 2022-01-02): [3, 11, 19, 27, 51, 83], ('154.147.200.177', 2022-01-03): [67], ('154.147.200.177', 2022-01-04): [35, 43, 59, 75, 91], ('50.104.137.53', 2022-01-01): [28, 36, 52, 60], ('50.104.137.53', 2022-01-02): [4, 20, 44], ('50.104.137.53', 2022-01-03): [76], ('50.104.137.53', 2022-01-04): [12, 68, 84, 92], ('55.209.226.216', 2022-01-01): [6, 14, 30, 86], ('55.209.226.216', 2022-01-02): [38, 54, 70], ('55.209.226.216', 2022-01-03): [46, 78, 94], ('55.209.226.216', 2022-01-04): [62], ('55.209.226.216', 2022-01-05): [22], ('64.0.100.231', 2022-01-01): [29, 77], ('64.0.100.231', 2022-01-02): [37, 45, 53, 85], ('64.0.100.231', 2022-01-03): [21], ('64.0.100.231', 2022-01-04): [13, 61], ('64.0.100.231', 2022-01-05): [5, 69, 93], ('77.236.52.126', 2022-01-01): [47, 79], ('77.236.52.126', 2022-01-02): [15], ('77.236.52.126', 2022-01-03): [7, 23, 39], ('77.236.52.126', 2022-01-04): [31, 71, 87], ('77.236.52.126', 2022-01-05): [55, 63, 95], ('89.121.211.207', 2022-01-01): [17], ('89.121.211.207', 2022-01-02): [1, 41, 57], ('89.121.211.207', 2022-01-03): [9, 25, 33, 65, 73, 97], ('89.121.211.207', 2022-01-04): [81, 89], ('89.121.211.207', 2022-01-05): [49], ('93.247.220.198', 2022-01-01): [32], ('93.247.220.198', 2022-01-02): [0, 48, 56, 64, 80, 88], ('93.247.220.198', 2022-01-03): [8, 72], ('93.247.220.198', 2022-01-04): [24, 96], ('93.247.220.198', 2022-01-05): [16, 40], ('94.6.90.22', 2022-01-02): [42, 50, 74], ('94.6.90.22', 2022-01-03): [26, 90], ('94.6.90.22', 2022-01-04): [2, 10, 18, 58, 66, 98], ('94.6.90.22', 2022-01-05): [34, 82]}

In [9]: new_df=pd.DataFrame({"group": df_groupby.groups.keys(), "dns_queries":[" ".join(df.loc[k].dn
   ...: s_query.values) for k in df_groupby.groups.values()]})

In [10]: new_df
Out[10]: 
                            group                                        dns_queries
0   (154.147.200.177, 2022-01-01)                               dynamicreal-time.org
1   (154.147.200.177, 2022-01-02)  districtuser-centric.io dynamicreal-time.org i...
2   (154.147.200.177, 2022-01-03)                             nationalintegrate.name
3   (154.147.200.177, 2022-01-04)  productstrategic.org internalrepurpose.io dyna...
4     (50.104.137.53, 2022-01-01)  corporateseize.org districtuser-centric.io int...
5     (50.104.137.53, 2022-01-02)  dynamice-commerce.biz globalsyndicate.name dyn...
6     (50.104.137.53, 2022-01-03)                               internalrepurpose.io
7     (50.104.137.53, 2022-01-04)  nationalintegrate.name productstrategic.org di...
8    (55.209.226.216, 2022-01-01)  corporateseize.org districtuser-centric.io int...
9    (55.209.226.216, 2022-01-02)  forwardintuitive.io internalrepurpose.io dynam...
10   (55.209.226.216, 2022-01-03)  productstrategic.org nationalintegrate.name co...
11   (55.209.226.216, 2022-01-04)                            districtinterfaces.info
12   (55.209.226.216, 2022-01-05)                               dynamicreal-time.org
13     (64.0.100.231, 2022-01-01)       districtinterfaces.info dynamicreal-time.org
14     (64.0.100.231, 2022-01-02)  dynamice-commerce.biz nationalintegrate.name g...
15     (64.0.100.231, 2022-01-03)                               internalrepurpose.io
16     (64.0.100.231, 2022-01-04)            productstrategic.org corporateseize.org
17     (64.0.100.231, 2022-01-05)  forwardintuitive.io districtuser-centric.io fo...
18    (77.236.52.126, 2022-01-01)       districtuser-centric.io productstrategic.org
19    (77.236.52.126, 2022-01-02)                              dynamice-commerce.biz
20    (77.236.52.126, 2022-01-03)  districtinterfaces.info nationalintegrate.name...
21    (77.236.52.126, 2022-01-04)  globalsyndicate.name forwardintuitive.io inter...
22    (77.236.52.126, 2022-01-05)  dynamicreal-time.org internalimplement.name di...
23   (89.121.211.207, 2022-01-01)                                 corporateseize.org
24   (89.121.211.207, 2022-01-02)  nationalintegrate.name internalimplement.name ...
25   (89.121.211.207, 2022-01-03)  globalsyndicate.name districtuser-centric.io d...
26   (89.121.211.207, 2022-01-04)       dynamice-commerce.biz nationalintegrate.name
27   (89.121.211.207, 2022-01-05)                                forwardintuitive.io
28   (93.247.220.198, 2022-01-01)                               internalrepurpose.io
29   (93.247.220.198, 2022-01-02)  dynamicreal-time.org dynamice-commerce.biz nat...
30   (93.247.220.198, 2022-01-03)          internalimplement.name corporateseize.org
31   (93.247.220.198, 2022-01-04)        productstrategic.org internalimplement.name
32   (93.247.220.198, 2022-01-05)        forwardintuitive.io districtinterfaces.info
33       (94.6.90.22, 2022-01-02)  globalsyndicate.name corporateseize.org intern...
34       (94.6.90.22, 2022-01-03)         dynamice-commerce.biz productstrategic.org
35       (94.6.90.22, 2022-01-04)  productstrategic.org internalrepurpose.io dist...
36       (94.6.90.22, 2022-01-05)         nationalintegrate.name forwardintuitive.io

In [11]: from sklearn.feature_extraction.text import CountVectorizer

In [12]: vectorizer = CountVectorizer(lowercase=False, tokenizer=lambda s: s.split())

In [13]: X = vectorizer.fit_transform(new_df["dns_queries"].values)

In [14]: X.toarray()
Out[14]: 
array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
       [1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
       [1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
       [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
       [0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
       [0, 1, 1, 0, 1, 0, 2, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 1, 1, 2, 0, 1, 0, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
       [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
       [0, 1, 1, 0, 1, 0, 0, 0, 2, 0, 1],
       [0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]])
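
Tying this back to your setup, here is a sketch of how you might restrict to your 8 real_users and get one row per (user, day) with a fixed column order (the IPs below are my toy ones; substitute your own):

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_json("dns_jq.json", orient="records")

real_users = ['93.247.220.198', '89.121.211.207']  # your 8 IPs here

subset = df[df.s_ip.isin(real_users)]
docs = (subset.groupby([subset.s_ip, subset.timestamp.dt.date])["dns_query"]
              .apply(" ".join))

# fix the vocabulary so every day/file produces the same columns
vectorizer = CountVectorizer(lowercase=False, tokenizer=lambda s: s.split(),
                             vocabulary=sorted(df.dns_query.unique()))
X = vectorizer.transform(docs.values)
print(X.shape)         # (users x days rows, N domain columns)
print(docs.index[:3])  # which (s_ip, day) each row corresponds to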
