feature selection from subset of dataframe
I'm working with a DNS traffic dataset where I have some relevant IPs (8 relevant users) whose traffic I want to filter. I have 100 JSON files, each representing one day (session) of traffic. I want a matrix of occurrences of the values in one column (dns_query), because I'm training an ML algorithm on this data. Let's say I have the following columns:
The only relevant columns for me are dns_query and s_ip, which means I have the source IP and the requested domain. To this end I've tried different ways, but I'm stuck.
import os
import pandas as pd
import numpy as np
import json
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfTransformer, TfidfVectorizer

user1 = ['10.0.0.44']  # test with 1 user
real_users = ['10.0.0.44', '10.0.0.60', '10.0.0.33', '10.0.0.32', '10.0.0.42', '10.0.0.31',
              '10.0.0.34', '10.0.0.29']  # real users
flag = 0
f = '/content/drive/MyDrive/anon_dns_data'  # folder with 100 files

# trying different feature selection methods
count_vectorizer = CountVectorizer()
hash_vectorizer = HashingVectorizer()
tfidf_trans = TfidfTransformer()
tfidf_vectorizer = TfidfVectorizer()

try:
    for root, dirs, files in os.walk(f):
        flag += 1  # flag to control the days of traffic
        for filename in files:
            filepath = os.path.join(root, filename)  # renamed: don't shadow `files` from os.walk
            data = pd.read_json(filepath)
            print(filepath)
            columns = data.loc[:, ['s_ip', 'dns_query']]  # get only relevant columns
            subset = columns[columns["s_ip"].isin(user1)]  # filter by ip
            print(subset[:50], subset.shape)  # this line prints the image 2
            # NOTE: passing a DataFrame vectorizes its column names, not its rows
            a = count_vectorizer.fit_transform(subset)
            b = hash_vectorizer.fit_transform(subset)
            d = tfidf_vectorizer.fit_transform(subset)
            if flag == 1:
                break
except Exception as e:
    print(e)

# print(a.toarray(), a.shape)
b.toarray()
# print(b[50:])
# print(d.toarray(), d.shape)
The above image represents the domains requested by one user:
To be more specific, I want a matrix like the following example from sklearn. Let's say we have a corpus with 4 elements (to me, each element of the list represents a day of traffic that I'm treating as a dataframe):
where each row represents one day of traffic of one user only. That means the first 8 rows of N columns (n requested domains) represent one day of traffic. So if I try with 10 days, I should have a matrix of 8*10 = 80 rows by N columns. How can I achieve something like this, and which sklearn feature selection/extraction class would fit my problem? Any help/guidance will be appreciated!
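For reference, the target shape can be sketched with toy data: fit one CountVectorizer over every day's queries so all domains share a single column space, then stack one row per user per day. The domains and counts below are made up purely for illustration (2 users × 2 days instead of 8 × 10):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical per-user, per-day query strings: days[d][u] = queries of user u on day d
days = [
    ["a.com b.com", "b.com"],        # day 1: user1, user2
    ["a.com", "a.com c.com c.com"],  # day 2: user1, user2
]

vectorizer = CountVectorizer(lowercase=False, tokenizer=lambda s: s.split())
# fit on everything so every domain gets a column, then transform day by day
vectorizer.fit([q for day in days for q in day])
X = np.vstack([vectorizer.transform(day).toarray() for day in days])

print(X.shape)  # (4, 3): users * days rows, one column per distinct domain
print(X)
```

With 8 users and 10 days the same stacking yields the 80-row matrix described above.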
Here's one way to use CountVectorizer on dns_query for the groups I think you want.
Python code summary:

- read the JSON into df
- groupby s_ip and day ( timestamp.date() ) into df_groupby
- build new_df with the groups and the join'ed dns_query strings ( " " separator)
- import CountVectorizer
- create a vectorizer with a custom tokenizer that just splits on whitespace
- fit_transform
- X array result

Some steps can be combined, etc., but I wanted to demonstrate the technique and show some intermediate results. You will need to adapt this to your data.
NB: If I understand CountVectorizer properly, you will need to run it so that all possible dns_query strings are present somewhere when you run fit_transform (like I've done here), or you will need to specify a full vocabulary for CountVectorizer so that in the end a meaningful matrix can be generated.
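A minimal sketch of that vocabulary option: if you pass an explicit vocabulary, each day can be transformed independently and the columns still line up (the domain list here is hypothetical, taken from a few of the example queries):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fixed, hypothetical vocabulary collected up front from all days
vocab = ["dynamicreal-time.org", "nationalintegrate.name", "productstrategic.org"]

vectorizer = CountVectorizer(lowercase=False,
                             tokenizer=lambda s: s.split(),
                             vocabulary=vocab)

# With a fixed vocabulary, transform() alone suffices -- no fitting step
day1 = vectorizer.transform(["dynamicreal-time.org productstrategic.org"])
day2 = vectorizer.transform(["nationalintegrate.name nationalintegrate.name"])

print(day1.toarray())  # columns follow the order of vocab
print(day2.toarray())
```

Because the column mapping is fixed up front, matrices produced on different days can be stacked or compared directly.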
$ ipython
Python 3.10.4 (main, Mar 25 2022, 00:00:00) [GCC 11.2.1 20220127 (Red Hat 11.2.1-9)]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.4.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import pandas as pd
In [2]: df = pd.read_json("dns_jq.json", orient="records")
In [3]: df
Out[3]:
s_ip dns_query timestamp
0 93.247.220.198 dynamicreal-time.org 2022-01-02 07:28:47+00:00
1 89.121.211.207 nationalintegrate.name 2022-01-02 22:01:08+00:00
2 94.6.90.22 productstrategic.org 2022-01-04 20:07:59+00:00
3 154.147.200.177 districtuser-centric.io 2022-01-02 08:21:11+00:00
4 50.104.137.53 dynamice-commerce.biz 2022-01-02 13:10:44+00:00
.. ... ... ...
95 77.236.52.126 districtinterfaces.info 2022-01-05 19:14:12+00:00
96 93.247.220.198 internalimplement.name 2022-01-04 02:18:44+00:00
97 89.121.211.207 globalsyndicate.name 2022-01-03 05:20:20+00:00
98 94.6.90.22 internalrepurpose.io 2022-01-04 01:05:23+00:00
99 154.147.200.177 dynamicreal-time.org 2022-01-01 17:21:45+00:00
[100 rows x 3 columns]
In [4]: df.s_ip.unique()
Out[4]:
array(['93.247.220.198', '89.121.211.207', '94.6.90.22',
'154.147.200.177', '50.104.137.53', '64.0.100.231',
'55.209.226.216', '77.236.52.126'], dtype=object)
In [5]: df.dns_query.unique()
Out[5]:
array(['dynamicreal-time.org', 'nationalintegrate.name',
'productstrategic.org', 'districtuser-centric.io',
'dynamice-commerce.biz', 'forwardintuitive.io',
'corporateseize.org', 'districtinterfaces.info',
'internalimplement.name', 'globalsyndicate.name',
'internalrepurpose.io'], dtype=object)
In [6]: df_groupby = df.groupby(lambda k: (df.iloc[k].s_ip, df.iloc[k].timestamp.date()))
In [7]: df_groupby
Out[7]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f6eb79fea10>
In [8]: df_groupby.groups
Out[8]: {('154.147.200.177', 2022-01-01): [99], ('154.147.200.177', 2022-01-02): [3, 11, 19, 27, 51, 83], ('154.147.200.177', 2022-01-03): [67], ('154.147.200.177', 2022-01-04): [35, 43, 59, 75, 91], ('50.104.137.53', 2022-01-01): [28, 36, 52, 60], ('50.104.137.53', 2022-01-02): [4, 20, 44], ('50.104.137.53', 2022-01-03): [76], ('50.104.137.53', 2022-01-04): [12, 68, 84, 92], ('55.209.226.216', 2022-01-01): [6, 14, 30, 86], ('55.209.226.216', 2022-01-02): [38, 54, 70], ('55.209.226.216', 2022-01-03): [46, 78, 94], ('55.209.226.216', 2022-01-04): [62], ('55.209.226.216', 2022-01-05): [22], ('64.0.100.231', 2022-01-01): [29, 77], ('64.0.100.231', 2022-01-02): [37, 45, 53, 85], ('64.0.100.231', 2022-01-03): [21], ('64.0.100.231', 2022-01-04): [13, 61], ('64.0.100.231', 2022-01-05): [5, 69, 93], ('77.236.52.126', 2022-01-01): [47, 79], ('77.236.52.126', 2022-01-02): [15], ('77.236.52.126', 2022-01-03): [7, 23, 39], ('77.236.52.126', 2022-01-04): [31, 71, 87], ('77.236.52.126', 2022-01-05): [55, 63, 95], ('89.121.211.207', 2022-01-01): [17], ('89.121.211.207', 2022-01-02): [1, 41, 57], ('89.121.211.207', 2022-01-03): [9, 25, 33, 65, 73, 97], ('89.121.211.207', 2022-01-04): [81, 89], ('89.121.211.207', 2022-01-05): [49], ('93.247.220.198', 2022-01-01): [32], ('93.247.220.198', 2022-01-02): [0, 48, 56, 64, 80, 88], ('93.247.220.198', 2022-01-03): [8, 72], ('93.247.220.198', 2022-01-04): [24, 96], ('93.247.220.198', 2022-01-05): [16, 40], ('94.6.90.22', 2022-01-02): [42, 50, 74], ('94.6.90.22', 2022-01-03): [26, 90], ('94.6.90.22', 2022-01-04): [2, 10, 18, 58, 66, 98], ('94.6.90.22', 2022-01-05): [34, 82]}
In [9]: new_df=pd.DataFrame({"group": df_groupby.groups.keys(), "dns_queries":[" ".join(df.loc[k].dn
...: s_query.values) for k in df_groupby.groups.values()]})
In [10]: new_df
Out[10]:
group dns_queries
0 (154.147.200.177, 2022-01-01) dynamicreal-time.org
1 (154.147.200.177, 2022-01-02) districtuser-centric.io dynamicreal-time.org i...
2 (154.147.200.177, 2022-01-03) nationalintegrate.name
3 (154.147.200.177, 2022-01-04) productstrategic.org internalrepurpose.io dyna...
4 (50.104.137.53, 2022-01-01) corporateseize.org districtuser-centric.io int...
5 (50.104.137.53, 2022-01-02) dynamice-commerce.biz globalsyndicate.name dyn...
6 (50.104.137.53, 2022-01-03) internalrepurpose.io
7 (50.104.137.53, 2022-01-04) nationalintegrate.name productstrategic.org di...
8 (55.209.226.216, 2022-01-01) corporateseize.org districtuser-centric.io int...
9 (55.209.226.216, 2022-01-02) forwardintuitive.io internalrepurpose.io dynam...
10 (55.209.226.216, 2022-01-03) productstrategic.org nationalintegrate.name co...
11 (55.209.226.216, 2022-01-04) districtinterfaces.info
12 (55.209.226.216, 2022-01-05) dynamicreal-time.org
13 (64.0.100.231, 2022-01-01) districtinterfaces.info dynamicreal-time.org
14 (64.0.100.231, 2022-01-02) dynamice-commerce.biz nationalintegrate.name g...
15 (64.0.100.231, 2022-01-03) internalrepurpose.io
16 (64.0.100.231, 2022-01-04) productstrategic.org corporateseize.org
17 (64.0.100.231, 2022-01-05) forwardintuitive.io districtuser-centric.io fo...
18 (77.236.52.126, 2022-01-01) districtuser-centric.io productstrategic.org
19 (77.236.52.126, 2022-01-02) dynamice-commerce.biz
20 (77.236.52.126, 2022-01-03) districtinterfaces.info nationalintegrate.name...
21 (77.236.52.126, 2022-01-04) globalsyndicate.name forwardintuitive.io inter...
22 (77.236.52.126, 2022-01-05) dynamicreal-time.org internalimplement.name di...
23 (89.121.211.207, 2022-01-01) corporateseize.org
24 (89.121.211.207, 2022-01-02) nationalintegrate.name internalimplement.name ...
25 (89.121.211.207, 2022-01-03) globalsyndicate.name districtuser-centric.io d...
26 (89.121.211.207, 2022-01-04) dynamice-commerce.biz nationalintegrate.name
27 (89.121.211.207, 2022-01-05) forwardintuitive.io
28 (93.247.220.198, 2022-01-01) internalrepurpose.io
29 (93.247.220.198, 2022-01-02) dynamicreal-time.org dynamice-commerce.biz nat...
30 (93.247.220.198, 2022-01-03) internalimplement.name corporateseize.org
31 (93.247.220.198, 2022-01-04) productstrategic.org internalimplement.name
32 (93.247.220.198, 2022-01-05) forwardintuitive.io districtinterfaces.info
33 (94.6.90.22, 2022-01-02) globalsyndicate.name corporateseize.org intern...
34 (94.6.90.22, 2022-01-03) dynamice-commerce.biz productstrategic.org
35 (94.6.90.22, 2022-01-04) productstrategic.org internalrepurpose.io dist...
36 (94.6.90.22, 2022-01-05) nationalintegrate.name forwardintuitive.io
In [11]: from sklearn.feature_extraction.text import CountVectorizer
In [12]: vectorizer = CountVectorizer(lowercase=False, tokenizer=lambda s: s.split())
In [13]: X = vectorizer.fit_transform(new_df["dns_queries"].values)
In [14]: X.toarray()
Out[14]:
array([[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1],
[1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
[0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1],
[1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0],
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0],
[0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1],
[0, 1, 1, 0, 1, 0, 2, 0, 1, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0],
[0, 0, 1, 1, 2, 0, 1, 0, 0, 1, 0],
[1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1],
[0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1],
[0, 1, 1, 0, 1, 0, 0, 0, 2, 0, 1],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0]])