A Static Malicious JavaScript Detection Using SVM
WANG Wei-Hong, LV Yin-Jun, CHEN Hui-Bing,
FANG Zhao-Lin
Zhejiang University of Technology
HangZhou, China
Abstract—Malicious scripts, such as malicious JavaScript, are one of the primary threats to network security. JavaScript is not only a browser scripting language that allows developers to create sophisticated client-side interfaces for web applications, but is also used to carry out attacks that steal users' credentials and lure users into providing sensitive information to unauthorized parties. We propose a static malicious JavaScript detection technique based on SVM (Support Vector Machine). Our approach combines static detection with machine learning: it analyzes and extracts malicious script features and uses SVM to classify the scripts. The technique is characterized by a high detection rate, a low false positive rate, and the ability to detect unknown attacks. In experiments on the prepared data set, it reached an accuracy of 94.38% on the test set.
Keywords-SVM; static detection; malicious script detection

Published by Atlantis Press, Paris, France. © the authors

I. INTRODUCTION

With the rapid development of network information technology, information security issues are gaining more and more attention. The malicious script is one of the primary security threats to computer networks. By constructing a special web page that contains Trojans, viruses, worms, or aggressive programs, an attacker propagates malicious scripts to the user's computer when the user accesses the page.

Based on the execution state of the malicious script, current detection methods can be divided into static analysis and dynamic analysis.

Without executing the script, the static analysis method uses the static characteristics and structure of the script to identify malicious scripts. For example, [1] counts malicious signatures, weights the different statistical methods with the judgment matrix method, and finally uses the weighted geometric mean method to obtain the result. This method not only requires some obvious features, but is also weak at finding unknown attacks.

The dynamic analysis method runs malicious scripts in a controlled environment and detects them by observing their execution states and processes. The methods in [2][3] monitor system ports, network connections, the registry, and system configuration files to detect abnormal procedures. This approach has to run the malicious code, which increases the risk to the system, and efficiency is also a problem.

A malicious script is special code hidden in a scripting-language file, such as a js file. Thanks to the standardized script format and grammar, enough static characteristic information can usually be obtained from the file to distinguish malicious scripts from benign scripts [4]. This article uses machine learning techniques to analyze the features of scripts and proposes a static detection method based on SVM.

II. MALICIOUS SCRIPT FEATURE EXTRACTION

JavaScript [5] is a lightweight, object-based and event-driven scripting language. Embedded in HTML, JavaScript can be used to develop interactive Web pages that give users real-time, dynamic interaction [6]. However, JavaScript is also an attractive choice for attackers to implement their assaults and distribute them over the Internet, such as cross-site scripting attacks, SQL injection attacks and passive download attacks.

According to a survey of 90 sites in the China Education and Research Network (CERNET) in 2008, nearly one-third of the sites were attacked, and 39% of the attacks were caused by malicious JavaScript [6]. Its characteristics make JavaScript an easy carrier for malicious programs. JavaScript has two such characteristics: first, JavaScript, a description language delivered as a file, can be executed directly by the browser; second, without protection, JavaScript written in the HTML can be seen and copied directly by anyone.

These characteristics have made JavaScript one of attackers' favorite tools. To address this problem, a sandboxing mechanism is provided to prevent malicious JavaScript from compromising the security of the client's environment [8]; it allows the code to perform only a restricted set of operations. However, the sandboxing mechanism not only brings efficiency problems, but also constrains the execution of JavaScript on the client. In this paper, we turn to machine learning classification techniques to solve this problem.

To achieve this goal, features are first analyzed and extracted. According to [9], we can extract 17 malicious JavaScript features, and 10 more features are added based on our analysis of the data. Some of the 27 features are explained as follows.

In most benign scripts, the number of certain special functions is limited, while malicious scripts contain a relatively large number of them, such as the eval function, the escape function, and DOM-modifying functions. Exploits usually call several DOM functions in order to instantiate vulnerable components and/or create elements in the page for the purpose of loading external scripts and exploits. The escape function could be called to encode malicious code. Abnormal use of special keywords, tags, and strings is also considered.
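As a rough illustration of the function-count features described above, the following Python sketch counts calls to a few suspicious functions in a script. The name lists `SUSPICIOUS_FUNCTIONS` and `DOM_FUNCTIONS` and the regular-expression matching are assumptions made for this example; the paper does not enumerate the exact functions it counts, and its prototype is written in C rather than Python.

```python
import re

# Hypothetical name lists chosen for illustration only.
SUSPICIOUS_FUNCTIONS = ("eval", "escape", "unescape", "setTimeout")
DOM_FUNCTIONS = ("document.write", "createElement", "appendChild")


def count_calls(script: str, name: str) -> int:
    """Count call sites of `name`, ignoring longer identifiers that end with it."""
    # The lookbehind keeps "escape(" from matching inside "unescape(".
    pattern = r"(?<![\w$])" + re.escape(name) + r"\s*\("
    return len(re.findall(pattern, script))


def function_count_features(script: str) -> dict:
    """Return per-function call counts plus a total for DOM-modifying calls."""
    features = {name: count_calls(script, name) for name in SUSPICIOUS_FUNCTIONS}
    features["dom_modification"] = sum(
        count_calls(script, name) for name in DOM_FUNCTIONS
    )
    return features
```

A purely textual count like this also matches names that occur inside string literals or comments, which is one reason the data-cleaning step matters before feature extraction.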
Unfortunately, obfuscation techniques, which were intended to protect source code, are used by attackers to circumvent this feature extraction. To reduce the impact of obfuscation, we also perform a degree of strength analysis [10]: features such as the script's whitespace percentage, the maximum entropy of the strings, and the entropy of the script as a whole are measured. Table I shows the resulting feature set.
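The obfuscation-related measurements mentioned above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: entropy is taken to be character-level Shannon entropy in bits, and string literals are found with a simple regular expression (`STRING_LITERAL`, a hypothetical helper) instead of a real JavaScript tokenizer. The paper does not specify either choice.

```python
import math
import re
from collections import Counter

# Assumption: a quoted literal on one line, with backslash escapes allowed.
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*)\1""")


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for empty input."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def whitespace_percentage(script: str) -> float:
    """Fraction of characters in the script that are whitespace."""
    if not script:
        return 0.0
    return sum(ch.isspace() for ch in script) / len(script)


def obfuscation_features(script: str) -> dict:
    """Compute three of the strength-analysis features for one script."""
    strings = [m.group(2) for m in STRING_LITERAL.finditer(script)]
    return {
        "whitespace_percentage": whitespace_percentage(script),
        "script_entropy": shannon_entropy(script),
        "max_string_entropy": max(
            (shannon_entropy(s) for s in strings), default=0.0
        ),
    }
```

Packed or encoded payloads tend to have little whitespace and long, high-entropy strings, which is why these values are useful when function names have been hidden by obfuscation.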
TABLE I. FEATURES OF DATASET

No.    Feature
1-14   the number of DOM modification functions
       the script's whitespace percentage
       the average length of the strings used in the script
       the average script line length
       the number of strings containing "iframe"
       the number of suspicious tag strings
       the length of the script in characters
       the number of unescape and escape
       the number of classid
       the number of parseInt and fromcharcode
       the ratio between n and line
       the number of chars in hex
       the number of CreateObject, ActiveXObject
       the number of suspicious strings
15     the number of eval() functions
16     the number of setTimeout() functions
17     the ratio between keywords and words
18     the number of built-in functions used for deobfuscation
19     the entropy of the strings declared in the script
20     the entropy of the script as a whole
21     the number of long strings (>40)
22     the maximum entropy of all the script's strings
23     the probability of the script to contain shellcode
24     the maximum length of the script's strings
25     the number of string direct assignments
26     the number of string modification functions
27     the number of event attachments

III. MALICIOUS SCRIPT DETECTION BASED ON SVM

The machine learning technique SVM helps summarize the knowledge needed to identify known malicious JavaScript and carries out a similarity search to find unknown malicious JavaScript, with a high detection rate and a low false alarm rate [11].

A. SVM

SVM (Support Vector Machine), which originated in the statistical learning theory of Vapnik et al. in 1995, focuses on pattern classification problems [12]. It is a statistical learning algorithm that classifies samples using a subset of the training samples called support vectors. In simple terms, SVM creates a feature space from the attributes in the training dataset and searches for a decision boundary, or optimal hyperplane, that separates the feature space with the maximum interval, as shown in Fig. 1.

There are two types of SVM: the linear SVM, which separates the data points with a linear boundary, and the non-linear SVM, which separates the data points with a nonlinear boundary.

In the case of linearly separable problems, it is easy to find a plane in the feature space that separates the two types of samples. The optimal plane is the one that has the maximum geometric interval, as the following formula shows:

$$\min \frac{1}{2}\|\omega\|^{2} \quad \text{s.t.} \quad y_i(\omega^{T} \cdot x_i + b) \ge 1, \quad i = 1, \ldots, n$$

This is a convex quadratic programming problem. To solve it, the Lagrange function is first introduced to turn it into its dual problem. Slack variables and a penalty function are introduced to deal with linearly inseparable problems caused by noise. The objective function then becomes:

$$\min \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(\omega^{T} x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n$$

Linear SVM performs well on datasets that can be easily separated by a hyperplane into two parts. But sometimes datasets are complex and difficult to classify using a linear kernel. Non-linear SVM classifiers can be used for such complex datasets.

In the non-linear case, SVM maps the data into a high-dimensional space, where an optimal separating hyperplane is sought. With an appropriate mapping function, most non-linear problems can be transformed into linear problems in the high-dimensional space. However, the high-dimensional mapping also brings the curse of dimensionality, which makes it prohibitively expensive to calculate the separating hyperplane in the feature space. The kernel trick addresses this problem: the inner product in the feature space is computed with a kernel function that satisfies Mercer's condition:

$$\max \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j k(x_i, x_j) \quad \text{s.t.} \quad \alpha_i \ge 0, \quad i = 1, \ldots, n, \quad \sum_{i=1}^{n}\alpha_i y_i = 0$$

Common kernel functions are the polynomial kernel, the Gaussian kernel, and the sigmoid kernel. The Gaussian kernel is a universal kernel function; with appropriately selected parameters, it can achieve high accuracy. The Gaussian kernel is:

$$k(x_i, x_j) = \exp(-\gamma \cdot \|x_i - x_j\|^{2}), \quad \gamma > 0$$
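To make the Gaussian kernel concrete, the following Python sketch evaluates it for two feature vectors and uses it in the standard kernel SVM decision function f(x) = Σ α_i y_i k(x_i, x) + b. The decision function is the usual textbook form and is not spelled out in the paper; the support vectors, multipliers and bias used to call it would come from a trained model, and all names here are illustrative.

```python
import math


def rbf_kernel(x, z, gamma):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - z||^2), with gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, labels, alphas, bias, gamma):
    """Textbook kernel SVM decision value; its sign is the predicted class."""
    return bias + sum(
        alpha * label * rbf_kernel(sv, x, gamma)
        for sv, label, alpha in zip(support_vectors, labels, alphas)
    )
```

Because the kernel depends only on the distance between the two vectors, a larger γ makes each support vector influence a smaller neighbourhood of the feature space.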
Figure 1. Optimal hyperplane

B. The malicious script analysis framework based on SVM

As mentioned before, script analysis can be divided into static analysis and dynamic analysis. Here, we propose an SVM-based static analysis method, combined with machine learning classification techniques, to distinguish malicious scripts from benign scripts. Its script training flowchart and script test flowchart are shown in Fig. 2.

a) Dataset preparation: collect enough malicious JavaScript and benign JavaScript from web sites.

b) Data cleaning: clean the sample data, for example by removing comments and excess carriage returns and line feeds, which increases processing speed and accuracy.

c) Feature extraction: extract the 27 features based on the analysis above.

d) Pre-treatment: normalize the data by scaling it to [0,1]. This reduces the training error that arises when feature values are too large or too small, and it also improves efficiency.

e) Parameter tuning: WEKA is the platform used to train the models. Using the GridSearch algorithm for a binary classification SVM together with ten-fold cross-validation, it selects the best SVM model parameters.

f) Model training: train the best SVM model with the optimal parameters.

g) Data prediction: use the best model to predict the classification of the test set.

Figure 2. The flowchart of malicious JavaScript

IV. EXPERIMENTAL ANALYSIS AND IMPLEMENTATION

A. Experimental Analysis

The experimental data is composed of 1000 malicious JavaScript samples collected from VX Heavens [13] and 1000 benign ones from reputable sites. The dataset is divided into three parts, one third as the training set and two thirds as the test set.

Following the previous analysis, we extract the 27 features from the dataset, scale the extracted features, and convert them into the WEKA file format.

TABLE II. THE WEKA FILE OF NORMALIZED EIGENVECTORS

Class      TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
malicious  0.912    0.038    0.958      0.912   0.934      0.937
benign     0.962    0.088    0.919      0.962   0.940      0.937
average    0.937    0.064    0.938      0.937   0.937      0.937

Table II shows that SVM obtains more than 90% on both accuracy and recall, and the accuracy on the training set even rises to 93.8%. SVM shows good accuracy even with fewer training samples.

In this paper, we choose the RBF kernel to get the best classification model. Two parameters are adjusted: the penalty factor C and the kernel function parameter γ.

C is used to weigh "finding the largest-interval hyperplane" against "ensuring minimum deviation of the data points". A large value of C easily causes overlearning and reduces generalization performance. A small value results in underlearning, in which all the samples are classified into the stronger class. γ stands for the kernel radius and directly impacts the classification performance of SVM. With too large a value, generalization ability drops to zero, while with too small a value, the ability to classify new samples is close to zero even if accuracy on the training set is high [14].

The optimization algorithm GridSearch in WEKA is used in this paper to search for the optimal parameters, with accuracy as the criterion, 1 as the step of C, and the steps of γ taken as powers of a base unit. The experimental results are shown in Table III. When C = 27 and γ = 4, the training set accuracy reaches 96.59%, which gives the best model parameters for training.

TABLE III. PARAMETER OPTIMIZATION WITH GRIDSEARCH

C      γ            Optimal parameter  Accuracy of training set  Accuracy of test set
1~128  2^-10 ~ 2^6  C=27, γ=4          96.59%                    94.38%
1~128  3^-10 ~ 3^6  C=30, γ=1          95.48%                    95.46%
1~128  5^-10 ~ 5^6  C=8, γ=5           96.48%                    93.38%

As shown in Table IV, SVM gains higher accuracy on the training set and the test set than ADTree and NaiveBayes. NaiveBayes does not even reach 90%, while SVM has an accuracy of 94.38% on the test set. It is clear that SVM is better at handling binary classification.

These experimental results show that the static detection method based on SVM that we propose is excellent in both accuracy and detection efficiency.
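The parameter search described above can be sketched as an exhaustive scan over a (C, γ) grid. This is a sketch of the general idea, not WEKA's GridSearch implementation: `gamma_grid`, `grid_search` and `toy_score` are hypothetical names, and `toy_score` is a made-up stand-in for the ten-fold cross-validated accuracy that a real run would compute by training an SVM for each parameter pair. Its peak is placed at C = 27, γ = 4 only so that the result is recognizable; it is not a reproduction of the experiment.

```python
from itertools import product


def gamma_grid(base, low_exp=-10, high_exp=6):
    """Candidate gamma values base^low_exp, ..., base^high_exp."""
    return [float(base) ** e for e in range(low_exp, high_exp + 1)]


def grid_search(c_values, gamma_values, score):
    """Return (best_score, best_c, best_gamma) over every (C, gamma) pair."""
    best = None
    for c, gamma in product(c_values, gamma_values):
        current = score(c, gamma)
        if best is None or current > best[0]:
            best = (current, c, gamma)
    return best


def toy_score(c, gamma):
    """Placeholder for cross-validated accuracy; NOT real experimental data."""
    return 1.0 / (1.0 + abs(c - 27) + abs(gamma - 4.0))


# C from 1 to 128 in steps of 1, gamma as powers of 2 from 2^-10 to 2^6.
best_score, best_c, best_gamma = grid_search(range(1, 129), gamma_grid(2), toy_score)
```

With 128 values of C and 17 values of γ, this grid already requires 2176 model evaluations, each of which is a full cross-validation run in practice, so the choice of ranges and step sizes largely determines the cost of tuning.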
TABLE IV. THE COMPARISON OF THE ACCURACY OF TRAINING SET AND TEST SET OF SVM, ADTREE, AND NAIVEBAYES ALGORITHM

Learning algorithm  Accuracy of training set  Accuracy of test set
ADTree              94.94%                    91.68%
NaiveBayes          86.36%                    84.31%
SVM                 96.59%                    94.38%

B. System implementation

The implementation of a prototype system for the detection of malicious JavaScript is introduced in this section. The system can directly detect a JavaScript script, or process a URL to detect the JavaScript that the page contains.

The feature extraction and SVM detection modules are developed in C, while the script extraction module is developed in PHP, as shown in Figure 3.

1) Script extraction module:
This module serves as the user interface and provides the script detection service. Users can either upload a JavaScript script or submit a URL. The module analyzes the page, extracts the JavaScript, and passes it to the feature extraction module for further analysis.

2) Feature extraction module:
First, this module accepts the JavaScript from the previous module and cleans the data by removing extra blank lines, comments and so on. Then it extracts the 27 features mentioned previously. Finally, the data is scaled to [0,1] to improve computational efficiency and converted into the standard input form of the detection module.

3) SVM detection module:
The SVM model used here is trained with the optimal parameters. It accepts standardized data from the feature extraction module. The detection results are then delivered to the script extraction module for display.

Figure 3. The implementation of prototype system

V. CONCLUSIONS

This paper proposed an SVM-based malicious JavaScript detection method which, based on a full analysis of the scripting language, extracts the static information of the script and improves the detection efficiency and safety of the system without parsing and compiling the script. SVM has a good reputation in practical applications of machine learning and helps detect unknown attacks. The experimental results show that this method has high accuracy and a low false alarm rate, and can detect unknown attacks.

ACKNOWLEDGMENT

This research was supported by grants R1090569 and LY12F02039 from the Natural Science Foundation of Zhejiang Province.

REFERENCES

[1] Hao Zhang, Ran Tao, Zhiyong Li, Hua Du. The Detection Methods of Malicious Script. Ordnance Academic Journal, 2008.
[2] Ming Zhu, Qian Xu, Chunming Liu. The Analysis and Detection of Trojan. Computer Engineering and Applications, 2003.
[3] Oystein Hallaraker, Giovanni Vigna. Detecting Malicious JavaScript Code in , 2005.
[4] Min Dai, Ya-Lou Huang, Wei Wang. Trojan Detection Model Based on Static File Information. Computer Engineering, 2006, 3(6): 176-179.
[5] D. Flanagan. JavaScript: The Definitive Guide, 4th er 2001.
[6] Yinhe Zhang, Wenxin Liang, Xinlei Li. Self-study Manual of JavaScript. Tsinghua University Press, 2008-10.
[7] Bin Liang, Jianjun Huang, Fang Liu, Dawei Wang, Daxiang Dong, Zhaohui Liang. Malicious Web Pages Detection Based on Abnormal Visibility Recognition. IEEE, 2009.
[8] V. Anupam and A. J. Mayer. "Secure Web Scripting". IEEE Internet Computing, 1998, 2(6): 46-55.
[9] Likarish P., Jung E., Jo I. Obfuscated malicious JavaScript detection using classification techniques. IEEE :47-54.
[10] Byung-Ik Kim, Chae-Tae Im, Hyun-Chul Jung. Suspicious Malicious Web Site Detection with Strength Analysis of a JavaScript ational Journal of Advanced Science.2011.
[11] Xiaokang Zhang. Malicious code detection technology based on data mining and machine learning research [D]. Master's degree thesis of USTC, 2010.
[12] Vapnik V. N. The Nature of Statistical Learning Theory [M]. Springer Verlag, 2000.
[13] VX Heavens. Http:/// [EB/OL]. 2006-09-28.
[14] Xiaofei Yan, Hongwei Ge, Sheng Yan. RBF Kernel SVM and Its Application. Computer Engineering and Design, 2006.