A Static Malicious Javascript Detection Using SVM

WANG Wei-Hong, LV Yin-Jun, CHEN Hui-Bing, FANG Zhao-Lin

Zhejiang University of Technology

Hangzhou, China

Abstract: Malicious script, such as JavaScript, is one of the primary threats to network security. JavaScript is not only a browser scripting language that allows developers to create sophisticated client-side interfaces for web applications, but is also used to carry out attacks that steal users' credentials and lure users into providing sensitive information to unauthorized parties. We propose a static malicious JavaScript detection technique based on SVM (Support Vector Machine). Our approach combines static detection with machine learning: it analyzes and extracts malicious script features and uses SVM to classify scripts. The technique offers a high detection rate, a low false positive rate, and the ability to detect unknown attacks. In experiments on the prepared data set, we achieved excellent detection performance.


Keywords: SVM; static detection; malicious script detection

Published by Atlantis Press, Paris, France.

I. INTRODUCTION

With the rapid development of network information technology, information security issues are gaining more and more attention. Malicious scripts are one of the primary security threats to computer networks. By constructing a special web page that contains Trojans, viruses, worms, or other aggressive programs, an attacker propagates malicious scripts to the user's computer when the user accesses the page.

Based on the execution state of the malicious script, current detection methods can be divided into static analysis and dynamic analysis.

Without executing the script, static analysis uses the static characteristics and structure of a script to identify malicious scripts. For example, [1] counts malicious signatures, weights the different statistical methods with the judgment matrix method, and finally uses the weighted geometric mean method to obtain the result. This method not only requires some obvious features, but is also weak at finding unknown attacks.

Dynamic analysis runs scripts in a controlled environment and detects malicious ones by observing their execution states and processes. In [2][3], system ports, network connections, the registry, and system configuration files are monitored to detect abnormal procedures. This method has to run malicious code, which increases the risk to the system, and efficiency is also a problem.

A malicious script is special code hidden in a scripting-language file, such as a js file. Thanks to its standardized script format and grammar, we can usually obtain enough static characteristic information from the file to distinguish malicious scripts from benign ones [4]. This article uses machine learning techniques to analyze script features and proposes a static detection method based on SVM.

II. MALICIOUS SCRIPT FEATURE EXTRACTION

JavaScript [5] is a lightweight, object-based, event-driven scripting language. Embedded in HTML, JavaScript makes it possible to develop interactive Web pages that give users real-time, dynamic interaction [6]. However, JavaScript is also an attractive choice for attackers to implement their assaults and distribute them over the Internet, such as cross-site scripting attacks, SQL injection attacks and passive download attacks.

According to a survey of 90 sites in the China Education and Research Network (CERNET) in 2008, nearly one-third of the sites had been attacked, and 39% of the attacks were caused by malicious JavaScript [6]. Its characteristics make JavaScript an easy carrier for malicious programs. Two characteristics stand out: first, JavaScript, a descriptive language stored as a file, can be executed directly by the browser; second, without protection, JavaScript written into HTML can be seen and copied directly by anyone.

These characteristics have made JavaScript one of attackers' favorite tools. To address this, a sandboxing mechanism is provided to prevent malicious JavaScript from compromising the security of the client's environment [8]; it allows the code to perform only a restricted set of operations. However, the sandboxing mechanism not only brings efficiency problems but also constrains the execution of JavaScript on the client. In this paper, we turn to machine learning classification techniques to solve this problem.

To achieve this goal, features are first analyzed and extracted. Following [9], we extract 17 malicious JavaScript features, and 10 more features are added based on analysis of the data. Some of the 27 features are explained as follows.

In most benign scripts, the number of certain special functions is limited, while malicious scripts contain a relatively large number of them, such as the eval function, the escape function, and DOM-modifying functions. Exploits usually call several DOM functions in order to instantiate vulnerable components and/or create elements in the page for the purpose of loading external scripts and exploit [...]. The escape function could be called to [...] code malicious [...]. Abnormal use of special keywords, tags and strings is also considered.
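As an illustration of the function-count features described above, the following is a minimal sketch, not the authors' extractor. The list `DOM_FUNCTIONS` and the helper names are assumptions made for this example; the paper does not enumerate the exact functions it counts, and a regular expression over raw text is only an approximation of real static analysis.

```python
import re

# Hypothetical list of DOM-modifying functions, chosen for illustration only.
DOM_FUNCTIONS = ("createElement", "appendChild", "insertBefore",
                 "replaceChild", "write", "writeln")


def count_calls(source: str, name: str) -> int:
    """Count textual call sites of `name(` where `name` is a whole identifier."""
    pattern = r"\b" + re.escape(name) + r"\s*\("
    return len(re.findall(pattern, source))


def function_features(source: str) -> dict:
    """Return a few of the count-based static features for one script."""
    return {
        "eval": count_calls(source, "eval"),
        # `\b` keeps "escape" from matching inside "unescape".
        "escape_unescape": (count_calls(source, "escape")
                            + count_calls(source, "unescape")),
        "dom_modification": sum(count_calls(source, f) for f in DOM_FUNCTIONS),
    }


if __name__ == "__main__":
    sample = ('eval(unescape("%61")); document.write("<iframe>"); '
              'var e = document.createElement("div");')
    print(function_features(sample))
```

Counting on the raw text keeps the extractor static: the script is never parsed or executed, at the cost of also counting call-like text inside strings and comments.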

Unfortunately, obfuscation techniques, which were intended to protect source code, are used by attackers to circumvent this feature extraction. To reduce the impact of obfuscation, we also perform a certain degree of strength analysis [10]. Features such as the script's whitespace percentage, the maximum entropy of the strings, and the entropy of the script as a whole are measured. Table I shows the resulting features.
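The obfuscation-related measurements mentioned here (whitespace percentage, script entropy, maximum string entropy) can be sketched as follows. This is a hedged illustration rather than the paper's code: it assumes Shannon entropy over characters, and the string-literal regular expression is a simplification that ignores comments and template literals. All function names are hypothetical.

```python
import math
import re
from collections import Counter

# Simplified matcher for single- or double-quoted JavaScript string literals.
STRING_LITERAL = re.compile(r"""(["'])((?:\\.|(?!\1).)*)\1""", re.DOTALL)


def shannon_entropy(text: str) -> float:
    """Character-level Shannon entropy in bits; 0.0 for an empty string."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def whitespace_percentage(source: str) -> float:
    """Share of whitespace characters in the script, in percent."""
    if not source:
        return 0.0
    return 100.0 * sum(ch.isspace() for ch in source) / len(source)


def string_literals(source: str) -> list:
    """Return the contents of quoted string literals found in the script."""
    return [m.group(2) for m in STRING_LITERAL.finditer(source)]


def obfuscation_features(source: str) -> dict:
    strings = string_literals(source)
    return {
        "whitespace_percentage": whitespace_percentage(source),
        "script_entropy": shannon_entropy(source),
        "max_string_entropy": max((shannon_entropy(s) for s in strings),
                                  default=0.0),
    }


if __name__ == "__main__":
    print(obfuscation_features('var a = "abab";'))
```

Packed or encoded payloads tend to contain little whitespace and long, high-entropy strings, which is why these values are useful even when function names have been obfuscated away.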

TABLE I. FEATURES OF DATASET

the number of DOM modification functions
the script's whitespace percentage
the average length of the strings used in the script
the average script line length
the number of strings containing "iframe"
the number of suspicious tag strings
the length of the script in characters
the number of unescape and escape functions
the number of eval() functions
the number of setTimeout() functions
the ratio between keywords and words
the number of built-in functions used for deobfuscation
the entropy of the strings declared in the script
the entropy of the script as a whole
the number of long strings (>40)
the maximum entropy of all the script's strings
the probability of the script to contain shellcode
the maximum length of the script's strings
the number of string direct assignments
the number of string modification functions
the number of event attachments
the number of suspicious strings
the number of classid
the number of parseInt and fromCharCode
the ratio between n and line
the number of chars in hex
the number of CreateObject, ActiveXObject

III. MALICIOUS SCRIPT DETECTION BASED ON SVM

SVM, a machine learning technique, can summarize the knowledge needed to identify known malicious JavaScript and carry out a similarity search to find unknown malicious JavaScript, with a high detection rate and a low false alarm rate [11].

A. SVM

SVM (Support Vector Machine), which originated in the statistical learning theory of Vapnik et al. in 1995, focuses on pattern classification problems [12]. It is a statistical learning algorithm that classifies samples using a subset of the training samples called support vectors. In simple terms, SVM creates a feature space from the attributes in the training dataset and searches for a decision boundary, the optimal hyperplane, that separates the feature space with the maximum margin, as shown in Fig. 1.

There are two types of SVM: the linear SVM, which separates the data points with a linear boundary, and the non-linear SVM, which separates them with a nonlinear boundary.

In the linearly separable case, it is easy to find a plane in the feature space that separates the two types of samples. The optimal plane is the one with the maximum geometric margin, as the following formula shows, where $x_i$ is the feature vector of the $i$-th training sample and $y_i \in \{-1, +1\}$ is its class label:

$$\min_{\omega, b} \ \frac{1}{2}\|\omega\|^2 \quad \text{s.t.} \quad y_i(\omega \cdot x_i + b) \ge 1, \quad i = 1, \ldots, n$$

This is a convex quadratic programming problem. To solve it, the Lagrange function is first introduced to turn it into its dual problem. Slack variables $\xi_i$ and a penalty term are introduced to deal with linearly inseparable problems caused by noise. The objective function then becomes:

$$\min_{\omega, b, \xi} \ \frac{1}{2}\|\omega\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(\omega \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n$$

Linear SVM performs well on datasets that can easily be separated into two parts by a hyperplane. But some datasets are complex and difficult to classify with a linear kernel. Non-linear SVM classifiers can be used for such complex datasets.

In the non-linear case, the data are mapped into a high-dimensional space, where an optimal separating hyperplane is sought. With an appropriate mapping function, most non-linear problems can be transformed into linear problems in the high-dimensional space. However, the high-dimensional mapping also brings the curse of dimensionality, which makes computing the separating hyperplane in the feature space prohibitively expensive. The trick that solves this problem is a kernel function $K$ satisfying Mercer's condition, which computes the inner product in the feature space directly:

$$\max_{\alpha} \ \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad C \ge \alpha_i \ge 0, \quad i = 1, \ldots, n$$

Common kernel functions are the polynomial kernel, the Gaussian kernel, and the sigmoid kernel. The Gaussian kernel is a universal kernel function; with appropriately selected parameters it can achieve high accuracy. The Gaussian kernel is:

$$K(x_i, x_j) = \exp\left(-\gamma \cdot \|x_i - x_j\|^2\right)$$
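To make the Gaussian (RBF) kernel concrete, here is a small sketch of evaluating it and of using it in the standard kernel SVM decision function $f(x) = \operatorname{sign}(\sum_i \alpha_i y_i K(x_i, x) + b)$. The decision function is the textbook form and is not spelled out in the paper; the support vectors, coefficients and bias below are made-up values for illustration, not a trained model.

```python
import math


def rbf_kernel(x, y, gamma):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def decision_value(x, support_vectors, coefficients, bias, gamma):
    """Sum of alpha_i * y_i * K(x_i, x) over the support vectors, plus b.

    `coefficients[i]` is assumed to hold the product alpha_i * y_i.
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, coefficients)) + bias


def classify(x, support_vectors, coefficients, bias, gamma):
    """Return +1 or -1 depending on the sign of the decision value."""
    value = decision_value(x, support_vectors, coefficients, bias, gamma)
    return 1 if value >= 0 else -1


if __name__ == "__main__":
    # Hypothetical two-support-vector model on features scaled to [0, 1].
    svs = [[0.0, 0.0], [1.0, 1.0]]
    coefs = [1.0, -1.0]
    print(classify([0.1, 0.1], svs, coefs, bias=0.0, gamma=4.0))
    print(classify([0.9, 0.9], svs, coefs, bias=0.0, gamma=4.0))
```

Only kernel evaluations against the support vectors are needed at prediction time, so the high-dimensional mapping itself is never computed.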

Figure 1. Optimal hyperplane

B. The malicious script analysis framework based on SVM

As mentioned before, script analysis can be divided into static analysis and dynamic analysis. Here, we propose an SVM-based static analysis method, combined with machine learning classification techniques, to distinguish malicious scripts from benign scripts. The script training flowchart and the script testing flowchart are shown in Fig. 2.

a) Dataset preparation: collect enough malicious JavaScript and benign JavaScript from websites.

b) Data cleaning: clean the sample data, for example by removing comments and excess carriage returns and line feeds, which increases processing speed and accuracy.

c) Feature extraction: extract the 27 features based on the analysis above.

d) Pre-treatment: normalize the data, scaling it to [0,1]. This reduces the training error that arises when feature values are too large or too small, and it also improves efficiency.

e) Parameter tuning: WEKA is the platform used to train the models. Using the GridSearch algorithm, which traverses a grid of parameters for the binary-classification SVM, together with ten-fold cross-validation, it selects the best SVM model parameters.

f) Model training: train the SVM with the optimal parameters to obtain the best model.

g) Data prediction: use the best model to predict the classification of the test set.

Figure 2. The flowchart of malicious JavaScript

IV. EXPERIMENTAL ANALYSIS AND IMPLEMENTATION

A. Experimental Analysis

The experimental data consist of 1000 malicious JavaScript samples collected from VX Heavens [13] and 1000 benign ones from reputable sites. The dataset is divided into three parts, one third as the training set and two thirds as the test set.

Following the previous analysis, we extract the 27 features from the dataset, scale the extracted features, and convert the result into the WEKA file format.

Table II shows that SVM achieves more than 90% on both accuracy and recall, and that the accuracy on the training set even rises to 93.8%. SVM shows good accuracy even with fewer training samples.

In this paper, we choose the RBF kernel to obtain the best classification model. Two parameters are adjusted: the penalty factor C and the kernel function parameter γ.

C weighs the goal of finding the largest-margin hyperplane against the goal of minimizing the deviation of the data points. Setting C to a large value easily causes overlearning and reduces generalization performance. Setting it to a small value results in underlearning, in which all samples are classified into the stronger class. γ stands for the kernel radius and directly affects the classification performance of the SVM. With too large a value, generalization ability drops to zero; with too small a value, the ability to classify new samples is close to zero even though accuracy on the training set is high [14].

The GridSearch optimization algorithm in WEKA is used in this paper to search for the optimal parameters, with accuracy as the criterion, 1 as the step of C, and a base unit as the step of γ. The experimental results are given in Table III. When C = 27 and γ = 4, the training set accuracy reaches 96.59%, and the model trained with these parameters is taken as the best model.

As shown in Table IV, SVM achieves higher accuracy on both the training set and the test set than ADTree and NaiveBayes. NaiveBayes does not even reach 90%, while SVM has an accuracy of 94.38% on the test set. It is clear that SVM is better at handling binary classification.

These experimental results show that the static detection method based on SVM that we propose is excellent in both accuracy and detection efficiency.

TABLE II. THE WEKA FILE OF NORMALIZED EIGENVECTORS

Class       TP Rate   FP Rate   Precision   Recall   F-Measure
malicious   0.912     0.038     0.958       0.912    0.934
benign      0.962     0.088     0.919       0.962    0.940
average     0.937     0.064     0.938       0.937    0.937

ROC Area: 0.937

TABLE III. PARAMETER OPTIMIZATION WITH GRIDSEARCH

Search range: 1~128

Optimal parameter   Accuracy of training set   Accuracy of test set
C=27, γ=4           96.59%                     94.38%
C=30, γ=1           95.48%                     95.46%
C=8, γ=5            96.48%                     93.38%
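The parameter search over C and γ summarized in Table III can be sketched as an exhaustive grid loop. This is a minimal illustration under stated assumptions, not WEKA's GridSearch: `evaluate` is a caller-supplied function standing in for "train an RBF SVM with these parameters and return its ten-fold cross-validated accuracy", and `toy_cv_accuracy` together with the γ range are synthetic placeholders, not experimental data.

```python
from itertools import product


def grid_search(evaluate, c_values, gamma_values):
    """Try every (C, gamma) pair and keep the one with the highest score.

    `evaluate(c, gamma)` must return a score where larger is better,
    for example cross-validated accuracy.
    """
    best_params = None
    best_score = float("-inf")
    for c, gamma in product(c_values, gamma_values):
        score = evaluate(c, gamma)
        if score > best_score:  # ties keep the first pair encountered
            best_params, best_score = (c, gamma), score
    return best_params, best_score


def toy_cv_accuracy(c, gamma):
    """Synthetic score surface peaking at C=10, gamma=2 (illustration only)."""
    return 0.9 - 0.0001 * (c - 10) ** 2 - 0.01 * (gamma - 2) ** 2


if __name__ == "__main__":
    c_grid = range(1, 129)      # C from 1 to 128 with step 1
    gamma_grid = range(1, 9)    # hypothetical gamma range
    print(grid_search(toy_cv_accuracy, c_grid, gamma_grid))
```

In a real run the cost is dominated by `evaluate`, since each grid point requires training and validating a model, which is why the step sizes and ranges matter.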

TABLE IV. THE COMPARISON OF THE ACCURACY OF TRAINING SET AND TEST SET OF SVM, ADTREE AND NAIVEBAYES ALGORITHM

Learning algorithm   Accuracy of training set   Accuracy of test set
ADTree               94.94%                     91.68%
NaiveBayes           86.36%                     84.31%
SVM                  96.59%                     94.38%

B. System implementation

This section introduces the implementation of the prototype system for detecting malicious JavaScript. The system can directly examine a JavaScript script, or take a URL and examine the JavaScript that the page contains.

The feature extraction and SVM detection modules are developed in C, while script extraction is implemented in PHP, as shown in Figure 3.

1) Script extraction module:

This module serves as the user interface and provides the script detection service. Users can either upload a JavaScript script or submit a URL. The module analyzes the page, extracts the JavaScript, and passes it to the feature extraction module for further analysis.

2) Feature extraction module:

First, this module accepts the JavaScript from the previous module and cleans the data, removing extra blank lines, comments and so on. It then extracts the 27 features mentioned previously. Finally, the data is scaled to [0,1] to improve computational efficiency and converted into the standard input form of the next detection module.

3) SVM detection module:

The SVM model used here is trained with the optimal parameters. It accepts standard data from the feature extraction module. The detection results are then delivered to the script extraction module for display.

Figure 3. The implementation of prototype system

V. CONCLUSIONS

This paper proposed an SVM-based malicious JavaScript detection method which, based on a full analysis of the scripting language, extracts the static information of the script and improves the detection efficiency and the safety of the system without parsing or compiling the script. SVM has a good reputation in practical applications of machine learning and helps detect unknown attacks. The experimental results show that this method has high accuracy and a low false alarm rate, and can detect unknown attacks.

ACKNOWLEDGMENT

This research was supported by grants R1090569 and LY12F02039 from the Natural Science Foundation of Zhejiang Province.

REFERENCES

[1] Hao Zhang, Ran Tao, Zhiyong Li, Hua Du. The Detection Methods of Malicious Script. Ordnance Academic Journal, 2008.

[2] Ming Zhu, Qian Xu, Chunming Liu. The Analysis and Detection of Trojan. Computer Engineering and Applications, 2003.

[3] Oystein Hallaraker, Giovanni Vigna. Detecting Malicious JavaScript Code in , 2005.

[4] Min Dai, Ya-Lou Huang, Wei Wang. Trojan Detection Model Based on Static File Information. Computer Engineering, 2006, 3 (6): 176-179.

[5] D. Flanagan. JavaScript: The Definitive Guide, 4th er 2001.

[6] Yinhe Zhang, Wenxin Liang, Xinlei Li. Self-study Manual of JavaScript. Tsinghua University Press, 2008-10.

[7] Bin Liang, Jianjun Huang, Fang Liu, Dawei Wang, Daxiang Dong, Zhaohui Liang. Malicious Web Pages Detection Based on Abnormal Visibility Recognition. IEEE, 2009.

[8] V. Anupam and A. J. Mayer. "Secure Web Scripting". IEEE Internet Computing, 1998, 2 (6): 46-55.

[9] Likarish P., Jung E., Jo I. Obfuscated malicious JavaScript detection using classification techniques. IEEE: 47-54.

[10] Byung-Ik Kim, Chae-Tae Im, Hyun-Chul Jung. Suspicious Malicious Web Site Detection with Strength Analysis of a JavaScript ational Journal of Advanced Science. 2011.

[11] Xiaokang Zhang. Malicious code detection technology based on data mining and machine learning research [D]. Master's degree thesis of USTC, 2010.

[12] Vapnik V. N. The Nature of Statistical Learning Theory [M]. Springer Verlag, 2000.

[13] VX Heavens. Http:/// [EB / OL]. 2006-09-28.

[14] Xiaofei Yan, Hongwei Ge, Sheng Yan. RBF Kernel SVM and Its Application. Computer Engineering and Design, 2006.
