High Performance Computing for Machine Learning Intrusion Detection Based on Nonnegative Matrix Factorization

Project: Other project

Project Details

Description

With the advent of connected objects, the Internet of Things (IoT), and large-scale data generated by the internet, there is strong interest in the research community in handling this Big Data and using Machine Learning (ML) algorithms to efficiently extract meaningful information (features) that supports decision making in near real time. The volume of data is so large that it cannot be stored on a single computer, and there is a lack of distributed ML algorithms for processing large-scale datasets. Nonnegative Matrix Factorization (NMF), an approximation method for an input matrix $A \approx WH$, is a useful unsupervised Machine Learning technique widely used in data mining for dimension reduction (low-rank factors $W$ and $H$), clustering, and factor analysis [1, 2]. Similar to Principal Component Analysis (PCA), which requires orthogonality to separate the principal components (PCs), NMF requires that both the components stored in the matrix $W$ and their weights or coefficients stored in the matrix $H$ be nonnegative. For many real-world data the nonnegativity is inherent and the factors have a natural interpretation, which can be an advantage over PCA, where negative coefficients may cause cancellation effects. NMF is used in a wide range of applications such as text mining [3], computer vision, bioinformatics [4], and community detection in social networks, to name a few; input matrices with millions of rows are not exceptional. Formally, given a nonnegative input matrix $A$ of size $m \times n$, the NMF problem is to find two low-rank factors, an $m \times k$ nonnegative matrix $W$ and a $k \times n$ nonnegative matrix $H$, such that $A \approx WH$, by solving $\min_{W \ge 0,\, H \ge 0} \|A - WH\|_F$, where $\|\cdot\|_F$ is the Frobenius norm [5]. Typically, most of the available factorization methods, such as Multiplicative Updates (MU) [5], Hierarchical Alternating Least Squares (HALS) [6, 7], Stochastic Gradient Descent [8], and Block Principal Pivoting (ANLS-BPP) [8], alternate between optimizing $W$ and $H$ while keeping the other factor fixed [9].
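The alternating scheme described above can be illustrated with a minimal single-machine sketch of the Multiplicative Updates rule. It uses MU rather than Block Principal Pivoting purely for brevity. The function names, the random initialization, the iteration count, and the small constant `eps` that guards against division by zero are all assumptions made for illustration and are not part of the project.

```python
import numpy as np

def nmf_multiplicative_updates(A, k, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative m x n matrix A by W (m x k) times H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    # Strictly positive random start (an assumption; other initializations exist).
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Update H while W is kept fixed.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        # Update W while H is kept fixed.
        W *= (A @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

def relative_error(A, W, H):
    """Frobenius-norm error of the factorization, relative to the norm of A."""
    return np.linalg.norm(A - W @ H, "fro") / np.linalg.norm(A, "fro")
```

Because every update multiplies a nonnegative factor by a ratio of nonnegative quantities, $W$ and $H$ stay nonnegative without any explicit projection step.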
As mentioned above, the adjacency matrix in social network community detection, for example, can be of the order of billions of nodes. A distributed-memory High Performance Computer and an efficient distributed NMF algorithm are needed to meet both the storage and the computing requirements. In this project we aim to implement a distributed NMF based on Block Principal Pivoting and to study its extension to MU or HALS in order to explore the inherent parallelism within those methods. NMF methods are iterative techniques based on matrix-matrix multiplication, which may involve collective communications. Parallel models based on MapReduce (Hadoop) are simple to implement but expensive in terms of performance: in each iteration, data must be read from and written to disk in addition to being communicated. Our objective is to develop an efficient Message Passing Interface (MPI) implementation of NMF on our LUBAN HPC, taking into account data locality, task load balancing, and reduced communication. Our target application is intrusion detection, as nowadays there is a huge number and variety of attacks on the internet. The parallel or distributed NMF algorithm will be used to find and classify normal network traffic and the different classes of attack, the number of which will be controlled by the rank $k$.
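To show where collective communication enters such an iteration, the sketch below simulates a row-block distribution of $A$ and $W$ in plain NumPy. No MPI is used: the hypothetical helper `allreduce_sum` only stands in for the sum an MPI all-reduce would deliver to every process. The sketch again uses the MU rule rather than Block Principal Pivoting, and it assumes that $H$ is small enough to be replicated on every process, which may not hold for the largest inputs.

```python
import numpy as np

def split_rows(M, p):
    """Partition the rows of M into p contiguous blocks, one per simulated process."""
    return np.array_split(M, p, axis=0)

def allreduce_sum(local_values):
    """Stand-in for an MPI all-reduce: the sum every process would receive."""
    return sum(local_values)

def distributed_mu_step(A_blocks, W_blocks, H, eps=1e-10):
    """One MU iteration where process i owns row blocks A_i and W_i."""
    # Each process forms small k x n and k x k partial products from its own rows.
    WtA = allreduce_sum([W_i.T @ A_i for W_i, A_i in zip(W_blocks, A_blocks)])
    WtW = allreduce_sum([W_i.T @ W_i for W_i in W_blocks])
    # H is replicated, so every process applies the same update.
    H = H * WtA / (WtW @ H + eps)
    # The W update needs only local rows plus the replicated H: no communication.
    HHt = H @ H.T
    W_blocks = [W_i * (A_i @ H.T) / (W_i @ HHt + eps)
                for W_i, A_i in zip(W_blocks, A_blocks)]
    return W_blocks, H
```

In this layout only the two reduced products cross process boundaries, and their sizes depend on $k$ and $n$ but not on the number of rows $m$, which is one way to keep communication low while the large matrix stays local.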

Acronym: TTotP
Status: Not started

Keywords

  • Big Data
  • High Performance Computing
  • Machine Learning
  • Intrusion Detection
  • Internet of Things (IoT)
