Abstract

Recently, some graph-based methods have been proposed for malware detection. However, current malware is generally characterized by sophisticated behaviors, which makes graph-based malware detection extremely challenging. To address this issue, we propose a graph repartition algorithm by transforming API call graphs into fragment behaviors based on programs’ dynamic execution traces. The proposed algorithm relies on the N-order subgraph (NSG) for constructing the appropriate fragment behavior. Moreover, we improve the term frequency-inverse document frequency- (TF-IDF-) like measure and information gain (IG) to extract the crucial N-order subgraph (CNSG). This novel behavioral representation and improved extraction method can accurately represent crucial behaviors of malware. Experiments on 4,400 samples demonstrate that the proposed method achieves a high accuracy of 99.75% in malware detection and promising performance of 95.27% in malware classification.

1. Introduction

Malware refers to any software that aims at damaging or infiltrating computer systems [1]. The fast-growing malware variants pose a serious threat to malware detection. According to Symantec’s 2018 Internet Security Threat Report (ISTR), the number of malware variants reached 669,947,865 in 2017, doubling that of 2016 [2]. Moreover, the advent of new technologies has contributed to the increasing complexity of malware. Facing numerous and sophisticated malware variants, malware detection is urgently needed (e.g., see [3–5]).

Among the existing methods, malware detection is mainly divided into static and dynamic analysis methods [6]. Static analysis methods are processes of analyzing instructions and structures to confirm program functions [7]. They do not need to run malware directly. Unfortunately, static analysis methods are sensitive to sophisticated obfuscation instructions and encryption techniques. Aiming at the shortcomings of the static analysis methods, dynamic analysis methods are proposed. The advantage of dynamic methods is that they observe behaviors by running programs in a virtual environment. Dynamic methods commonly address the threat of statically obfuscated malware [8] and encryption techniques. Obfuscated malware samples can change their code syntax while preserving their semantics [9]. Dynamic analysis is an effective way to recognize malware behavior. For example, when we want to analyze the keylogger, dynamic analysis can help us find the keylogger’s log file and trace the information.

The program generally relies on Application Programming Interface (API) calls provided by the operating system to accomplish its functions. Hence, a program’s execution trace can be obtained by monitoring the stream of API calls [10]. The API call graph is constructed by tracing API calls and their arguments, which is an effective method to indicate program behavior [11]. Considerable effort has been expended to identify malware by using the API call graph [12, 13]. With the advent of sophisticated malware samples, the API call graph is becoming more and more complex [14, 15]. The major issue facing malware detection is computational complexity [16, 17] of graph matching. Moreover, it is a great challenge to construct a graph that is general enough to classify malware.

Our method differs from previous methods in that it avoids costly matching of whole API call graphs. We propose a novel method that divides the API call graph into fragment behaviors. Moreover, we extract crucial behaviors by applying the term frequency-inverse document frequency- (TF-IDF-) like measure [18, 19] and information gain (IG) [20, 21]. Finally, we utilize Random Forest (RF) [22], Support Vector Machine (SVM) [23], Decision Tree (DT) [24, 25], and K-Nearest Neighbor (KNN) [26] for malware classification. We aim to enhance the malware classification performance by constructing an appropriate behavioral representation of the malware family. The main contributions of this method are as follows:
(1) We propose a graph repartition algorithm to extract fragment behaviors from original API call graphs. The extracted fragment behavior is a graph-based API sequence that preserves the dependency of the API call graph.
(2) We extract crucial behaviors by improving the feature extraction measure. The improved extraction measure, which combines TF-IDF and IG, shows great advantages in malware classification.
(3) The proposed method achieves promising performance in both malware detection and classification. The experimental results demonstrate that the extracted crucial behaviors can accurately describe malware activities.

The rest of this paper is organized as follows: Section 2 reviews the related work. Section 3 introduces some basic notations. Section 4 presents the proposed method, which consists of the system overview, graph repartition, and feature extraction modules. Experiment and evaluation are described in Section 5. The limitations of the proposed method are discussed in Section 6, which is followed by the conclusion in Section 7.

2. Related Work

Programs generally perform various activities by taking advantage of different predefined API calls. API calls provide valuable information to identify potential exceptions and malicious activities. A considerable amount of research has been devoted to the API call sequence. Eskandari et al. [27] proposed a dynamic malware detection system that explores system calls via API calls. In addition, they extracted API calls from the log file and used the n-gram method to generate 4-gram API call sequences. Lee et al. [28] utilized the Cuckoo sandbox to execute programs dynamically. They extracted API behavior data and transformed API calls into sequences by using the n-gram method. Moreover, they calculated the frequency of sequences. After that, the cosine similarities of API sequences were calculated among different programs. Finally, the malware samples that were similar to each other were grouped.

Hansen et al. [29] presented a scalable dynamic analysis method by injecting programs into parallel virtual environments. The parallel virtual environment is implemented by developing the setup of the Cuckoo sandbox. They extracted labels and features from samples. The extracted features consist of API calls and their input arguments which include registry and DLLs. After that, they proposed two representation methods for malware detection and classification.

As mentioned above, the common parts of API call sequences can be utilized to identify the similarities of malware samples. The sequence-based approach is a relatively simple way to describe malware behavior. However, sequence-based methods only preserve temporal information between API calls, which makes them vulnerable to reordered or irrelevant API calls. Some methods have been proposed to address the drawbacks of sequence-based methods, such as deep learning-based models and more comprehensive feature representation.

Amin et al. [30, 31] explored bidirectional long short-term memory for building an antimalware system to detect static opcodes of malware. In addition, they designed a deep learning model of generative adversarial networks to detect Android malware.

D’Angelo et al. [32] transformed API call sequences which are invoked by apps during their execution to API-images. They autonomously extracted the most representative and discriminating features by applying autoencoders. The deep learning-based model shows great advantages in malware detection.

On the other hand, the API call graph is proposed to capture comprehensive relations (such as argument dependency) between API calls [33]. Park et al. [34] monitored the execution of programs and then constructed weighted directed behavioral graphs that represent kernel objects, object attributes, and dependencies between kernel objects. In addition, they proposed a method to generate a common behavioral graph by clustering individual behavioral graphs.

Elhadi et al. [11] presented a static analysis system; the proposed system read samples and then extracted API calls and their parameters under a secure environment. They classified API call graphs based on sequence dependence, data dependence, declaration dependence, and API dependence. For each kind of dependence, they constructed an API call graph. Finally, they integrated four kinds of API call graphs into an integrating API call graph and calculated the similarity between graphs.

Nikolopoulos and Polenakis [35] proposed a graph-based model based on dynamic taint analysis. The proposed model is constructed by exploring main properties of system-call dependency graphs. They adopted the Euclidean distance-based Δ-similarity metric for malware detection and the SaMe-NP similarity metric for malware classification.

Programs generally accomplish tasks by executing similar behaviors or repeating a behavior multiple times. The more similar or repeated behaviors occur, the more duplicated nodes or subgraphs appear. The drawback arises with sophisticated behavior, which results in high-dimensional features and requires more computation [36]. Furthermore, a behavioral graph that is too specific is unsatisfactory because it may miss the minor changes in malware variants [37]. Likewise, a behavioral graph that is not specific enough commonly leads to benign samples being judged as malware. A large body of work has concentrated on investigating accurate approximation methods for these problems.

Fredrikson et al. [37] mined significant behaviors from samples based on the data dependence graph. The mined significant behaviors are then utilized to synthesize an optimally discriminative specification based on concept analysis and simulated annealing [38] algorithm. The focus of this proposed method is to reduce the size of the graph [39].

Alam et al. [40] put forward the “Annotated Control Flow Graph” and “Sliding Window of Difference and Control Flow Weight” to reduce the effects of obfuscations. The proposed Annotated Control Flow Graph provides a quick graph matching method by dividing itself into many smaller Annotated Control Flow Graphs. The proposed Sliding Window of Difference and Control Flow Weight captures the semantics of the control flow and helps in malware detection.

Ding et al. [41] constructed an API dependency graph by tracing taint data. After that, they proposed a dependency graph pruning algorithm for pruning a dependency graph. Finally, they constructed a common behavioral graph based on the pruned dependency graph. The proposed common behavior graph prunes similar and repeated behaviors.

We provide a comprehensive summary of malware detection and classification work in Table 1. To simplify the representation of the graph, we propose a novel graph repartition algorithm. The proposed algorithm constructs fragment behaviors that describe crucial activities of the malware. The novel and simplified representation of fragment behavior preserves the dependency of the API call graph and effectively avoids problems in graph matching. This novel behavioral representation is designed to provide a better malware classification performance.

3. Basic Notation

We explain some notations in this section: subgraph, N-order subgraph, crucial N-order subgraph, and TF-IDF.

An API call graph is commonly represented by a directed acyclic graph which consists of nodes and edges. If an API call A is associated with API call B, an edge is established from node A to node B. That is, edges represent dependencies among different types of nodes (e.g., network, registry, and file system).

An API call graph defines specific behaviors. We annotate root and leaf nodes with labels in each API call graph. After that, we extract the full execution paths from the root node to the leaf nodes in an API call graph. These no-branching execution graphs extracted from the API call graph are represented as subgraphs in our system.

N-order subgraph is extracted from the subgraph by sliding a window of size N.

Definition 1. (N-Order Subgraph (NSG)). NSG is a graph in which the maximum number of nodes does not exceed N.
NSGS stands for the NSG set:
$$\mathrm{NSGS} = \{\mathrm{NSG}_1, \mathrm{NSG}_2, \ldots, \mathrm{NSG}_k\},$$
where $k$ is the number of NSGs in NSGS.
An NSG that carries crucial information is chosen as an indicator of malware. We call this a crucial NSG.

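Definition 2. (Crucial N-Order Subgraph (CNSG)). CNSG is a subset of NSGS, and it contains the crucial information of NSGS. It can be described as follows:
$$\mathrm{CNSG} = \{\mathrm{NSG}_i \mid \mathrm{NSG}_i \in \mathrm{NSGS},\ \mathrm{Coef}(\mathrm{NSG}_i)\ \text{is among the highest in NSGS}\},$$
where $\mathrm{Coef}(\mathrm{NSG}_i)$ is the crucial coefficient of the N-order subgraph $\mathrm{NSG}_i$.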
TF-IDF is a numerical statistic in information retrieval. It reflects the importance of words in a document. TF refers to the number of times a given word appears in a document. IDF measures the general importance of words [42].

Definition 3. TF-IDF is the product of TF and IDF.
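Given a document $d_j$ and a document set $D$ which has $|D|$ documents, $D = \{d_1, d_2, \ldots, d_{|D|}\}$. The word in the dataset is represented as $w_i$. TF-IDF is calculated as follows:
$$tf_{i,j} = \frac{n_{i,j}}{|d_j|}, \qquad idf_i = \log\frac{|D|}{|\{j : w_i \in d_j\}|}, \qquad tfidf_{i,j} = tf_{i,j} \times idf_i,$$
where $tf_{i,j}$ represents the frequency of the word $w_i$ in document $d_j$, $n_{i,j}$ is the number of times the given word $w_i$ appears in document $d_j$, and $|d_j|$ represents the dimension of the document $d_j$. $idf_i$ reflects the inverse document frequency of the word $w_i$ in the document set $D$, and $|\{j : w_i \in d_j\}|$ is the number of documents which contain $w_i$.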
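As an illustration of Definition 3, the following is a minimal Python sketch of the standard TF-IDF computation. The function name `tf_idf` and the toy documents (lists of API names) are assumptions made for this example and are not taken from the proposed system.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: tf-idf} dict per document.

    tf  = occurrences of the word in the document / length of the document
    idf = log(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs at least once.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Hypothetical toy corpus: each "document" is a list of API names.
docs = [
    ["RegCreateKey", "RegQueryValue", "RegSetValue", "RegQueryValue"],
    ["NtCreateFile", "NtClose"],
    ["NtCreateFile", "NtQueryInformationFile"],
]
scores = tf_idf(docs)
```

In this toy corpus, a word that occurs in a single document (such as NtClose) receives a higher score than a word of equal frequency that is shared by two documents (such as NtCreateFile).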

4. The Proposed Method

4.1. Malware Classification System Overview

Our method consists of three parts: graph repartition, feature extraction, and malware classification. The whole process of the proposed system is outlined in Figure 1. Graph repartition consists of two modules: the subgraph construction module and the NSG construction module. The subgraph construction module extracts subgraphs from API call graphs, which are constructed based on the registry, filesystem, process, services, network, and synchronization. The NSG construction module extracts NSGs from the subgraphs produced by the subgraph construction module. Our goal is to build the appropriate behavioral representation and extract CNSG by using the improved TF-IDF-like measure in feature extraction. In the last step, RF, SVM, DT, and KNN are used for malware classification. The following are the steps of our proposed system.

Step 1. Extract subgraphs
We extract subgraphs from API call graphs of malware and benign samples. Icons with different shapes and colors represent various API calls in Figure 1. We can see that four API call graphs are listed in different rectangles. The subgraph construction module extracts five different subgraphs from four API call graphs.

Step 2. Build fragment behavior of NSG
NSG is obtained through an API call repartition algorithm based on the sliding window. We illustrate 3SG and 4SG in Figure 1. Icons that are not in the shadow refer to the parts that need to be discarded. Because NSG retains the dependencies of the API call graph, it preserves more complex semantic information than API sequences.

Step 3. Extract crucial behavior of CNSG
We adopt the TF-IDF-like measure and IG to calculate the crucial coefficient of NSG. The NSG with the higher crucial coefficient is selected as the significant behavior (i.e., CNSG) in our method.

Step 4. Malware classification
For each program analyzed in Cuckoo sandbox, we use some classifiers (e.g., RF, SVM, DT, and KNN) to identify whether the program is benign or malware. We obtain the appropriate CNSG in this process by comparing the performance of the experiments.

4.2. Graph Repartition

In this section, we propose a graph repartition algorithm that reconstructs the API call graph to NSG. The purpose of the proposed algorithm is to build the appropriate fragment behavior of malware families by pruning similar behaviors.

Figure 2 shows the trace extracted from the log file generated through the Cuckoo sandbox. This is part of the input that the API call graph is built from. In line 1, one can see that the malware creates and opens a registry. After that, it repeatedly retrieves and sets the data. On lines 7 and 8, the malware creates a file and changes its information. In line 9, the program retrieves the information of the file. It closes the file on line 10. On lines 11 to 13, the program creates, retrieves, and closes another file.

In Figure 3, $G_1$, $G_2$, and $G_3$ are three API call graphs built from lines 1 to 6, lines 7 to 10, and lines 11 to 13 in Figure 2, respectively. As illustrated in Figure 3, each API call forms a node of a graph, and the arguments are utilized to connect two API calls based on dependencies. For example, the API call of line 1 in Figure 2 is labeled as RegCreateKey (Handle => 0x0000044c, Registry => 0x80000001, SubKey => …proxyTool). The value 0x0000044c of Handle is used to connect the RegQueryValue on line 2. The details of API call graph construction are described in our previous work [43]. It is necessary to extract crucial behaviors from the API call graph for malware classification.

For each API call graph, we first identify the root and leaf nodes. The root node is a node with no input information, and the leaf node is the node whose output is null in our system. Also, we need to extract subgraphs from the established API call graph. The extracted subgraphs are simple no-branching graphs that start at the root node and end with the leaf node. We obtained all subgraphs as follows.

When there is only one branch in the API call graph, the extracted subgraph is the same as the API call graph. The subgraph of $G_1$ is
{RegCreateKey, RegQueryValue, RegSetValue, RegQueryValue, RegSetValue, RegQueryValue}

There are multiple branches in $G_2$ and $G_3$. In this case, we need to extract different no-branching subgraphs based on root and leaf nodes.

Subgraphs of $G_2$ are as follows:
{NtCreateFile, NtSetInformationFile}
{NtCreateFile, NtQueryInformationFile}
{NtCreateFile, NtClose}

Subgraphs of $G_3$ are as follows:
{NtCreateFile, NtQueryInformationFile}
{NtCreateFile, NtClose}
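The extraction of no-branching subgraphs can be sketched as a root-to-leaf path enumeration over a directed acyclic graph. The following Python sketch is illustrative only: the adjacency-dictionary representation and the function name `root_to_leaf_paths` are assumptions, and nodes are keyed by API name for brevity, whereas a real API call graph would need unique node identifiers because the same API can be called more than once.

```python
def root_to_leaf_paths(graph, root):
    """Enumerate every no-branching path from the root to a leaf in a DAG.

    `graph` maps each node to the list of its successors; a node without
    successors is treated as a leaf.
    """
    paths = []
    stack = [(root, [root])]
    while stack:
        node, path = stack.pop()
        successors = graph.get(node, [])
        if not successors:
            paths.append(path)
            continue
        # Push in reverse so that paths come out in successor order.
        for nxt in reversed(successors):
            stack.append((nxt, path + [nxt]))
    return paths


# Hypothetical encoding of the file-related graph built from lines 7 to 10.
file_graph = {
    "NtCreateFile": [
        "NtSetInformationFile",
        "NtQueryInformationFile",
        "NtClose",
    ],
}
subgraphs = root_to_leaf_paths(file_graph, "NtCreateFile")
```

For this graph the sketch yields three paths, one per leaf, which correspond to the three subgraphs listed for the graph built from lines 7 to 10.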

We divide the subgraph into fragment behaviors through a sliding window. The behavior in a sliding window is a fragment behavior. A fragment behavior is a set of API calls that accomplishes a function or part of one. The extracted fragment behavior is represented by NSG in our method. The size of the sliding window determines the maximum number of nodes (N) in an NSG.

We show the process of extracting some NSGs from the subgraph of $G_1$ in Figure 4. When N = 3 (3SG), SG1 in the first window is the first 3SG extracted from the subgraph. The sliding window of size 3 slides from top to bottom at intervals of 1. Different colors of the window reflect the moving trail of the sliding window. We can see that the fragment behavior in the fourth window is the same as the fragment behavior in the second window. Hence, three unique 3SGs of SG1, SG2, and SG3 are extracted from the subgraph:
SG1: {RegCreateKey, RegQueryValue, RegSetValue}
SG2: {RegQueryValue, RegSetValue, RegQueryValue}
SG3: {RegSetValue, RegQueryValue, RegSetValue}

For $G_2$ and $G_3$, we notice that the number of nodes in each subgraph is smaller than 3 when we want to build 3SG. In this condition, the subgraph is regarded as a 3SG, which does not need to be divided. When all NSGs are extracted from all subgraphs of a program, we obtain NSGS. This set is used to represent the program behavior. The combination of NSGs contains the complete semantic information of a program’s API call graph. Our goal is to describe malware with an appropriate fragment behavioral representation by searching for N.
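The sliding-window step can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation; the function name `extract_nsgs` and the representation of a subgraph as a list of API names are assumptions.

```python
def extract_nsgs(subgraph, n):
    """Slide a window of size n (step 1) over a path and keep unique windows.

    A subgraph with fewer than n nodes is returned whole, as a single NSG.
    """
    if len(subgraph) < n:
        return [tuple(subgraph)]
    seen = set()
    nsgs = []
    for start in range(len(subgraph) - n + 1):
        window = tuple(subgraph[start:start + n])
        if window not in seen:  # drop repeated fragment behaviors
            seen.add(window)
            nsgs.append(window)
    return nsgs


# The registry subgraph from the running example.
registry_subgraph = [
    "RegCreateKey", "RegQueryValue", "RegSetValue",
    "RegQueryValue", "RegSetValue", "RegQueryValue",
]
three_sgs = extract_nsgs(registry_subgraph, 3)
short_sg = extract_nsgs(["NtCreateFile", "NtClose"], 3)
```

Applied to the registry subgraph with N = 3, the four windows collapse to the three unique 3SGs SG1, SG2, and SG3, while a two-node subgraph is kept whole.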

Algorithm 1 describes our proposed API call repartition algorithm. The input of this algorithm is the API call graph ($G$). For an API call graph $G_i$, we first search for the root and leaf nodes. Lines 3 and 4 describe the root node and leaf nodes in an API call graph. We extract subgraphs from an API call graph in line 5. The extracted subgraphs start from the root node and end with leaf nodes. It is worth mentioning that the simple paths from the root node to a certain leaf node in an API call graph may occur more than once. For each subgraph in line 8, if the order of a subgraph is smaller than N, NSG is the same as the subgraph. Otherwise, we should extract the appropriate NSG based on the sliding window. The output of this algorithm is NSG. In this algorithm, we transform the original API call graph into fragment behaviors NSG.

Input: API call graph ($G$)
Output: NSGS // the set of N-order subgraphs
(1) Begin
(2) For $i = 1$ to $n$ do // $G_i$ represents an API call graph in a sample
(3)   Find the root node $r$ in $G_i$
(4)   Find leaf nodes ($l$) in $G_i$
(5)   Extract subgraphs from $r$ to leaf nodes
(6) End
(7) Obtained all extracted subgraphs $S$
(8) For $j = 1$ to $m$ do // $S_j$ represents a subgraph and $m \ge n$
(9)   If the order in $S_j$ is smaller than $N$
(10)   NSG = $S_j$
(11)  Else
(12)   Divide $S_j$ into NSG
(13)  End
(14) End
(15) Output NSGS
(16) End

The subgraph is a no-branching fragment behavior extracted from API call graphs. Hence, the number of subgraphs is no less than the number of API call graphs ($m \ge n$). In the same way, the number of NSGs extracted from the subgraphs is no less than the number of subgraphs. An important open question is the value of N. Our goal is to describe malware well with an appropriate N that is as small as possible.

API call repartition algorithm removes two types of similar behaviors: internal similarity and external similarity. Internal similarity refers to the similarity of NSG in a subgraph. External similarity represents the similarity of NSG among different subgraphs. Internal similarity is generally caused by repeatedly executing API calls. For example, if a program repeatedly invokes RegQueryValue and RegSetValue in Figure 4, then two repeated 3SGs of {RegQueryValue, RegSetValue, RegQueryValue} are generated. In this condition, we select {RegQueryValue, RegSetValue, RegQueryValue} only once. External similarity is commonly caused by API calls that perform the same type of behaviors among different subgraphs. For example, NtCreateFile outputs two different values FileHandle 0x000000a0 and FileHandle 0x0000010c on lines 7 and 11, respectively. As a result, 3SGs of {NtCreateFile, NtQueryInformationFile} and {NtCreateFile, NtClose} are the same type of behaviors in different subgraphs. Therefore, we select {NtCreateFile, NtQueryInformationFile} and {NtCreateFile, NtClose} only once in the program. Removing similar behaviors can help to simplify the representation of the graph.
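The removal of internal and external similarity can be sketched by collecting the windows of all subgraphs of a program into one duplicate-free collection. The Python sketch below is an assumption-laden illustration: the function name `build_nsgs` and the list-of-paths input are hypothetical, and two NSGs are treated as the same behavior when their API name sequences are identical.

```python
def build_nsgs(subgraphs, n):
    """Collect the unique NSGs of a program from all of its subgraphs."""
    nsgs = []
    seen = set()
    for path in subgraphs:
        if len(path) < n:
            windows = [tuple(path)]  # short subgraph: kept whole
        else:
            windows = [tuple(path[i:i + n]) for i in range(len(path) - n + 1)]
        for window in windows:
            # One shared `seen` set removes repeats inside a subgraph
            # (internal similarity) and across subgraphs (external similarity).
            if window not in seen:
                seen.add(window)
                nsgs.append(window)
    return nsgs


program_subgraphs = [
    ["RegCreateKey", "RegQueryValue", "RegSetValue",
     "RegQueryValue", "RegSetValue", "RegQueryValue"],
    ["NtCreateFile", "NtSetInformationFile"],
    ["NtCreateFile", "NtQueryInformationFile"],
    ["NtCreateFile", "NtClose"],
    ["NtCreateFile", "NtQueryInformationFile"],  # same behavior, other file
    ["NtCreateFile", "NtClose"],                 # same behavior, other file
]
nsgs = build_nsgs(program_subgraphs, 3)
```

With N = 3, the six subgraphs of the running example reduce to six unique fragment behaviors: three registry 3SGs and three file-related fragments.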

4.3. Feature Extraction

The characteristics of a program are represented as the fragment behaviors NSGS by applying the API call repartition algorithm, which eliminates similar behaviors. To remove unimportant ones, we need to calculate the crucial coefficient of each NSG in NSGS. We propose a method that exploits the idea of TF-IDF and IG to evaluate the importance of an NSG.

We have four malware families and different types of benign samples in our proposed system. Different types of benign samples are defined as one benign family. Hence, we have five categories; the category set is represented as $C = \{c_1, c_2, \ldots, c_{|C|}\}$, where $|C| = 5$. Each family has $M$ samples (in our proposed system, $M = 880$).

The main idea of TF-IDF is that a fragment behavior NSG is suitable as a crucial behavior when it appears with a high frequency (TF) in one category and with a low frequency in other categories. For IDF, an NSG is suitable as a CNSG when it appears in a small number of categories.

We consider that a fragment behavior NSG appears $p$ times in a category ($c_i$). In addition, it appears $q$ times in other categories except for $c_i$. Hence, the fragment behavior appears ($p + q$) times altogether. NSG is a crucial behavior of $c_i$ when $p$ is large enough, which means that the value of TF is very high. However, the value of IDF is relatively small because of the large ($p + q$).

We present the improved TF-IDF-like measure by applying IG, which is described in our previous work [20]. IG is defined as how much information the feature brings to the system. The more information a feature brings to the system, the more important the feature is. The fragment behavior NSG appears $p$ times in the category $c_i$ and appears $q$ times in other categories except for $c_i$. When $p$ is large enough, the value of IG is sufficient to select NSG as a crucial behavior.

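Based on the TF-IDF and IG, we derive a symbolic expression for calculating the coefficient of NSG as follows:
$$\mathrm{Coef}(\mathrm{NSG}_i) = \alpha \cdot T(\mathrm{NSG}_i) + (1 - \alpha) \cdot I(\mathrm{NSG}_i),$$
where $T(\mathrm{NSG}_i)$ represents the value of the TF-IDF-like measure and $I(\mathrm{NSG}_i)$ stands for the value calculated by IG. The improved TF-IDF-like method balances the contributions of the TF-IDF-like measure and IG by finding an appropriate α (0 < α < 1). In the TF-IDF-like measure, $n_{i,j}$ is the number of times $\mathrm{NSG}_i$ appears in family $c_j$, $|c_j|$ is the dimension of $c_j$, and $d_i$ is the number of samples which contain $\mathrm{NSG}_i$.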
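To make the weighting concrete, the following Python sketch combines a TF-IDF-like value with an IG value. It rests on two assumptions that the text does not confirm: the coefficient is taken to be a convex combination in which α weights the TF-IDF-like term and (1 − α) weights IG, and IG is computed with the textbook definition for a binary "NSG present or absent" feature. All function names are hypothetical.

```python
import math


def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(present, absent):
    """Textbook IG of a binary feature (assumed form, not from the paper).

    present[k] / absent[k]: samples of category k that contain / lack the NSG.
    """
    totals = [p + a for p, a in zip(present, absent)]
    n = sum(totals)
    n_present, n_absent = sum(present), sum(absent)
    conditional = ((n_present / n) * entropy(present)
                   + (n_absent / n) * entropy(absent))
    return entropy(totals) - conditional


def crucial_coefficient(tfidf_value, ig_value, alpha):
    """Assumed convex combination: alpha * TF-IDF-like + (1 - alpha) * IG."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    return alpha * tfidf_value + (1 - alpha) * ig_value


# Toy example: two categories of four samples each.
ig_perfect = information_gain(present=[4, 0], absent=[0, 4])  # fully separating
ig_useless = information_gain(present=[2, 2], absent=[2, 2])  # uninformative
coefficient = crucial_coefficient(tfidf_value=0.2, ig_value=ig_perfect, alpha=0.7)
```

Under these assumptions, an NSG that perfectly separates the two categories receives the maximal IG of one bit, so it keeps a noticeable coefficient even when its TF-IDF-like value is small.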

5. Results

This section describes the dataset and the evaluation method in Section 5.1. Section 5.2 shows the experiment and evaluation results.

5.1. Dataset and Evaluation Method

To show that our method is effective in detecting malware, a set of malware classification experiments is presented in this section. To ensure the fairness and effectiveness of the experiment, we selected the same number of samples from each malware family; the families are Delf, Small, Zlob, and OBfuscated. To prevent confusion with obfuscated malware, we use OBfuscated, which begins with two uppercase letters, to represent the Trojan.Win32.Obfuscated family. In addition, we downloaded 880 benign samples from different websites. More precisely, benign samples consist of Desk Widget, Facebook Messenger, Google Earth, Matlab, Minclock, and Quicktime player.

Ubuntu is selected as the operating system to run a standard Cuckoo sandbox. First, we process malware samples in bulk using the Cuckoo sandbox. Each sample was executed several times. After a comprehensive analysis, the samples that performed malicious behaviors were selected for experimental analysis. File-less malware can delete all the files it saves on the infected system disk, inject code into running processes, and use PowerShell, Windows Management Instrumentation, and other technologies to make detection and analysis difficult. This antianalysis method can bypass hooks deployed in automated analysis sandboxes (such as the Cuckoo sandbox). This article does not focus on file-less malware or other evasion techniques. Second, to ensure the fairness and effectiveness of the experimental results, we select 880 samples for each malware family and for the benign family. Finally, we perform 10-fold cross-validation. In 10-fold cross-validation, we divide all dataset samples into ten parts while preserving the proportion of each family. Each time, nine parts are used for training and the remaining part for testing. The experiments are repeated ten times, and the accuracy is the average of the experimental results.
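The family-preserving split described above can be sketched as a stratified 10-fold partition. The following Python sketch only illustrates the splitting logic; the function name `stratified_folds`, the fixed random seed, and the round-robin dealing strategy are assumptions, and classifier training is left out.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, preserving each family's proportion."""
    by_family = defaultdict(list)
    for index, family in enumerate(labels):
        by_family[family].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_family.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)  # deal samples round-robin
    return folds


families = ["Delf", "Small", "Zlob", "OBfuscated", "benign"]
labels = [family for family in families for _ in range(880)]
folds = stratified_folds(labels)

# Each fold serves once as the test set; the other nine form the training set.
splits = []
for test_fold in folds:
    train = [i for fold in folds if fold is not test_fold for i in fold]
    splits.append((train, test_fold))
```

With 880 samples per family, every fold holds 88 samples of each family, so each test set contains 440 samples and each training set 3,960.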

In our proposed malware classification method, TP, FN, FP, TN, TPR, and FPR in the formulas are defined in Table 2.

This definition uses Delf as an example. Delf is a malware family in our work.

TP represents the number of samples that belong to Delf and are correctly classified as Delf.

FN is the number of samples that belong to Delf but are not classified as Delf.

FP indicates the number of samples that do not belong to Delf but are classified as Delf.

TN indicates the number of samples that do not belong to Delf and are not classified as Delf.

The common performance measure of accuracy is defined as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
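The one-versus-rest counts for a family such as Delf, and the measures derived from them, can be computed as in the following Python sketch. The accuracy, TPR, and FPR formulas used here are the standard definitions, assumed to match those of Table 2; the function names and the toy labels are hypothetical.

```python
def one_vs_rest_counts(true_labels, predicted_labels, family):
    """Count TP, FN, FP, and TN for one family against all other classes."""
    tp = fn = fp = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == family:
            if predicted == family:
                tp += 1
            else:
                fn += 1
        elif predicted == family:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Standard accuracy, true positive rate, and false positive rate."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "tpr": tp / (tp + fn),
        "fpr": fp / (fp + tn),
    }


# Hypothetical predictions for six samples.
truth = ["Delf", "Delf", "Delf", "Small", "Zlob", "benign"]
predicted = ["Delf", "Small", "Delf", "Small", "Delf", "benign"]
counts = one_vs_rest_counts(truth, predicted, "Delf")
delf_metrics = metrics(*counts)
```

In this toy example, one Delf sample is missed and one Zlob sample is wrongly labeled as Delf, which gives two true positives, one false negative, one false positive, and two true negatives.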

5.2. Experiment and Evaluation Results

RF, SVM, DT, and KNN are employed to evaluate the detection effectiveness of our method and to explore the impact of α on malware classification. We vary α from 0.1 to 0.9 and observe its effect on each classifier. The effect of α on the accuracy of different CNSGs and classifiers is shown in Figure 5. The horizontal axis of Figure 5 indicates the value of α, and the vertical axis is the average accuracy obtained from the 10-fold cross-validation.

We can see from Figure 5 that RF performs well in malware classification based on the behavioral fragment CNSG. In Figure 5(a), the average accuracy of RF is higher than that of the other classifiers for all values of α. The average accuracy of RF first increases and then decreases as α increases, and it reaches the optimal value when α is equal to 0.7. In Figure 5(b), when the value of α is between 0.1 and 0.3, SVM achieves the best average accuracy. When α is greater than 0.3, RF achieves the best average accuracy, which increases slowly and reaches the optimal value when α is equal to 0.9. In Figures 5(c) and 5(d), the average accuracy of RF is better than that of the other three classifiers, and the highest average accuracy is achieved when α is equal to 0.6 and 0.7, respectively.

It can be seen from Figure 5 that with the change of α, the average accuracy of CNSG classified by different classifiers has obvious changes. In other words, exploring changes in α has a positive impact on malware classification. α is an indispensable factor in malware classification. We can conclude that the IG can well compensate for the shortcomings of the TF-IDF-like measure when the optimal value of α is obtained.

To prove the validity of our improved TF-IDF-like measure, we compare the TF-IDF-like measure with our proposed method. Table 3 describes the average accuracy of the TF-IDF-like measure and the improved TF-IDF-like measure. We can see from Table 3 that with different classifiers and CNSGs, the improved TF-IDF-like measure is better than the TF-IDF-like measure, in most cases.

The experimental results also demonstrate that IG can make up for the deficiency of the TF-IDF in malware classification. When α is 0.9, C4SG has the highest classification accuracy (when the classifier is RF), which is as high as 95.27%. Based on the experimental results, we select C4SG as the final fragment behavior.

For malware detection, we select the optimal value of α obtained in malware classification. We draw a Receiver Operating Characteristic (ROC) curve in Figure 6. The horizontal axis of Figure 6 represents FPR, and the vertical axis of Figure 6 is TPR. The ROC curve reflects the trade-off between FPR and TPR. It can be calculated from Figure 6 that the accuracy is as high as 99.7% with an FPR of 1.2%. The experimental results show that C4SG is promising in malware detection.

For malware classification, an example of the ROC curve is depicted in Figure 7. It illustrates the classification performance of C4SG detected by RF. Four pictures with the detection performance of Delf, OBfuscated, Small, and Zlob are presented. Figure 7(a) describes the detection performance of Delf. In Figure 7(a), we compare the ROC curve of API sequence (4 gram), C4SG, and subgraph. We can see from Figure 7(a) that the performance of C4SG is better than the subgraph and API sequence and the performance of the subgraph is better than the API sequence. Figure 7(b) describes the detection performance of OBfuscated. We can see from Figure 7(b) that both C4SG and subgraph obtained better detection performance than API sequence and C4SG is better than the subgraph. Figures 7(c) and 7(d) represent the detection performance of Small and Zlob, separately. In Figures 7(c) and 7(d), C4SG has better detection performance than the subgraph and API sequence, and the detection performance of the subgraph is better than the API sequence.

Subgraph and C4SG contain many API calls and their dependencies. Hence, the semantics in subgraph and C4SG are richer than in the API sequence. C4SG achieves a better detection performance than the subgraph. This indicates that the C4SG we built is suitable for malware classification.

For malware detection and classification, we make a comparison with some related models, i.e., Fredrikson et al. [37], Alam et al. [40], and Ding et al. [41], in Table 4. Our malware detection result compares favorably with related studies. For malware classification, the authors of [41] have surpassed our results. We note that Delf, Small, and Zlob in our experiment share some malicious behaviors, which may be an important cause of the reduction in classification accuracy.

6. Discussion

We summarize the limitations of the system in this section. In addition, we suggest possible solutions to these limitations.

The main premise of our proposed malware classification method is that we observe malicious activities by executing samples in the Cuckoo sandbox. Sandboxes are widely used for detecting malware in dynamic analysis. Nevertheless, certain malware samples can evade detection by recognizing the virtual environment and refraining from malicious operations. In addition, malware writers can also use some methods (e.g., delays) to restrain malicious operations during analysis. Executing malware in multiple analysis environments is an effective way to detect evasive samples.

The dataset for malware analysis is relatively small. Larger numbers of malware samples may have better results. Therefore, more samples are needed to implement a large multiclassification. This is also the work we want to do in the future.

The proposed method is very promising for family classification, but there are mispredictions. In our experiments, some Delf samples are detected as Small and Zlob. The main reason for this misclassification is that these families share some malicious behaviors. Delf generally downloads and runs files from a designated IP and port, causing the malware to run automatically on remote hosts. Small usually infects a computer and connects to remote servers to download malware. Zlob is a Trojan that provides unauthorized remote access to infected computers. That is to say, Delf, Small, and Zlob all perform remote connection operations and behave similarly to each other. In addition, the graph-based sequence may have certain limitations. Therefore, in future work, we intend to explore the similarity of CNSG in the form of the graph.

To overcome the shortcomings of traditional detection models, we also need to explore some state-of-the-art models, e.g., Amin et al. [30], Amin et al. [31], and D’Angelo et al. [32]. We will explore deep learning-based methods to improve the detection rate of malware.

7. Conclusions

In this paper, we propose a dynamic malware analysis method that relies on novel feature representation and extraction for malware classification. The proposed feature representation measure transforms malware behavior into fragment behavior. Moreover, the improved feature extraction measure is utilized to extract crucial behaviors of malware families. The experimental results show that the proposed C4SG achieves a promising performance of 95.27% detected by RF in malware classification.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was funded by the National Natural Science Foundation of China under Grant no. 61601041 and the Fundamental Research Funds for the Central Universities under Grant no. 2019PTB-003.