Wednesday, June 5, 2019
Machine Learning in Malware Detection
1.0 Background Research

The concept behind malware was first described in 1949 by John von Neumann, in his work on self-replicating programs. Ever since, more and more malware has been created. Antivirus companies are constantly looking for the most effective method of detecting malware. One of the best-known methods used by antivirus companies is signature-based detection. Over the years, however, the amount of malware has grown uncontrollably, and in recent years signature-based detection has proven ineffective against this growth. In this research, I have chosen another approach, which is to apply machine learning to malware detection. Using the dataset from the Microsoft Malware Classification Challenge (BIG 2015), I will look for an algorithm that can detect malware effectively with a low false positive error.

1.1 Problem Statement

With the growth of technology, the amount of malware is also increasing day by day. Malware is now designed with mutation characteristics, which causes an enormous growth in the number of malware variants (Ahmadi, M. et al., 2016). In addition, with the help of automated malware generation tools, novice malware authors can now easily generate new variants of malware (Lanzi, A. et al., 2010). With this growth in new malware, traditional signature-based detection has proven ineffective against the vast number of variants (Feng, Z. et al., 2015). Machine learning methods, on the other hand, have proved effective against new malware. At the same time, machine learning methods for malware detection have a high false positive rate (Feng, Z. et al., 2015).

1.2 Objective

To investigate how to apply machine learning to malware detection in order to detect unknown malware.
To develop malware detection software that uses machine learning to detect unknown malware.
To validate that malware detection based on machine learning can achieve a high accuracy rate with a low false positive rate.

1.3 Theoretical / Conceptual Framework

1.4 Significance

Machine learning based malware detection with high accuracy and a low false positive rate will free end users from the fear of malware damaging their computers. Organizations, in turn, will have more secure systems and files.

2.0 Literature Review

2.1 Overview

Traditional security products use virus scanners to detect malicious code. These scanners rely on signatures, which are created by reverse engineering a malware sample. Now that malware has become polymorphic or metamorphic, the traditional signature-based detection method used by antivirus products is no longer effective against the current malware problem (Willems, G., Holz, T. & Freiling, F., 2007). In current anti-malware products, the malware analysis process has two main tasks: malware detection and malware classification. In this paper, I am focusing on malware detection, whose main objective is to detect malware present in the system. There are two types of analysis for malware detection: dynamic analysis and static analysis. For effective and efficient detection, the use of feature extraction is recommended (Ahmadi, M. et al., 2016). Among the various detection methods, the one we are using detects malware through its hex and assembly files. Features are extracted from both the hex view and the assembly view of the malware files. After the features of each category have been extracted, all categories are combined into one feature vector for the classifier to run on (Ahmadi, M. et al., 2016).
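The fusion step described above can be illustrated with a minimal sketch. It assumes each feature category has already been computed as a list of numbers. The category names and values below are hypothetical and are not taken from Ahmadi et al. or from the BIG 2015 dataset.

```python
def fuse_features(categories: dict[str, list[float]]) -> tuple[list[float], dict[str, slice]]:
    """Concatenate per-category features into one flat feature vector.

    Categories are visited in sorted name order so that every sample
    ends up with the same layout. The returned layout maps each
    category name to the slice it occupies in the fused vector.
    """
    vector: list[float] = []
    layout: dict[str, slice] = {}
    for name in sorted(categories):
        start = len(vector)
        vector.extend(categories[name])
        layout[name] = slice(start, len(vector))
    return vector, layout


# Hypothetical feature values for a single sample (illustration only).
sample = {
    "hex_entropy": [7.2],
    "asm_opcode_counts": [3.0, 1.0, 0.0],
    "hex_byte_bigrams": [0.5, 0.25],
}
vector, layout = fuse_features(sample)
```

Keeping the layout alongside the vector makes it possible to check later which category a given column came from, for example when comparing feature groups.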
For feature selection, binary files are separated into blocks, and the blocks are analyzed for similarities between malware binaries. This reduces the analysis overhead and makes the process faster (Kim, T.G., Kang, B. & Im, E.G., 2013). To build a learning model, the extracted features and their labels are passed to a classification method, for example Random Forest, a neural network, KNN, or many others, but a Support Vector Machine (SVM) is recommended when noise is present in the extracted features and labels (Stewin, P. & Bystrov, I., 2016). To generate a result, the learning model is tested on a labeled dataset to produce a graph showing the detection rate and the false positive rate. To find the best result, the process is repeated with other classification methods, and each resulting model is tested on the same dataset. The best result is the graph with the highest detection rate and the lowest false positive rate (Lanzi, A. et al., 2010).

2.2 Dynamic and Static Analysis

Dynamic analysis runs the malware in a simulated environment, usually a sandbox, where the malware is executed and its behavior is observed. There are two approaches to dynamic analysis: comparing images of the system taken before and after the malware executes, and monitoring the malware's actions during execution with the help of a debugger. The first approach yields a report similar to what can be obtained through binary observation, while the second is more difficult to implement but gives a more detailed report on the behavior of the malware (Willems, G., Holz, T. & Freiling, F., 2007).

Static analysis studies the malware without executing it, which makes this method safer than dynamic analysis. With this method, we disassemble the malware executable into a binary file and a hex file.
The opcodes within both files are then studied and compared with a pre-generated opcode profile in order to search for malicious code within the malware executable (Santos, I. et al., 2013).

All malware detection requires either static analysis or dynamic analysis. In this paper, we focus on static analysis (Ahmadi, M. et al., 2016). This is because dynamic analysis has a drawback: it can only analyze one malware sample at a time, so the whole analysis process takes a long time when many samples need to be analyzed (Willems, G., Holz, T. & Freiling, F., 2007). Static analysis is mainly used to analyze hex code files and assembly code files. Compared to dynamic analysis, it takes much less time and is more convenient, since a scan of all the files can be scheduled at once, even offline (Tabish, S.M., Shafiq, M.Z. & Farooq, M., 2009).

2.3 Feature Extraction

For effective and efficient classification, it is wise to extract features from both the hex view file and the assembly view file in order to obtain complementary data from both (Ahmadi, M. et al., 2016).

Several types of feature are extracted from the hex view file and the assembly view file: N-gram, Entropy, Image Representation, String Length, Symbol, Operation Code, Register, Application Programming Interface, Section, Data Define, and Miscellaneous (Ahmadi, M. et al., 2016).

The N-gram feature is usually used to classify sequences of actions in different areas. During feature extraction, N-grams can capture the sequence of malware execution (Ahmadi, M. et al., 2016).

The Entropy feature measures the uncertainty in a series of bytes in the malware executable file. This uncertainty depends on the amount of information in the executable file (Lyda, R. & Hamrock, J., 2007).
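The N-gram and Entropy features described above can be sketched in a few lines. This is a minimal illustration that assumes byte-level N-grams and Shannon entropy over the raw bytes of a file. The function names are hypothetical, and the cited papers may compute these features differently, for example over sliding windows.

```python
import math
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every run of n consecutive bytes in the data."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, 8.0 at most."""
    if not data:
        return 0.0
    total = len(data)
    # Sum p * log2(1/p) over the observed byte values.
    return sum((c / total) * math.log2(total / c) for c in Counter(data).values())


bigrams = byte_ngrams(b"\x01\x02\x01\x02", n=2)
low = shannon_entropy(b"\x00" * 16)        # a single repeated byte value
high = shannon_entropy(bytes(range(256)))  # every byte value exactly once
```

Entropy close to 8 bits per byte suggests compressed or encrypted content, which is why it is used to flag packed executables.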
For the Image Representation feature, the malware binary file is read as a vector of 8-bit values and then organized into a 2D array. The 2D array can be visualized as a grayscale image in which each gray level corresponds to a byte of the file. This feature looks for patterns that malware binary files have in common (Nataraj, L. et al., 2011).

For the String Length feature, we open each malware executable in hex view and extract all ASCII strings from it. Because it is difficult to extract only the actual strings without also extracting other, non-useful elements, the important strings have to be chosen from among those extracted (Ahmadi, M. et al., 2016).

For the Operation Code feature: an operation code, also known as an opcode, is a type of instruction syllable in machine language. In malware detection, the different opcodes and their frequencies are extracted and compared with those of non-malicious software, since different sets of opcodes are characteristic of malware and of non-malware (Bilar, D., n.d.).

For the Register feature, how often each register is used can help in malware classification, because register renaming is used to make malware harder to analyze and detect (Christodorescu, M., Song, D. & Bryant, R.E., 2005).

For the Application Programming Interface feature: API calls are code that calls the functions of other software, which in our case is the Windows API. Malicious and non-malicious software use an immense number of API call types, which makes them hard to tell apart. Because of this, we focus on the API calls most frequently used in malware binaries in order to narrow down the result (Top maliciously used APIs, 2017).

For the Data Define feature: not all malware contains API calls, and malware without any API calls mainly consists of data definition instructions, usually db, dw, and dd. There are sets of features (DP) built on these instructions that are able to characterize such malware (Ahmadi, M. et al., 2016).
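The Operation Code and Data Define features described above both come down to counting the first token of each assembly line. The sketch below assumes a simplified listing in which every line starts with the mnemonic and comments start with a semicolon. Real disassembler output usually carries address and label prefixes that would have to be stripped first, and the sample lines are made up for illustration.

```python
from collections import Counter

# Data definition directives named in the text.
DATA_DEFINE = ("db", "dw", "dd")


def mnemonic_counts(asm_lines: list[str]) -> Counter:
    """Count the leading mnemonic of each non-empty assembly line."""
    counts: Counter = Counter()
    for line in asm_lines:
        tokens = line.split(";", 1)[0].split()  # drop any trailing comment
        if tokens:
            counts[tokens[0].lower()] += 1
    return counts


def data_define_ratio(counts: Counter) -> float:
    """Share of all counted lines that are db, dw, or dd directives."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(counts[m] for m in DATA_DEFINE) / total


# Hypothetical, simplified assembly listing (illustration only).
listing = [
    "push ebp",
    "mov ebp, esp",
    "db 90h ; padding",
    "dd 0",
    "mov eax, 1",
]
counts = mnemonic_counts(listing)
ratio = data_define_ratio(counts)
```

A sample dominated by db, dw, and dd lines gets a ratio close to 1, which is the situation the Data Define feature is meant to capture.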
For the Miscellaneous feature, we choose a few keywords that most malware has in common from the disassembled malware files (Ahmadi, M. et al., 2016).

Among so many features, the most appropriate for our research are N-gram and Opcode. This is because these two features have been shown to give the highest accuracy with low logloss. They appear frequently in malware files, and well-known feature sets for malware already exist for them. The drawback of N-gram and Opcode is that they require a lot of resources and time to process (Ahmadi, M. et al., 2016). We will also try other features and compare them with N-gram and Opcode to verify the result.

2.4 Classification

In this section, we do not review the algorithms or mathematical formulas behind the classifiers, but rather the properties that give each of them an advantage under certain conditions when classifying malware features. The classifiers we review are Nearest Neighbor, Naive Bayes, Decision Tree, Support Vector Machine, and XGBoost (Kotsiantis, S.B., 2007) (Ahmadi, M. et al., 2016).

Since we need a classifier to train on the malware features, we review the classifiers in order to choose the one most likely to give the best result. The Nearest Neighbor classifier is one of the simplest classification methods and is normally used in case-based reasoning. Naive Bayes usually generates simple, constrained models and is not suitable for irregular input data, which makes it unsuitable for malware classification, where the data is not regular (Kotsiantis, S.B., 2007). A Decision Tree classifies instances by sorting them through tree nodes based on their feature values, where each node represents a feature and each branch represents a value that the node can take. A Decision Tree decides true or false based on the node values, which makes it difficult to deal with unknown features that are not stored in a tree node (Kotsiantis, S.B., 2007).
A Support Vector Machine has a complex model that enables it to deal with a large number of features and still obtain good results, which makes it suitable for malware classification, as malware has a large number of features (Kotsiantis, S.B., 2007). XGBoost is a scalable tree boosting system that has won many machine learning competitions by achieving state-of-the-art results. Its advantages are that it is suitable for most scenarios and that it runs faster than most other classification techniques (Chen, T., n.d.).

As the classifier for our malware analysis, we choose XGBoost, as it is suitable for malware classification and was recommended by the winners of the Microsoft Malware Classification Challenge (Ahmadi, M. et al., 2016). We will also use a Support Vector Machine, as it too is suitable for malware classification, and we will compare its results with those of XGBoost to get a more accurate result.

References

Ahmadi, M. et al., 2016. Novel Feature Extraction, Selection and Fusion for Effective Malware Family Classification. ACM Conference on Data and Application Security and Privacy, pp.183-194. Available at http://doi.acm.org/10.1145/2857705.2857713.
Amin, M. & Maitri, 2016. A Survey of Financial Losses Due to Malware. Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies (ICTCS '16), pp.1-4. Available at http://dl.acm.org/citation.cfm?doid=2905055.2905362.
Berlin, K., Slater, D. & Saxe, J., 2015. Malicious Behavior Detection Using Windows Audit Logs. Proceedings of the 8th ACM Workshop on Artificial Intelligence and Security, pp.35-44. Available at http://doi.acm.org/10.1145/2808769.2808773.
Feng, Z. et al., 2015. HRS: A Hybrid Framework for Malware Detection. (10), pp.19-26.
Han, K., Lim, J.H. & Im, E.G., 2013. Malware analysis method using visualization of binary files. Proceedings of the 2013 Research in Adaptive and Convergent Systems, pp.317-321.
Kim, T.G., Kang, B. & Im, E.G., 2013. Malware classification method via binary content comparison. Information (Japan), 16(8 A), pp.5773-5788.
Küçüksille, E.U., Yalçınkaya, M.A. & Uçar, O., 2014. Physical Dangers in the Cyber Security and Precautions to be Taken. Proceedings of the 7th International Conference on Security of Information and Networks (SIN '14), pp.310-317. Available at http://dl.acm.org.proxy1.athensams.net/citation.cfm?id=2659651.2659731.
Lanzi, A. et al., 2010. AccessMiner: Using System-Centric Models for Malware Protection. Proceedings of the 17th ACM Conference on Computer and Communications Security (CCS '10), pp.399-412. Available at http://dl.acm.org/citation.cfm?id=1866353 and http://portal.acm.org/citation.cfm?doid=1866307.1866353.
Nicholas, C. & Brandon, R., 2015. Document Engineering Issues in Document Analysis. Proceedings of the 2015 ACM Symposium on Document Engineering, pp.229-230. Available at http://doi.acm.org/10.1145/2682571.2801033.
Patanaik, C.K., Barbhuiya, F.A. & Nandi, S., 2012. Obfuscated malware detection using API call dependency. Proceedings of the First International Conference on Security of Internet of Things (SecurIT '12), pp.185-193. Available at http://www.scopus.com/inward/record.url?eid=2-s2.0-84879830981&partnerID=tZOtx3y1.
Pluskal, O., 2015. Behavioural Malware Detection Using Efficient SVM Implementation. RACS: Proceedings of the 2015 Conference on Research in Adaptive and Convergent Systems, pp.296-301.
Santos, I. et al., 2013. Opcode sequences as representation of executables for data-mining-based unknown malware detection. Information Sciences, 231, pp.64-82.
Stewin, P. & Bystrov, I., 2016. Detection of Intrusions and Malware, and Vulnerability Assessment. Available at http://dblp.uni-trier.de/db/conf/dimva/dimva2012.html#StewinB12.
Willems, G., Holz, T. & Freiling, F., 2007. Toward automated dynamic malware analysis using CWSandbox. IEEE Security and Privacy, 5(2), pp.32-39.
Tabish, S.M., Shafiq, M.Z. & Farooq, M., 2009. Malware detection using statistical analysis of byte-level file content. Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics (CSI-KDD '09), pp.23-31. Available at http://portal.acm.org/citation.cfm?doid=1599272.1599278.
Lyda, R. & Hamrock, J., 2007. Using Entropy Analysis to Find Encrypted and Packed Malware.
Nataraj, L. et al., 2011. Malware Images: Visualization and Automatic Classification.
Bilar, D., n.d. Statistical Structures: Fingerprinting Malware for Classification and Analysis. Why Structural Fingerprinting?
Christodorescu, M., Song, D. & Bryant, R.E., 2005. Semantics-Aware Malware Detection.
Top maliciously used APIs, 2017. Available at https://www.bnxnet.com/top-maliciously-used-apis/.
Weiss, S.M. & Kapouleas, I., 1989. An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. pp.781-787.
Kotsiantis, S.B., 2007. Supervised Machine Learning: A Review of Classification Techniques. Informatica, 31, pp.249-268.
Chen, T., n.d. XGBoost: A Scalable Tree Boosting System.