TY - JOUR
AU1 - Kozachok, A. V.
AU2 - Spirin, A. A.
AB - Classification of high-entropy data sources is one of the key problems in the field of information security. Many methods exist for classifying encrypted and compressed sequences; however, most of them rely on digital signatures or on service information found in the headers of the containers used to store or transfer data. This paper analyzes the state of research in the field of classification of encrypted and compressed data and develops a model of encrypted and compressed sequences. Our experiments demonstrate a high accuracy of the proposed approach, which allows us to conclude that the methods for classifying encrypted and compressed data used in our study have been improved. The approach can be implemented in data leak prevention systems or corporate email systems to analyze the attachments sent outside the controlled perimeter of a government agency or enterprise. Purpose of the research: to develop a model of pseudo-random sequences generated by data encryption and compression algorithms that most accurately reflects the statistical properties of these sequences. Methods of the research: statistical data analysis, mathematical statistics, and machine learning. Result of the research: an analysis of the studies aimed at solving the problem of classification for encrypted and compressed sequences in the field of information security is carried out. A model of pseudo-random sequences generated by encryption and compression algorithms is developed taking into account their statistical features: distribution of bytes and distribution of subsequences of limited length, which constitute a new probabilistic space. The choice of the statistical features used in the pseudo-random sequence model is justified. Experiments for determining the hyperparameters of the classifier are carried out on a dataset generated from encrypted and compressed files without taking their headers into account. The constraints used in the pseudo-random sequence model, namely, the length of pseudo-random sequences (approximately 600 Kb), are defined. Experiments for determining the effect of the statistical features used in the model on classification accuracy are conducted. The proposed approach allows encrypted and compressed data to be classified with an accuracy of 0.97.
TI - Model of Pseudo-Random Sequences Generated by Encryption and Compression Algorithms
JF - Programming and Computer Software
DO - 10.1134/s0361768821040058
DA - 2021-07-30
UR - https://www.deepdyve.com/lp/springer-journals/model-of-pseudo-random-sequences-generated-by-encryption-and-8tEkAmqaFQ
SP - 249
EP - 260
VL - 47
IS - 4
DP - DeepDyve
ER -