A New Software Birthmark based on Weight Sequences of Dynamic Control Flow Graph for Plagiarism Detection

Abstract

With the large-scale development of open source software, software plagiarism has become a serious threat to the software industry and to intellectual property. As the latest technique for plagiarism detection, the dynamic software birthmark has attracted much attention in recent years. Most existing dynamic birthmarks focus on resisting obfuscation techniques such as compiler optimizations and the strong obfuscations implemented in tools. However, they pay little attention to packers, especially encryption packers, which are commonly used for software protection as well as for plagiarism. When a packer encrypts software, decryption code is added to the binary, and traditional dynamic birthmarks can hardly distinguish the original parts of the software from the decryption parts. In this paper, we propose a new dynamic software birthmark called the weight sequences birthmark (WSB), which is based on weight sequences of the dynamic control flow graph (DCFG). The weight sequences are used as characteristics that make full use of the different patterns of dynamic basic block replications in the original code and in the decryption code. Compared with the state-of-the-art dynamic key instruction sequence birthmark (DKISB), the new birthmark resists encryption packers effectively. Furthermore, WSB shows better credibility than DKISB when distinguishing independent programs. Comprehensive experiments show that the extended F-measure can reach 96.8%, indicating that WSB is a high-quality birthmark that satisfies both credibility and resiliency.

1. INTRODUCTION

Software plagiarism, also known as software piracy, is the illegal copying, distribution or use of unlicensed software.
According to the Business Software Alliance’s 2016 global software survey, 39% of the software installed on PCs around the world in the past year was not properly licensed, with a commercial value of $52.2 billion [1]. According to the Revulytics software piracy statistics for 2017, China has ranked first among the top 20 countries for piracy and misuse of software licenses since 2015 [2]. Software plagiarism has become a serious threat to the software industry and to intellectual property. Effective plagiarism detection methods are needed to protect software copyright and to mitigate such economic losses.

Plagiarism detection methods can be divided into three categories: source code based, watermark based and birthmark based. Source code-based detection usually relies on text similarity or structural analysis and rarely involves semantic analysis [3]. Because source code is often protected by obfuscation techniques [4], and this approach resists obfuscation poorly, its effectiveness is limited. When software is distributed only in binary form and the source code is unavailable, it does not work at all. The software watermark is the earliest and best-known approach to software protection [5]. A unique identifier (the watermark) is embedded into the protected software before it is released. The watermark is hard to remove but easy to verify, so it serves as strong evidence in a plagiarism judgment. However, it has been shown by Collberg et al. that ‘a sufficiently determined attacker will eventually be able to defeat any watermark’ [6]. A birthmark-based scheme extracts unique characteristics of the software and measures their similarity to identify plagiarism [7]. The software birthmark is superior to the other two methods because it is extracted directly from the software without requiring source code or extra data. Depending on the extraction scheme, birthmarks can be categorized as static or dynamic [8].
A static birthmark comes from disassembled code or source code, while a dynamic birthmark is extracted from runtime information obtained by monitoring execution. Static birthmarks have difficulty with obfuscated software, whose disassembled code structure and content may have been changed [9]. Furthermore, a static birthmark is invalid when the software cannot be disassembled after being obfuscated or packed [10]. A dynamic software birthmark, in contrast, bypasses the disassembly problem and resists obfuscation effectively [11]. However, most software is protected by shell technology [12], which blends the original program into the unpacking program. When a birthmark is extracted dynamically, characteristics such as instructions, APIs and system calls that come from the unpacking program are difficult to eliminate. Most current dynamic birthmarks ignore the problem of filtering out these irrelevant characteristics. As far as we know, DKISB [13] is the only birthmark that uses dynamic data flow tracking (DDFT) to filter out information unrelated to the input. However, when a password or key is used for encryption, as in MidgetPack,1 DKISB is also invalid, because the DDFT module also tracks the key or password, which does not belong to the original program input. Furthermore, the module runs out of memory during the complicated decryption process.

In this paper, to solve the above problems, we propose a new dynamic software birthmark called the weight sequences birthmark (WSB). WSB consumes less memory than DKISB because it records only the branch jump addresses and the first address of each basic block. After recording, the log is parsed offline to construct the weighted dynamic control flow graph (DCFG). The basic blocks are extracted as nodes, each labeled with its first address. The call relationships between basic blocks, which come from the branch jump records, are extracted as edges, and the number of repetitions of each call relationship is extracted as the edge weight.
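As an illustration of this construction, the following minimal Python sketch builds the node set and the weighted edges from an execution trace. It is not the authors' implementation: the trace format (an ordered list of basic-block start addresses for a single thread) and the function name `build_weighted_dcfg` are assumptions made for the example.

```python
from collections import Counter


def build_weighted_dcfg(trace):
    """Build a weighted DCFG from one thread's execution trace.

    `trace` is assumed to be the ordered list of basic-block start
    addresses seen at runtime (a hypothetical, simplified log format).
    Returns (nodes, weights), where weights maps an edge (src, dst)
    to the number of times dst was executed immediately after src.
    """
    nodes = set(trace)
    # Each pair of consecutive blocks is one traversal of an edge.
    weights = Counter(zip(trace, trace[1:]))
    return nodes, weights


if __name__ == "__main__":
    # Hypothetical trace: two blocks that alternate, followed by an exit block.
    trace = [0x400000, 0x400010, 0x400000, 0x400010, 0x400020]
    nodes, weights = build_weighted_dcfg(trace)
    for (src, dst), w in sorted(weights.items()):
        print(f"{src:#x} -> {dst:#x}: weight {w}")
```

Counting consecutive pairs keeps memory proportional to the number of distinct edges rather than to the length of the trace, which matches the idea that only block addresses and branch jumps need to be recorded.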
To avoid comparing graphs directly, which may run into the subgraph isomorphism problem, graph characteristics [14] are extracted before the similarity is compared. We traverse the graph and obtain an n-gram2 set of weight sequences as the characteristics of the software. The n-gram set is converted into a birthmark by merging numerically similar sequences into one item, which forms one dimension of the characteristic vector. The value of each dimension is the frequency of that kind of sequence. Finally, the extended cosine of two vectors is calculated as the similarity of the two birthmarks, and plagiarism is judged from this similarity and a pre-defined threshold.

The contributions of this paper are summarized as follows. First, weight sequences of the DCFG, a novel representation of software behavior, are extracted to construct a new dynamic software birthmark. Second, by exploiting the different patterns of dynamic basic block replications, the new birthmark remains effective when the program is encrypted by a packer. Third, compared with DKISB, the new birthmark shows better credibility when distinguishing independent programs, and extensive experiments on its resiliency and credibility indicate that it is a high-quality birthmark.

The rest of this paper is organized as follows. Section 2 reviews related work in the field of software plagiarism detection. Section 3 first presents an overview of the framework and then introduces the main points of WSB, including its definition, the construction of the weighted DCFG, the extraction of WSB, the extended cosine similarity algorithm and the final decision procedure. Section 4 evaluates the performance of WSB and compares it with DKISB through extensive experiments. Finally, conclusions are drawn in Section 5.

2. RELATED WORK

There are various studies of software plagiarism detection.
The earliest studies focus on source code, which is treated as text with a specific grammatical structure. For instance, Grier [15] counted the total numbers of variables, input statements, conditional statements, loop statements, assignment statements and calls as attributes for plagiarism detection in Pascal programs. Another early approach, the software watermark, was proposed by Collberg and Thomborson [5]. In this method, the watermark is embedded into the executable before its release and can later be extracted from executable files to identify whether two programs are homologous. Methods such as MOSS [16] and YAP3 [17] were based on token sequences or syntactic trees. JPLAG [18] constructed the program dependency graph (PDG) from source code. Flores et al. [19] recently proposed a series of methods on source code re-use. DeSoCoRe is a source code re-use detection tool based on natural language processing techniques, and it provides output that a human reviewer can understand. In [20], the authors presented an approach based on comparing programs at the character level, which can find potential cases of re-use across a huge number of assignments, and they reported that this approach outperformed JPLAG. They also compared a Latent Semantic Analysis (LSA) approach [21] with previously used text re-use detection models for measuring cross-language similarity in source code; the experiments showed that the LSA-based approach was slightly better than the other models. Moreover, shared tasks on plagiarism detection in source code were recently organized [22, 23].

The concept of the software birthmark was first put forward by Tamada et al. [7, 24]. Their birthmark was composed of four individual birthmarks: constant values in field variables (CVFV), sequence of method calls (SMC), inheritance structure (IS) and used classes (UC). However, it was also based on source code.
To overcome the fundamental limitation of depending on extra data such as source code, many software birthmarks based on binary code were proposed. Birthmarks can be divided into two main categories: static and dynamic.

A static birthmark comes from disassembled code or source code. Myles and Collberg [25] proposed a k-gram-based static birthmark for Java, extracted by sliding a window of length k over the static instruction sequences. Compared with Tamada’s birthmark, it showed better robustness and did not need source code. Another birthmark, based on control flow information, was proposed by Lim et al. [26]; a set of behaviors of flow paths was used as the birthmark. This birthmark was effective even when the software was aggressively modified. However, it was still susceptible to simple code obfuscation techniques such as opaque branch insertion, basic block splitting and garbage instruction insertion. Cop [27] was a binary-oriented, obfuscation-resilient prototype whose birthmark was based on the longest common subsequence of semantically equivalent basic blocks, which were modeled by a set of symbolic formulas representing the input–output relations of each block. It had stronger resiliency to code obfuscation techniques as well as to other semantics-preserving transformations. However, the huge overhead introduced by symbolic execution and constraint solving made it difficult for Cop to handle large-scale software.

A dynamic birthmark is extracted during execution. Compared with a static birthmark, it reflects the actual behavior of the program and is therefore more resilient to code obfuscation. Dynamic birthmarks can be divided into three categories: graph-based, sequence-based and set-based.

Graph-based: The concept of the dynamic software birthmark was also proposed by Myles and Collberg [28].
They presented a graph-based birthmark, the whole program path (WPP), in the form of a directed acyclic graph that uniquely identifies a program based on a complete control flow trace of its execution. They verified that it could be used to identify program theft even when an embedded watermark had been destroyed by program transformation. A similar graph-based software birthmark called the system call dependence graph (SCDG) was proposed by Wang et al. [29]. The SCDG uses system calls as vertices and the dependences between system calls as edges. They demonstrated that the SCDG can detect component theft, where only part of the code is stolen, and that it is resilient to various evasion techniques. However, birthmarks generated from control flow traces or system calls remained vulnerable to obfuscation attacks such as loop transformation and system call replacement. Another graph-based dynamic birthmark was proposed by Patrick et al. [30], who used the heap memory at runtime to construct an object reference graph (ORG) as the birthmark. They verified that their birthmark remained intact even after the tested software was obfuscated by the state-of-the-art Allatori obfuscator.3 However, graph similarity was usually computed with a subgraph isomorphism algorithm, and subgraph isomorphism is an NP-complete problem.

Sequence-based: In [31], Tamada et al. proposed a method for collecting the history of API function calls during execution and extracted two types of birthmarks from the call log: EXESEQ (sequence of API function calls) and EXEFREQ (frequency of API function calls). Their analysis showed that these birthmarks tolerate various kinds of program transformation attacks well. DAAV [32] was also a dynamic birthmark based on APIs. The dynamic API call graph (DACG) was first constructed, and the birthmark was then extracted from the DACG by converting it into a vector with a random walk.
In [33], Wang et al. proposed two system call-based software birthmarks, the System Call Short Sequence Birthmark (SCSSB) and the Input Dependant System Call Subsequence Birthmark (IDSCSB). SCSSB was extracted by removing commonly found short sequences from the system call short sequences. IDSCSB was obtained from two system call sequences, generated with two different inputs, by excluding the part that appears in both sequences. SCSSB and IDSCSB were suitable for detecting partial code theft. In addition to APIs and system calls, values that appear during execution can also be used to build birthmarks. In [34], Jhi et al. proposed an approach based on core values and implemented a prototype with a dynamic taint analyzer atop a generic processor emulator. They defined a core value as the output of a value-updating instruction that is closely related to program semantics. Their experiments showed that this birthmark was resilient to various control and data obfuscation techniques. LoPD [35] was another sequence-based birthmark, which used strict formal logic analysis to capture program semantics. Although LoPD could effectively resist various semantics-preserving code obfuscations because of its strict semantic equivalence, it was constrained by the weaknesses of symbolic execution and constraint solving.

Set-based: In [11], a dynamic k-gram-based birthmark was proposed, which used the k-gram set of instruction sequences as the unique characteristics. This birthmark was shown to be more resilient to semantics-preserving transformations than the static k-gram birthmark. However, as the scale of software increases, the set of dynamic instruction sequences becomes larger and more difficult to handle. To alleviate this problem, Lee et al. [36] proposed a dynamic birthmark based on the categorization of instructions, which reduced the size of the characteristic set without damaging the meaning of the sequences.
However, commonly used instructions accounted for the majority of the instructions executed, which made it difficult to distinguish independent software. Besides, irrelevant instructions were introduced when the software was packed for self-protection. To find the sets of instruction sequences that are closely related to program function, Tian et al. [13] proposed DKISB, which filters out unrelated instructions: it uses DDFT and records only the key instructions that are related to program input. This birthmark was resilient to various kinds of semantics-preserving code obfuscation techniques. However, when dealing with software packed by MidgetPack, it runs out of memory and is unable to extract the correct instructions.

3. METHOD

In this section, we first give an overview of the framework for software plagiarism detection based on WSB, along with some key definitions. We then state the concepts of software birthmark and dynamic software birthmark, followed by detailed descriptions of each step of WSB.

3.1. Overview

The framework consists of three main steps, as shown in Fig. 1. In the first step, DynamoRIO4 is used to record the first address of each basic block as well as the branch jumps, and it outputs the log trace. In the second step, we construct the weighted DCFG from the log. A node n in the graph is a dynamic basic block, and an edge n0→n1 indicates that node n1 is executed immediately after node n0. The weight of an edge is its number of repetitions, that is, the number of times it appears in the log. The weight sequences are then extracted by traversing the graph and converted into the birthmark. These weight sequences can be considered a set of n-grams denoted as ⟨W⃗,V⟩, where W⃗=w⃗(w0,w1,…,wn) is the weight sequence and V is the accumulated frequency of the n-gram items with the same W⃗.
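To make the ⟨W⃗,V⟩ notation concrete, the sketch below slides a window of length N over weight sequences and accumulates the frequency V of each distinct n-gram W⃗. It assumes the weight sequences have already been obtained by traversing the graph. The function names and the sample weights are hypothetical, and every window is counted, which may differ from how the original implementation treats n-grams shared between traversal paths.

```python
from collections import Counter


def weight_ngrams(weight_sequence, n):
    """Return every window of length n over one weight sequence, as tuples."""
    return [
        tuple(weight_sequence[i:i + n])
        for i in range(len(weight_sequence) - n + 1)
    ]


def accumulate_ngrams(weight_sequences, n):
    """Map each distinct n-gram W to its accumulated frequency V."""
    counts = Counter()
    for sequence in weight_sequences:
        counts.update(weight_ngrams(sequence, n))
    return counts


if __name__ == "__main__":
    # Hypothetical edge weights met along one traversal path.
    path_weights = [3, 3, 1, 3, 3, 1]
    for ngram, frequency in accumulate_ngrams([path_weights], 3).items():
        print(ngram, frequency)
```

Sequences shorter than N simply contribute no n-grams in this sketch.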
For a simplified representation, W⃗ can be converted into a polar coordinate K=(r,α) according to the following formula:

$$
\begin{cases}
r = |\vec{w}|, \\
\cos(\alpha) = \dfrac{\vec{w}\cdot\vec{e}}{|\vec{w}|\,|\vec{e}|},
\end{cases}
\tag{1}
$$

where e⃗ is the unit vector. The Weighted DCFG and the Weight Sequences Birthmark, which are the key definitions of our method, are defined in this step.

Definition 3.1 (Weighted DCFG) The Weighted Dynamic Control Flow Graph of a program is built at the granularity of dynamic basic blocks and is a directed graph given by the 3-tuple WDCFG=(N,E,W), where N, E and W satisfy the following conditions:
N is a set of nodes, where a node n∈N represents a dynamic basic block.
E is a set of edges, where an edge e(n0→n1)∈E indicates that n1 is executed immediately after n0.
W is a set of weights, where each weight w∈W corresponds to one edge and represents the number of repetitions of that edge during program execution.

Definition 3.2 (Weight Sequences Birthmark) The WSB is a dynamic software birthmark based on weight sequences of the graph defined in Definition 3.1. It is a set denoted as ⟨K,V⟩={⟨k0,v0⟩,⟨k1,v1⟩,…,⟨kn,vn⟩}, where K is a polar coordinate point given by the Euclidean norm of the item and its angle to the unit vector, and V is the accumulated frequency of the n-gram items with the same K.

Figure 1. Software Plagiarism Detection Framework based on WSB.

In the third step, the similarity of birthmarks is measured with the extended cosine algorithm. According to the similarity value and a threshold ε, a decision is made on whether plagiarism exists.

3.2. Software birthmark

According to [7], a software birthmark is defined as follows:

Definition 3.3 (Software Birthmark) Let p, q be programs and ≡cp be a given copy relation. Let f(p) be a set of characteristics extracted from p by a certain method f. Then f(p) is called a birthmark of p under ≡cp iff both of the following conditions are satisfied.
Condition 1 f(p) is obtained only from p itself (without any extra information).
Condition 2 p ≡cp q ⇒ f(p)=f(q).

Condition 1 means that the birthmark is an inherent characteristic of the program and not extra information. Regardless of how the birthmark is extracted, the only input of the method f is the program itself. Hence, it differs from a watermark, which requires extra data. Condition 2 states that the same birthmark should be obtained from copied programs. By contraposition, if the birthmarks f(p) and f(q) are different, then p ≡cp q does not hold, indicating that q is not a copy of p. Condition 2 reflects the property that a birthmark should withstand all plagiarism scenarios. In practice, it is hard to find a birthmark strong enough to satisfy this property perfectly. On the other hand, if two programs are developed independently, their birthmarks should be different. A birthmark should be able not only to identify plagiarism but also to distinguish different software.

The concept of the dynamic software birthmark proposed by Myles and Collberg [28] is defined as follows:

Definition 3.4 (Dynamic Software Birthmark) Let p, q be two programs and i be an input to these programs. Let f be a method for extracting a set of characteristics from a program. Then f(p,i) is a dynamic birthmark of p iff:
Condition 1 f(p,i) is obtained only from p itself by executing p with the given input i.
Condition 2 q is a copy of p ⇒ f(p,i)=f(q,i).

Compared with the software birthmark, the dynamic birthmark is closely related to the runtime environment.

3.3. The Proposed WSB

3.3.1. Dynamic instrumentation

To generate the log trace, DynamoRIO is used. It allows arbitrary modifications to application instructions via a powerful library. Using this tool, we implement a plug-in program that records the first address of each basic block and the branch jump that occurs at the end of the basic block when a branch instruction such as jne or je is encountered.
With a specific input, the test program is executed under DynamoRIO along with the plug-in program. Each thread has a separate log file that records its branch jumps, so a multithreaded program generates several logs.

3.3.2. Weighted DCFG construction

After the log is generated, it is parsed to construct the Weighted DCFG. Figure 2 is a simplified schematic diagram of a Weighted DCFG. The nodes A∼H are basic blocks recorded during execution. The arrow from A to B is an edge of the Weighted DCFG, and w1 is the weight of this edge. In total, the diagram contains eight nodes, nine edges and nine weights. Although no cycle appears in this diagram, cycles exist in practice because repetitions of dynamic basic blocks are very common.

Figure 2. Schematic Diagram of Weighted DCFG.

Nodes and edges are extracted from the log, and the weight of each edge is counted. As shown in Fig. 3, for a multithreaded program each log is parsed separately and the results are then aggregated. After extraction, the intermediate results are output as XML text, in which the Weighted DCFG is represented by an adjacency graph.

Figure 3. The WDCFG Construction.

3.3.3. WSB extraction

Although the graph itself can be considered a birthmark, subgraph isomorphism is an NP-complete problem, and a Weighted DCFG generally has thousands of nodes. Therefore, graph-based similarity algorithms are not applicable to the Weighted DCFG. To achieve a specific function, however, some basic operations implemented by basic blocks cannot be arbitrarily removed. Although additional operations may be added to obfuscate the original ones, the obfuscation operations are a small part of the total and the majority are still basic operations.
The patterns of the two kinds of operations are different. To take advantage of these patterns, we focus on the weights of the Weighted DCFG and extract sequences of weights to represent the characteristics of the software indirectly. To extract the weight sequences, the depth-first search algorithm is used to traverse the Weighted DCFG. An n-gram set is then generated by sliding a window of constant length N from the start of each sequence to its end [11]. Figure 4 shows the process of extracting the n-gram items from the schematic diagram in Fig. 2. In this case, N equals 4. The results are: (w1,w2,w3,w4), (w2,w3,w4,w5), (w3,w4,w5,w7), (w4,w5,w7,w9), (w2,w3,w4,w6), (w3,w4,w6,w9) and (w1,w8,w7,w9).

Figure 4. The Process of Extracting N-gram Items.

To improve the resiliency of the birthmark and to simplify its representation, the key of each item is expressed in polar coordinates and items with the same key are merged. The merged set is the WSB.

3.3.4. Similarity measurement

In the literature on birthmark-based plagiarism detection, the decision is made according to the similarity and a threshold ε. The range of ε varies between birthmarks. To measure the similarity of WSBs, each WSB is treated as a vector with each key as a dimension. The extended cosine value is then calculated to reflect the similarity of two WSBs. It is defined as follows:

Definition 3.5 (The Extended Cosine Similarity of WSB) For two WSBs A={⟨k0,v0⟩,⟨k1,v1⟩,…,⟨kn,vn⟩} and B={⟨k0′,v0′⟩,⟨k1′,v1′⟩,…,⟨km′,vm′⟩}, let S=KeySet(A)∪KeySet(B). Then the vector a⃗=(a0,a1,…,ar) is constructed according to the following rule:

$$
a_i =
\begin{cases}
v_i, & s_i \in \mathrm{KeySet}(A); \\
0, & s_i \notin \mathrm{KeySet}(A),
\end{cases}
\tag{2}
$$

where 0≤i≤r, si∈S and vi is the value of key si in A. Likewise, the vector b⃗=(b0,b1,…,br) can be constructed.
Then the extended cosine similarity of a⃗ and b⃗ can be calculated with the following formula:

$$
\text{Ex-Cosine}(\vec{a},\vec{b}) = \frac{\vec{a}\cdot\vec{b}}{|\vec{a}|\,|\vec{b}|}\times\theta,
\qquad
\theta = \frac{\min\{|\vec{a}|,|\vec{b}|\}}{\max\{|\vec{a}|,|\vec{b}|\}}.
\tag{3}
$$

The result of Ex-Cosine(a⃗,b⃗) is the similarity of the two WSBs.

3.3.5. Making decision

The primary purpose of extracting birthmarks and measuring their similarity is to judge plagiarism. To exclude the influence of random factors as far as possible, each pair of programs is tested in multiple experiments with different inputs. The average similarity over these experiments is compared with a pre-defined threshold ε to make the final decision according to the following rule:

$$
\mathrm{AvgSim}(P_A,P_B)\;
\begin{cases}
\geq 1-\varepsilon, & \text{copied}; \\
\leq \varepsilon, & \text{independent}; \\
\text{otherwise}, & \text{inconclusive}.
\end{cases}
\tag{4}
$$

When the average similarity is at least 1−ε, the pair is classified as ‘copied’; when it is at most ε, the pair is classified as ‘independent’. Otherwise, the result is inconclusive. With ε ranging from 0.05 to 0.50 and N from 2 to 8, the extended precision, recall and F-Measure defined in [37] are measured in our experiments to evaluate the proposed birthmark comprehensively. The F-Measure reaches its maximum when ε equals 0.5 and N equals 3. More detailed descriptions of the experiments are given in the next section.

4. EVALUATION AND EXPERIMENT

To evaluate the performance of the proposed birthmark, extensive experiments have been conducted to verify its credibility and resiliency. In this section, we first introduce the evaluation criteria for software birthmarks and the dataset used in our experiments. Experiments are then designed from two aspects to illustrate the advantages of the proposed birthmark. Finally, we discuss the parameters N and ε used in our experiments.

4.1. Evaluation criteria and experiment dataset

A high-quality birthmark should be able not only to identify plagiarism but also to distinguish independent software.
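Before turning to the evaluation criteria, it may help to see how the similarity values used throughout this section are computed. The following minimal Python sketch follows Definition 3.5, Eq. (3) and Eq. (4). It assumes a WSB is stored as a dictionary from key to frequency; the function names are hypothetical, and returning 0 for an empty birthmark is an assumption that the definitions do not cover.

```python
import math


def ex_cosine(wsb_a, wsb_b):
    """Extended cosine similarity of two WSBs given as {key: frequency} dicts."""
    keys = set(wsb_a) | set(wsb_b)           # S = KeySet(A) U KeySet(B)
    a = [wsb_a.get(k, 0) for k in keys]      # Eq. (2): 0 where the key is absent
    b = [wsb_b.get(k, 0) for k in keys]
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                           # assumption: empty WSBs match nothing
    dot = sum(x * y for x, y in zip(a, b))
    theta = min(norm_a, norm_b) / max(norm_a, norm_b)
    return dot / (norm_a * norm_b) * theta   # Eq. (3)


def decide(similarities, eps):
    """Apply the rule of Eq. (4) to the average similarity over several inputs."""
    avg = sum(similarities) / len(similarities)
    if avg >= 1 - eps:
        return "copied"
    if avg <= eps:
        return "independent"
    return "inconclusive"
```

The factor θ penalizes a difference in vector length: two birthmarks with identical proportions but frequencies that differ by a factor of two have an ordinary cosine of 1 and an extended cosine of 0.5.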
In the literature [25], these requirements are expressed as the following two properties:

Resiliency Let p be a program and p′ be a derivative version generated by applying semantics-preserving code transformations τ to p. Then we say the birthmark Bp is resilient to τ if sim(p,p′)≥1−ε.
Credibility Let p and q be independently developed programs which may accomplish the same task. Then we say the birthmark Bp is credible if sim(p,q)≤ε.

According to these criteria, extensive experiments are conducted to evaluate the performance of WSB. Most of the test software is taken from the DKISB experiments [13]. In addition to the obfuscated versions, we also test versions packed by MidgetPack. Table 1 lists the dataset of our experiments, excluding the plagiarized versions. Table 2 lists the obfuscation strategies and tools used in the experiments.

Table 1. Dataset of test software.

Category       Software  Version
Com&Dec        bzip2     1.0.6
Com&Dec        zip       3.0
Com&Dec        rar       5.0
Com&Dec        lzip      1.13-rc2
Com&Dec        tar       1.27
Com&Dec        pigz      2.3
Image Process  qiv       2.2.4
Image Process  feh       2.2
Image Process  pho       0.9.8
Image Process  sxiv      1
Enc&Dec        md5       8.13
Enc&Dec        OL        1.0.1
Java Program   JLex      1.4
Java Program   JCup      0.1
Java Program   Cal       0.1

Table 2. Obfuscation strategies and tools.

Strategies  Tools
Compiler    llvm3.2 and gcc4.6.3
Obfuscator  Allatori, JShrink, ProGuard, KlassMaster
Packer      Upx, MidgetPack

4.2. Credibility

To evaluate the credibility of WSB, the similarities of all pairs of programs in Table 1 are measured. There are 16 independent programs in total, which can be divided into four categories: compression&decompression (Com&Dec), image process, encryption&decryption (Enc&Dec) and Java program. Table 3 shows the similarities among 14 of them. Most of the similarities are less than 0.5, with a few exceptions. These low similarities indicate that the proposed WSB is credible. The maximum similarity in this table is 0.7, between pho and qiv, because both are image processing programs and share many image processing libraries. For comparison, the similarities are also calculated with DKISB under the same conditions, and the results are given in Table 4. The similarities among image processing software are higher with DKISB than with WSB. Moreover, the similarity of Cal and JLex with DKISB is as high as 0.998, which is so close to 1 that it would lead to exactly the opposite decision. With WSB it is only 0.547, indicating that WSB performs better on credibility.

Table 3. Credibility of WSB.

Software  gzip   bzip2  zip    rar    lzip   feh    qiv    sxiv   pho    md5    OL     Cal    JCup   JLex
gzip      -      0.055  0.475  0.214  0.180  0.164  0.063  0.157  0.047  0.517  0.275  0.185  0.078  0.113
bzip2     -      -      0.069  0.114  0.116  0.111  0.184  0.177  0.125  0.041  0.087  0.190  0.316  0.319
zip       -      -      -      0.341  0.287  0.251  0.095  0.240  0.081  0.312  0.376  0.282  0.117  0.176
rar       -      -      -      -      0.372  0.346  0.149  0.324  0.121  0.206  0.413  0.363  0.173  0.258
lzip      -      -      -      -      -      0.177  0.080  0.187  0.069  0.141  0.319  0.255  0.121  0.210
feh       -      -      -      -      -      -      0.258  0.555  0.174  0.162  0.355  0.565  0.254  0.383
qiv       -      -      -      -      -      -      -      0.327  0.700  0.062  0.150  0.253  0.570  0.361
sxiv      -      -      -      -      -      -      -      -      0.277  0.143  0.363  0.591  0.372  0.530
pho       -      -      -      -      -      -      -      -      -      0.046  0.142  0.226  0.508  0.345
md5       -      -      -      -      -      -      -      -      -      -      0.235  0.171  0.069  0.102
OL        -      -      -      -      -      -      -      -      -      -      -      0.369  0.185  0.261
Cal       -      -      -      -      -      -      -      -      -      -      -      -      0.365  0.574
JCup      -      -      -      -      -      -      -      -      -      -      -      -      -      0.528
JLex      -      -      -      -      -      -      -      -      -      -      -      -      -      -

Table 4. Credibility of DKISB.

Software  gzip   bzip2  zip    rar    lzip   feh    qiv    sxiv   pho    md5    OL     Cal    JCup   JLex
gzip      -      0.000  0.789  0.076  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000
bzip2     -      -      0.000  0.175  0.025  0.066  0.059  0.279  0.024  0.000  0.000  0.000  0.160  0.000
zip       -      -      -      0.121  0.000  0.009  0.004  0.011  0.002  0.000  0.000  0.000  0.002  0.000
rar       -      -      -      -      0.000  0.004  0.024  0.006  0.007  0.000  0.000  0.006  0.002  0.006
lzip      -      -      -      -      -      0.155  0.065  0.080  0.073  0.000  0.000  0.000  0.014  0.000
feh       -      -      -      -      -      -      0.497  0.439  0.493  0.001  0.001  0.051  0.050  0.051
qiv       -      -      -      -      -      -      -      0.717  0.892  0.053  0.024  0.816  0.035  0.813
sxiv      -      -      -      -      -      -      -      -      0.703  0.018  0.013  0.628  0.175  0.626
pho       -      -      -      -      -      -      -      -      -      0.023  0.017  0.810  0.041  0.808
md5       -      -      -      -      -      -      -      -      -      -      0.246  0.028  0.000  0.027
OL        -      -      -      -      -      -      -      -      -      -      -      0.021  0.000  0.020
Cal       -      -      -      -      -      -      -      -      -      -      -      -      0.057  0.998
JCup      -      -      -      -      -      -      -      -      -      -      -      -      -      0.065
JLex      -      -      -      -      -      -      -      -      -      -      -      -      -      -
   0.021  0.000  0.020  Cal                          0.057  0.998  JCup                            0.065  JLex                              View Large Table 4. Credibility of DKISB. Software  gzip  bzip2  zip  rar  lzip  feh  qiv  sxiv  pho  md5  OL  Cal  JCup  JLex  gzip    0.000  0.789  0.076  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  bzip2      0.000  0.175  0.025  0.066  0.059  0.279  0.024  0.000  0.000  0.000  0.160  0.000  zip        0.121  0.000  0.009  0.004  0.011  0.002  0.000  0.000  0.000  0.002  0.000  rar          0.000  0.004  0.024  0.006  0.007  0.000  0.000  0.006  0.002  0.006  lzip            0.155  0.065  0.080  0.073  0.000  0.000  0.000  0.014  0.000  feh              0.497  0.439  0.493  0.001  0.001  0.051  0.050  0.051  qiv                0.717  0.892  0.053  0.024  0.816  0.035  0.813  sxiv                  0.703  0.018  0.013  0.628  0.175  0.626  pho                    0.023  0.017  0.810  0.041  0.808  md5                      0.246  0.028  0.000  0.027  OL                        0.021  0.000  0.020  Cal                          0.057  0.998  JCup                            0.065  JLex                              Software  gzip  bzip2  zip  rar  lzip  feh  qiv  sxiv  pho  md5  OL  Cal  JCup  JLex  gzip    0.000  0.789  0.076  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  bzip2      0.000  0.175  0.025  0.066  0.059  0.279  0.024  0.000  0.000  0.000  0.160  0.000  zip        0.121  0.000  0.009  0.004  0.011  0.002  0.000  0.000  0.000  0.002  0.000  rar          0.000  0.004  0.024  0.006  0.007  0.000  0.000  0.006  0.002  0.006  lzip            0.155  0.065  0.080  0.073  0.000  0.000  0.000  0.014  0.000  feh              0.497  0.439  0.493  0.001  0.001  0.051  0.050  0.051  qiv                0.717  0.892  0.053  0.024  0.816  0.035  0.813  sxiv                  0.703  0.018  0.013  0.628  0.175  0.626  pho                    0.023  0.017  0.810  0.041  0.808  md5             
         0.246  0.028  0.000  0.027  OL                        0.021  0.000  0.020  Cal                          0.057  0.998  JCup                            0.065  JLex                              View Large 4.3. Resiliency To evaluate the resiliency of WSB, it is verified from three aspects: resilient to compilers, resilient to obfuscators and resilient to packers. The three aspects are divided according to obfuscation strategies. 4.3.1. Resilient to compilers For the same source code, different compilers and compiler optimization levers (Opt-Lv) generate different versions of executable files. A lot of plagiarism cases are in this simple way. Bzip2 and pigz, two Com&Dec software, are chosen to simulate this plagiarism scene. They respectively have 12 versions that are complied with two compilers (llvm3.2 and gcc4.6.3) and multiple optimization parameters (debug, release and o1∼o3). Table 5 shows the similarities of different bzip2 versions. It is observed that all the similarities are higher than 0.9 and some of them even equal 1, indicating that WSB is resilient to different compilers and optimization parameters for bzip2. Table 6 are the similarities of different versions of pigz which is a multithreaded program. Although most of similarities are not higher than 0.9, the lower of them stay around 0.8. All the pairs in this scene can be judged as plagiarism successfully with appropriate thresholds ε. It proves that the WSB is resilient to different compilers and optimization parameters. Table 5. bzip2: Resilient to compilers. 
| Compiler | Opt-Lv | llvm3.2 D-o1 | llvm3.2 D-o2 | llvm3.2 D-o3 | llvm3.2 R-o1 | llvm3.2 R-o2 | llvm3.2 R-o3 | gcc4.6.3 D-o1 | gcc4.6.3 D-o2 | gcc4.6.3 D-o3 | gcc4.6.3 R-o1 | gcc4.6.3 R-o2 | gcc4.6.3 R-o3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llvm3.2 | D-o1 | | 0.955 | 0.952 | 0.998 | 0.955 | 0.951 | 0.997 | 0.953 | 0.952 | 0.998 | 0.956 | 0.953 |
| llvm3.2 | D-o2 | | | 0.988 | 0.954 | 1.000 | 0.989 | 0.954 | 0.998 | 0.984 | 0.954 | 0.999 | 0.981 |
| llvm3.2 | D-o3 | | | | 0.951 | 0.988 | 0.999 | 0.951 | 0.986 | 0.995 | 0.951 | 0.989 | 0.991 |
| llvm3.2 | R-o1 | | | | | 0.954 | 0.949 | 0.999 | 0.952 | 0.951 | 1.000 | 0.955 | 0.952 |
| llvm3.2 | R-o2 | | | | | | 0.988 | 0.954 | 0.998 | 0.984 | 0.954 | 0.999 | 0.981 |
| llvm3.2 | R-o3 | | | | | | | 0.949 | 0.986 | 0.994 | 0.949 | 0.990 | 0.990 |
| gcc4.6.3 | D-o1 | | | | | | | | 0.953 | 0.951 | 1.000 | 0.955 | 0.951 |
| gcc4.6.3 | D-o2 | | | | | | | | | 0.983 | 0.952 | 0.997 | 0.979 |
| gcc4.6.3 | D-o3 | | | | | | | | | | 0.951 | 0.986 | 0.995 |
| gcc4.6.3 | R-o1 | | | | | | | | | | | 0.955 | 0.952 |
| gcc4.6.3 | R-o2 | | | | | | | | | | | | 0.983 |
| gcc4.6.3 | R-o3 | | | | | | | | | | | | |

Table 6. pigz: Resilient to compilers.

| Compiler | Opt-Lv | llvm3.2 D-o1 | llvm3.2 D-o2 | llvm3.2 D-o3 | llvm3.2 R-o1 | llvm3.2 R-o2 | llvm3.2 R-o3 | gcc4.6.3 D-o1 | gcc4.6.3 D-o2 | gcc4.6.3 D-o3 | gcc4.6.3 R-o1 | gcc4.6.3 R-o2 | gcc4.6.3 R-o3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llvm3.2 | D-o1 | | 0.860 | 0.812 | 0.863 | 0.798 | 0.804 | 0.898 | 0.890 | 0.856 | 0.854 | 0.861 | 0.839 |
| llvm3.2 | D-o2 | | | 0.909 | 0.856 | 0.880 | 0.918 | 0.850 | 0.849 | 0.934 | 0.796 | 0.810 | 0.845 |
| llvm3.2 | D-o3 | | | | 0.848 | 0.946 | 0.884 | 0.828 | 0.798 | 0.888 | 0.791 | 0.808 | 0.826 |
| llvm3.2 | R-o1 | | | | | 0.893 | 0.895 | 0.814 | 0.802 | 0.844 | 0.854 | 0.855 | 0.826 |
| llvm3.2 | R-o2 | | | | | | 0.938 | 0.814 | 0.794 | 0.873 | 0.825 | 0.834 | 0.862 |
| llvm3.2 | R-o3 | | | | | | | 0.811 | 0.811 | 0.894 | 0.832 | 0.834 | 0.873 |
| gcc4.6.3 | D-o1 | | | | | | | | 0.940 | 0.877 | 0.909 | 0.906 | 0.878 |
| gcc4.6.3 | D-o2 | | | | | | | | | 0.883 | 0.890 | 0.903 | 0.885 |
| gcc4.6.3 | D-o3 | | | | | | | | | | 0.825 | 0.840 | 0.886 |
| gcc4.6.3 | R-o1 | | | | | | | | | | | 0.971 | 0.915 |
| gcc4.6.3 | R-o2 | | | | | | | | | | | | 0.915 |
| gcc4.6.3 | R-o3 | | | | | | | | | | | | |

4.3.2. Resilient to obfuscators

Recompiling is the easiest way to obfuscate software, while a more sophisticated way uses advanced obfuscation techniques implemented in dedicated tools. To evaluate the resiliency in this scenario, JLex, JCup and Cal are chosen from the benchmark that DKISB used.
There are 32 versions of JLex, 34 versions of JCup and 36 versions of Cal, each generated by the obfuscation tools listed in Table 2. The similarity between each obfuscated version and its original version is calculated. In Figs 5–7, the x-axis shows the obfuscated versions of each program and the y-axis gives the similarity of each obfuscated version to the original version.

Figure 5. JLex: Resilient to obfuscators.

Figure 6. JCup: Resilient to obfuscators.

Figure 7. Cal: Resilient to obfuscators.

Figures 5–7 show that the similarities between different versions of JLex and Cal are all higher than 0.8, and most of them are higher than 0.9. For JCup, only three pairs have similarities lower than 0.8. The results illustrate that most pairs in this kind of plagiarism can be detected clearly. This indicates that the proposed WSB is also resilient, to some extent, to the advanced obfuscation techniques implemented in the tools listed in Table 2.

4.3.3. Resilient to packers

To protect source code, most software will be packed before it is released. Correspondingly, packing has also become an important form of plagiarism. Our experiments are conducted on Linux, so two commonly used Linux packers listed in Table 2 are selected. Upx is a compression packer, while MidgetPack is an encryption packer; they represent the two types of packer, respectively. Table 7 compares the resiliency of WSB and DKISB to packers. Each row corresponds to an original program (S) and each column to a packed version of it. There are three packed versions for each software.
They are the versions packed using Upx, packed using MidgetPack with the parameter curve (Mi-c) and packed using MidgetPack with the parameter password (Mi-p), respectively. Each value in the table is the similarity between the original version and the corresponding packed version. A dash (–) denotes that the similarity of the pair cannot be calculated.

Table 7. Comparing the resiliency to packers between WSB and DKISB.

| S | WSB Mi-c | WSB Mi-p | WSB Upx | DKISB Mi-c | DKISB Mi-p | DKISB Upx |
|---|---|---|---|---|---|---|
| bzip2 | 0.999 | 0.999 | 0.995 | – | – | 1.000 |
| gzip | 0.988 | 0.988 | 0.912 | – | – | 1.000 |
| pigz | 0.986 | 0.989 | 0.967 | – | – | 1.000 |
| lzip | 0.932 | 0.930 | 0.919 | – | – | 1.000 |
| feh | 0.920 | 0.927 | 0.900 | – | – | 1.000 |
| pho | 0.938 | 0.934 | 0.942 | – | – | 0.999 |
| qiv | 0.957 | 0.968 | 0.964 | – | – | 0.999 |
| sxiv | 0.940 | 0.923 | 0.874 | – | – | 0.999 |
| md5 | 0.919 | 0.882 | 0.937 | – | – | 1.000 |
| OL | 0.967 | 0.977 | 0.971 | – | – | 1.000 |
| Cal | 0.871 | 0.871 | 0.883 | – | – | 1.000 |
For WSB, all the similarities are higher than 0.8 and most of them are higher than 0.9, which indicates that WSB is resilient to both packers. For DKISB, however, the similarities for the MidgetPack versions cannot be calculated. The reason is that a version packed with MidgetPack needs a key or password supplied as external input to decrypt the original program. When DKISB is used, the DDFT module tracks this key or password and therefore obtains instructions that are not related to the original program. Furthermore, the analysis module runs out of memory because the decryption process is too complicated. As a result, the key instruction sequences cannot be obtained. In these experiments, DKISB is resilient only to Upx and does not work for MidgetPack, whereas the proposed WSB resists both Upx and MidgetPack.

4.4. Discussion of parameters N and ε

The similarity of a software pair varies slightly with the value of N used in the n-gram method, which may affect the final decision, so a suitable value of N has to be found. The choice of the threshold ε is equally important to the final decision. This section discusses both parameters and looks for their most suitable values.

To comprehensively evaluate the performance of the proposed birthmark, we adopt the extended definitions of precision, recall and F-Measure from [37], because the detection result given by our method may be inconclusive according to formula (4). They are customized as follows:

$$\mathrm{Precision} = \frac{|EP \cap JP| + |EI \cap JI|}{|JP| + |JI|} \tag{5}$$

$$\mathrm{Recall} = \frac{|EP \cap JP| + |EI \cap JI|}{|EP| + |EI|} \tag{6}$$

$$\mathrm{FM} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7}$$

where EP represents the set of comparison pairs that are actual plagiarism and JP represents the set of pairs judged as plagiarism by our method. Similarly, EI represents the set of pairs that are independent and JI represents the set of pairs judged as independent by our method.

As mentioned above, the detection result of our approach relies on both parameters N and ε. By increasing ε from 0.05 to 0.5 in increments of 0.05, we draw the precision, recall and FM curves for N varying from 2 to 8. All the test software comes from Table 1, together with their plagiarized versions. In total there are 120 pairs of independent software and 2271 pairs of plagiarism software. Figures 8–10 show the precision, recall and FM curves of WSB without considering the imbalance of the data. The most suitable values are N = 3 and ε = 0.5, for which FM reaches its maximum of 0.994. However, the numbers of independent pairs and plagiarism pairs are imbalanced. Following [38], we use an undersampling method that divides the set of plagiarism software pairs into 19 subsets. Each subset is combined with the set of independent software pairs to calculate the precision, recall and FM, and the averages of these precisions, recalls and FMs are used as the final precision, recall and FM. Figures 11–13 show the precision, recall and FM curves of WSB when the data are balanced. The most suitable values are again N = 3 and ε = 0.5, and the maximum of FM is now 0.968.

Figure 8. Precision of WSB when data are imbalanced.

Figure 9. Recall of WSB when data are imbalanced.

Figure 10. F-Measure of WSB when data are imbalanced.

Figure 11. Precision of WSB when data are balanced.

Figure 12. Recall of WSB when data are balanced.

Figure 13. F-Measure of WSB when data are balanced.

Moreover, two or more of the 19 subsets can be merged into one subset before calculating the FM with the set of independent software pairs, so that the average of the maximum FMs can be calculated for different proportions of independent software pairs to plagiarism software pairs. Figure 14 shows the relationship between the maximum FM and the proportion. The x-axis represents the proportion and the y-axis shows the value of the maximum FM. In this experiment, we choose the proportions 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9 and 1:19 and calculate the maximum FM for each. As the data become more balanced, the FM value gradually decreases.
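The extended metrics of Equations (5)–(7) and the undersampling average can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-way rule in `judge` is an assumed stand-in for formula (4), which is not reproduced in this section; the random partition in `balanced_metrics` is one possible way to form the subsets; and all function names and the toy similarity values are hypothetical.

```python
import random
from statistics import mean

PLAGIARISM, INDEPENDENT, INCONCLUSIVE = "plagiarism", "independent", "inconclusive"


def judge(similarity, eps):
    # ASSUMPTION: three-way decision standing in for formula (4),
    # which is not shown in this section.
    if similarity >= 1 - eps:
        return PLAGIARISM
    if similarity <= eps:
        return INDEPENDENT
    return INCONCLUSIVE


def extended_metrics(pairs, eps):
    """pairs: list of (similarity, expected_label). Returns (precision, recall, FM)."""
    verdicts = [(judge(sim, eps), expected) for sim, expected in pairs]
    # |EP ∩ JP| + |EI ∩ JI|: pairs whose verdict matches the expected label
    correct = sum(1 for judged, expected in verdicts if judged == expected)
    # |JP| + |JI|: pairs that received a conclusive verdict
    conclusive = sum(1 for judged, _ in verdicts if judged != INCONCLUSIVE)
    precision = correct / conclusive if conclusive else 0.0
    # |EP| + |EI|: all pairs
    recall = correct / len(pairs) if pairs else 0.0
    total = precision + recall
    fm = 2 * precision * recall / total if total else 0.0
    return precision, recall, fm


def balanced_metrics(plagiarism_pairs, independent_pairs, n_subsets, eps, seed=0):
    """Average the metrics over n_subsets undersampled plagiarism subsets."""
    rng = random.Random(seed)
    shuffled = list(plagiarism_pairs)
    rng.shuffle(shuffled)  # ASSUMPTION: subsets are formed by random partition
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    results = [extended_metrics(subset + list(independent_pairs), eps)
               for subset in subsets if subset]
    return tuple(mean(column) for column in zip(*results))


EPS_GRID = [round(0.05 * k, 2) for k in range(1, 11)]  # 0.05, 0.10, ..., 0.50


def best_eps(pairs, grid=EPS_GRID):
    """Return the threshold on the grid that maximises FM."""
    return max(grid, key=lambda eps: extended_metrics(pairs, eps)[2])


# Hypothetical toy data, not taken from the experiments.
plag = [(0.96, PLAGIARISM), (0.88, PLAGIARISM), (0.57, PLAGIARISM), (0.32, PLAGIARISM)]
indep = [(0.04, INDEPENDENT), (0.43, INDEPENDENT)]

print(extended_metrics(plag + indep, 0.2))   # precision 1.0, recall 0.5
print(balanced_metrics(plag, indep, 2, 0.5))
print(best_eps(plag + indep))
```

With ε = 0.2 in the toy data, three pairs are inconclusive, so precision is computed over the three conclusive verdicts only while recall is computed over all six pairs. This is why an inconclusive zone lowers recall without affecting precision.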
In our experiments, the maximum value of FM is 0.968 when the data are balanced. This indicates that WSB can satisfy both the credibility and the resiliency requirements of a birthmark.

Figure 14. The relationship between the maximum F-Measure and the proportion.

5. CONCLUSIONS

Although most existing dynamic birthmarks are highly resilient to obfuscation techniques, they are often not suitable for encryption packers. The decryption process also executes while the original program is running, which causes noise interference to the dynamic birthmark. Our WSB takes advantage of the different patterns of dynamic basic block replications to reduce the effect of these disturbances. It is resilient not only to obfuscation techniques but also to encryption packers. Our experiments also show that WSB is a high-quality birthmark that keeps both credibility and resiliency. It is even better than DKISB in some situations.

However, as a dynamic software birthmark, WSB still has the following limitations. First, WSB can only be used for executable files; dynamic or static link library files are not suitable. Second, the birthmark mainly considers the patterns of dynamic basic block call relationships and does not take program statements and semantics into consideration, so the method is not applicable to component plagiarism. Third, the birthmark is a fine-grained birthmark based on dynamic basic blocks, so when the software is very large, extracting the birthmark is time-consuming because a large log file has to be parsed.

Our future research will focus on the above issues. For library files, a dynamic software birthmark does not work because library files cannot run independently. Dynamic software birthmarks are also hard to apply to component plagiarism detection because it is difficult to locate the copied parts.
Static software birthmark may be more suitable for these situations and symbolic execution may be used to enhance semantic analysis. To accommodate large-scale software, we can try to find an effective coarse-grained dynamic birthmark.

FUNDING

This work was supported by the National Key Research and Development Program (2016QY06X1205 and 2016YFB0800605), the National Natural Science Foundation of China (91338107) and Technology Research and Development Program of Sichuan, China (17ZDYF2583).

ACKNOWLEDGEMENTS

We thank anonymous reviewers for their valuable suggestions and comments, which help us to improve the quality of this paper.

Footnotes

1 A binary packer for ELF binaries. https://github.com/arisada/midgetpack
2 It is a special case of pq-Gram in graph when p equals n−2 and q equals 1.
3 A second generation Java obfuscator, which offers a full spectrum of protection for your intellectual property. http://www.allatori.com.
4 DynamoRIO is a runtime code manipulation system that supports code transformations on any part of a program, while it executes. http://www.dynamorio.org/.

REFERENCES

1 BSA. Unlicensed software use still high globally despite costly cybersecurity threats. http://globalstudy.bsa.org/2016/.
2 Revulytics. Top 20 countries for software piracy and license misuse (2017). https://www.revulytics.com/blog/top-20-countries-software-piracy-2017.
3 Lancaster, T. and Culwin, F. (2004) A comparison of source code plagiarism detection engines. Comput. Sci. Educ., 14, 101–112.
4 Ceccato, M., Di Penta, M., Nagra, J., Falcarin, P., Ricca, F., Torchiano, M. and Tonella, P. (2009) The Effectiveness of Source Code Obfuscation: An Experimental Assessment. IEEE Int. Conf. Program Comprehension, Vancouver, British Columbia, Canada, 17–19 May, pp. 178–187. IEEE Computer Society, Los Alamitos, CA.
5 Collberg, C. and Thomborson, C. (1999) Software Watermarking: Models and Dynamic Embeddings. Proc. 26th ACM SIGPLAN-SIGACT Symp. Principles of Programming Languages, San Antonio, TX, USA, 20–22 January, pp. 311–324. ACM, New York, NY, USA.
6 Collberg, C., Carter, E., Debray, S., Huntwork, A., Kececioglu, J., Linn, C. and Stepp, M. (2004) Dynamic path-based software watermarking. ACM Sigplan Not., 39, 107–118.
7 NAIST-IS-TR2003014 (2003) Detecting the Theft of Programs Using Birthmarks. Springer, New York.
8 Kim, D., Han, Y., Cho, S.-j., Yoo, H., Woo, J., Nah, Y., Park, M. and Chung, L. (2013) Measuring Similarity of Windows Applications Using Static and Dynamic Birthmarks. Proc. 28th Annu. ACM Symp. Applied Computing, Coimbra, Portugal, 18–22 March, pp. 1628–1633. ACM, New York, NY, USA.
9 Roundy, K.A. and Miller, B.P. (2013) Binary-code obfuscations in prevalent packer tools. ACM Comput. Surv., 46, 1–32.
10 Ugarte-Pedrero, X., Santos, I., Sanz, B., Laorden, C. and Bringas, P.G. (2012) Countering Entropy Measure Attacks on Packed Software Detection. Consumer Communications and Networking Conference (CCNC), 2012 IEEE, Planet Hollywood, Las Vegas, NV, USA, 14–17 January, pp. 164–168. IEEE Computer Society, Los Alamitos, CA.
11 Bai, Y., Sun, X., Sun, G., Deng, X. and Zhou, X. (2008) Dynamic k-Gram based Software Birthmark. 19th Austral. Conf. Software Engineering, Perth, WA, Australia, 26–28 March, pp. 644–649. IEEE Computer Society, Los Alamitos, CA.
12 Yuanpeng, S. (2012) Software protection method based on shell technology. Softw. Eng. Appl., 01, 47–53.
13 Tian, Z., Zheng, Q., Liu, T. and Fan, M. (2013) DKISB: Dynamic Key Instruction Sequence Birthmark for Software Plagiarism Detection. IEEE Int. Conf. High Performance Computing and Communications & 2013 IEEE Int. Conf. Embedded and Ubiquitous Computing, Zhangjiajie, China, 13–15 November, pp. 619–627. IEEE Computer Society, Los Alamitos, CA.
14 Augsten, N., Böhlen, M. and Gamper, J. (2005) Approximate Matching of Hierarchical Data Using pq-Grams. Proc. 31st Int. Conf. Very Large Data Bases, Trondheim, Norway, 30 August–2 September, pp. 301–312. VLDB Endowment.
15 Grier, S. (1981) A tool that detects plagiarism in Pascal programs. ACM SIGCSE Bull., 13, 15–20.
16 Aiken, A. (1994) A system for detecting software plagiarism. Retrieved April, 1, 2010.
17 Wise, M.J. (1996) Yap3: improved detection of similarities in computer program and other texts. ACM SIGCSE Bull., 28, 130–134.
18 Prechelt, L., Malpohl, G. and Philippsen, M. (2002) Finding plagiarisms among a set of programs with jplag. J. UCS, 8, 1016.
19 Flores, E., Barrón-Cedeño, A., Rosso, P. and Moreno, L. (2012) Desocore: Detecting Source Code Re-use Across Programming Languages. Proc. 2012 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstration Session, Montreal, Canada, 3–8 June, pp. 1–4. Association for Computational Linguistics, Stroudsburg, PA, USA.
20 Flores, E., Barrón-Cedeño, A., Moreno, L. and Rosso, P. (2015) Uncovering source code reuse in large-scale academic environments. Comput. Appl. Eng. Educ., 23, 383–390.
21 Flores, E., Barrón-Cedeño, A., Moreno, L. and Rosso, P. (2015) Cross-language source code re-use detection using latent semantic analysis. J. Universal Comput. Sci., 21, 1708–1725.
22 Flores, E., Rosso, P., Moreno, L. and Villatoro, E. (2014) Pan@fire: Overview of SOCO Track on the Detection of Source Code Re-use. Notebook Papers of FIRE 2014, FIRE-2014, Bangalore, India, 5–7 December, pp. 1–11. ACM, New York, NY, USA.
23 Flores, E., Rosso, P., Villatoro, E., Moreno, L., Alcover, R. and Chirivella, V. (2015) Pan@fire: Overview of CL-SOCO Track on the Detection of Cross-Language Source Code Re-use. Notebook Papers of FIRE 2015, FIRE-2015, Gandhinagar, India, 4–6 December, pp. 1–5. CEUR-WS.org.
24 Tamada, H., Nakamura, M., Monden, A. and Matsumoto, K.-I. (2004) Design and Evaluation of Birthmarks for Detecting Theft of Java Programs. IASTED Conf. Software Engineering, Innsbruck, Austria, 17–19 February, pp. 569–574. ACTA Press, Calgary, AB, Canada.
25 Myles, G. and Collberg, C. (2005) K-Gram Based Software Birthmarks. Proc. 2005 ACM Symp. Applied Computing, Santa Fe, New Mexico, 13–17 March, pp. 314–318. ACM, New York, NY, USA.
26 Lim, H.-i., Park, H., Choi, S. and Han, T. (2009) A method for detecting the theft of java programs through analysis of the control flow information. Inf. Softw. Technol., 51, 1338–1350.
27 Luo, L., Ming, J., Wu, D., Liu, P. and Zhu, S. (2014) Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison with Applications to Software Plagiarism Detection. Proc. 22nd ACM SIGSOFT Int. Symp. Foundations of Software Engineering, Hong Kong, China, 16–22 November, pp. 389–400. ACM, New York, NY, USA.
28 Myles, G. and Collberg, C. (2004) Detecting Software Theft via Whole Program Path Birthmarks. Lecture Notes in Computer Science, 3225, 404–415.
29 Wang, X., Jhi, Y.-C., Zhu, S. and Liu, P. (2009) Behavior Based Software Theft Detection. Proc. 16th ACM Conf. Computer and Communications Security, Chicago, IL, USA, 09–13 November, pp. 280–290. ACM, New York, NY, USA.
30 Chan, P.P., Hui, L.C. and Yiu, S.-M. (2011) Dynamic Software Birthmark for Java Based on Heap Memory Analysis. IFIP Int. Conf. Communications and Multimedia Security, Ghent, Belgium, 19–21 October, pp. 94–107. Springer-Verlag, Berlin, Heidelberg.
31 Tamada, H., Okamoto, K., Nakamura, M., Monden, A. and Matsumoto, K.-i. (2004) Dynamic Software Birthmarks to Detect the Theft of Windows Applications. In Proc. Int. Symp. Future Software Technology 2004, Xi’an, China, 20–22 October, pp. 1–6. ISFST, Wuhan, China.
32 Chae, D.-K., Kim, S.-W., Cho, S.-J. and Kim, Y. (2015) Daav: Dynamic API Authority Vectors for Detecting Software Theft. Proc. 24th ACM Int. Conf. Information and Knowledge Management, Melbourne, VIC, Australia, 19–23 October, pp. 1819–1822. ACM, New York, NY, USA.
33 Wang, X., Jhi, Y.-C., Zhu, S. and Liu, P. (2009) Detecting Software Theft via System Call Based Birthmarks. Computer Security Applications Conference, 2009. ACSAC’09. Annual, Honolulu, HI, 7–11 December, pp. 149–158. IEEE Computer Society, Los Alamitos, CA.
34 Jhi, Y.-C., Wang, X., Jia, X., Zhu, S., Liu, P. and Wu, D. (2011) Value-Based Program Characterization and Its Application to Software Plagiarism Detection. 2011 33rd Int. Conf. Software Engineering (ICSE), Waikiki, Honolulu, Hawaii, 21–28 May, pp. 756–765. IEEE Computer Society, Los Alamitos, CA.
35 Zhang, F., Wu, D., Liu, P. and Zhu, S. (2014) Program Logic Based Software Plagiarism Detection. 2014 IEEE 25th Int. Symp. Software Reliability Engineering (ISSRE), Naples, Italy, 3–6 November, pp. 66–77. IEEE Computer Society, Los Alamitos, CA.
36 Lee, D., Choi, Y., Jung, J., Kim, J. and Won, D. (2015) An efficient categorization of the instructions based on binary executables for dynamic software birthmark. Int. J. Inf. Educ. Technol., 5, 571.
37 Tian, Z., Zheng, Q., Liu, T., Fan, M., Zhang, X. and Yang, Z. (2014) Plagiarism Detection for Multithreaded Software Based on Thread-Aware Software Birthmarks. Proc. 22nd Int. Conf. Program Comprehension, Hyderabad, India, 31 May–07 June, pp. 304–313. ACM, New York, NY, USA.
38 He, H. and Garcia, E.A. (2009) Learning from imbalanced data. IEEE Trans. Knowl. Data Eng., 21, 1263–1284.

Author notes

Handling editor: Albert Levi

© The British Computer Society 2018. All rights reserved.
For Permissions, please email: journals.permissions@oup.com. This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices).
The Computer Journal, Oxford University Press

A New Software Birthmark based on Weight Sequences of Dynamic Control Flow Graph for Plagiarism Detection

Publisher
Oxford University Press
Copyright
© The British Computer Society 2018. All rights reserved. For Permissions, please email: journals.permissions@oup.com
ISSN
0010-4620
eISSN
1460-2067
D.O.I.
10.1093/comjnl/bxy055

Abstract

With the large-scale development of open source software, software plagiarism has become a serious threat to the software industry and intellectual property. As the latest technique for plagiarism detection, the dynamic software birthmark has attracted much attention in recent years. Most existing dynamic birthmarks focus on resisting obfuscation techniques such as compiler optimizations and the strong obfuscations implemented in tools. However, they pay little attention to packers, especially encryption packers, which are commonly used in software protection as well as in plagiarism. When a packer is used to encrypt software, decryption code is added to the binary, and traditional dynamic birthmarks can hardly distinguish the original parts of the software from the decryption parts. In this paper, we propose a new dynamic software birthmark called the weight sequences birthmark (WSB), which is based on weight sequences of the dynamic control flow graph (DCFG). The weight sequences are used as characteristics that exploit the different patterns of dynamic basic block replications in the original code and the decryption code. Compared with the state-of-the-art dynamic key instruction sequence birthmark (DKISB), the new birthmark resists encryption packers effectively. Furthermore, WSB shows better credibility than DKISB when distinguishing independent programs. Comprehensive experiments show that the extended F-measure can reach 96.8%, indicating that WSB is a high-quality birthmark that satisfies both credibility and resiliency.

1. INTRODUCTION

Software plagiarism, also known as software piracy, is the illegal copying, distribution or use of unlicensed software. According to the Business Software Alliance’s 2016 global software survey, 39% of the software installed on PCs around the world in the past year was not properly licensed, with a commercial value of $52.2 billion [1].
According to the Revulytics software piracy statistics for 2017, China has ranked first among the top 20 countries for piracy and misuse of software licenses since 2015 [2]. Software plagiarism has become a serious threat to the software industry and intellectual property. Effective plagiarism detection methods are needed to protect software copyright and mitigate such economic losses.

Plagiarism detection methods can be divided into three categories: source code based, watermark based and birthmark based. Source code-based detection usually relies on text similarity or structural analysis and rarely involves semantic analysis [3]. Because source code is often protected by obfuscation techniques [4], this method resists obfuscation poorly. When the software is distributed in binary form and the source code is unavailable, this method does not work at all. The software watermark is the best-known and earliest approach to software protection [5]. It is implemented by embedding a unique identifier (watermark) into the protected software before it is released. The watermark is hard to remove but easy to verify, so it serves as strong evidence for plagiarism judgment. However, Collberg et al. have proven that ‘a sufficiently determined attacker will eventually be able to defeat any watermark’ [6]. A birthmark-based scheme extracts the unique characteristics of software and then measures their similarity to identify plagiarism [7]. The software birthmark is superior to the other two methods because it is extracted directly from the software without requiring source code or extra data.

Depending on the extraction scheme, birthmarks can be categorized as static or dynamic [8]. A static birthmark comes from disassembled code or source code, while a dynamic birthmark is extracted from runtime information obtained by monitoring execution.
Static birthmarks struggle with obfuscated software, whose disassembled code structure and content may have been changed [9]. Furthermore, a static birthmark is invalid when the software cannot be disassembled after being obfuscated or packed [10]. In contrast, a dynamic software birthmark bypasses the problem of disassembly and resists obfuscation effectively [11]. However, most software is protected by shell technology [12], which blends the original program into the unpacking program. When a birthmark is extracted dynamically, characteristics such as instructions, APIs and system calls that originate from the unpacking program are difficult to eliminate. Most current dynamic birthmarks ignore the problem of how to filter out these irrelevant characteristics. As far as we know, DKISB [13] is the only birthmark that uses dynamic data flow tracking (DDFT) to filter out information unrelated to the input. However, when a password or key is used for encryption, as in MidgetPack,¹ DKISB is also invalid, because the DDFT module also tracks the key or password, which does not belong to the original program input. Furthermore, the module runs out of memory during the complicated decryption process.

In this paper, to solve the above problems, we propose a new dynamic software birthmark called the weight sequences birthmark (WSB). The memory consumption of WSB is smaller than that of DKISB because it only records the branch jump addresses and the first address of each basic block. After recording, the log records are parsed offline to construct the weighted dynamic control flow graph (DCFG). The basic blocks are extracted as nodes, each marked with its first address. The call relationships between basic blocks, which come from the branch jump records, are extracted as edges, and the repetition counts of these call relationships are extracted as the edge weights.
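The log-to-graph step just described can be sketched in a few lines. This is only an illustration under stated assumptions: the trace is taken to be a plain, ordered list of basic-block start addresses for a single thread, and the function name `build_weighted_dcfg` is hypothetical. The actual log format produced by the instrumentation is not specified in this text.

```python
from collections import Counter

def build_weighted_dcfg(block_trace):
    """Build a weighted DCFG from an ordered trace of basic-block start addresses.

    Assumed input: a list such as [0x10, 0x20, ...] in execution order.
    Returns (nodes, weights), where weights maps an edge (src, dst) to the
    number of times dst was executed immediately after src.
    """
    nodes = set(block_trace)
    # Each pair of consecutive blocks in the trace is one traversal of an edge.
    weights = Counter(zip(block_trace, block_trace[1:]))
    return nodes, weights

# Hypothetical trace: blocks 0x10 and 0x20 form a loop, then control exits to 0x30.
trace = [0x10, 0x20, 0x10, 0x20, 0x10, 0x20, 0x30]
nodes, weights = build_weighted_dcfg(trace)
for (src, dst), w in sorted(weights.items()):
    print(f"{src:#x} -> {dst:#x}: weight {w}")
```

A loop shows up as an edge with a large weight, which is the kind of repetition pattern the birthmark relies on.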
To avoid comparing graphs directly, which may run into the subgraph isomorphism problem, graph characteristics [14] are extracted before the similarity is measured. We traverse the graph and obtain an n-gram² set of weight sequences as the characteristics of the software. The n-gram set is converted into a birthmark by merging numerically similar sequences into one item, which serves as one dimension of the characteristic vector. The value of each dimension represents the frequency of that kind of sequence. Finally, the extended cosine of two vectors is calculated as the similarity of two birthmarks, and plagiarism is judged from this similarity and a pre-defined threshold.

The contributions of this paper are summarized as follows:

- Weight sequences of the DCFG, a novel representation of software behaviors, are extracted to construct a new dynamic software birthmark.
- By utilizing the different patterns of dynamic basic block replications, the new birthmark remains effective when the program is encrypted by a packer.
- Compared with DKISB, the new birthmark shows better credibility when distinguishing independent programs. Its resiliency and credibility, evaluated in extensive experiments, show that it is a high-quality birthmark.

The rest of this paper is organized as follows. Section 2 reviews related work in the field of software plagiarism detection. Section 3 first presents an overview of the framework and then introduces the main points of WSB, including its definition, the construction of the weighted DCFG, the extraction process of WSB, the extended cosine similarity algorithm and the final decision procedure. The performance of WSB is evaluated and compared with DKISB through extensive experiments in Section 4. Finally, conclusions are drawn in Section 5.

2. RELATED WORK

There are various studies of software plagiarism detection.
The earliest studies focused on source code, which was treated as text with a specific grammatical structure. For instance, Grier [15] counted the total numbers of variables, input statements, conditional statements, loop statements, assignment statements and calls as attributes for plagiarism detection in Pascal programs. Another early approach, the software watermark, was proposed by Collberg and Thomborson [5]. In this method, the watermark was embedded into the executable before its release and could later be extracted from executable files to identify whether two programs were homologous. Methods such as MOSS [16] and YAP3 [17] were based on token sequences or syntactic trees, and JPLAG [18] constructed the program dependency graph (PDG) from source code.

Flores et al. [19] recently proposed a series of methods for source code re-use detection. DeSoCoRe was a source code re-use detection tool based on Natural Language Processing techniques that could provide output understandable to a human reviewer. In [20], the authors presented an approach based on comparing programs at the character level, which could find potential cases of re-use across a huge number of assignments. They showed that this approach was better than JPLAG. They also compared a Latent Semantic Analysis (LSA) approach [21] with previously used text re-use detection models for measuring cross-language similarity in source code. The experiments showed that the LSA-based approach was slightly better than the other models. Moreover, shared tasks on plagiarism detection in source code were recently organized [22, 23].

The concept of the software birthmark was first put forward by Tamada et al. [7, 24]. Their birthmark was composed of four individual birthmarks: constant values in field variables (CVFV), sequence of method calls (SMC), inheritance structure (IS) and used classes (UC). However, it was also based on source code.
To overcome the fundamental limitation of depending on extra data such as source code, many software birthmarks based on binary code were proposed. All birthmarks can be divided into two main categories: static birthmarks and dynamic birthmarks.

A static birthmark comes from disassembled code or source code. Myles and Collberg [25] proposed a k-gram-based static birthmark for Java. It was extracted by sliding a window of length k over the static instruction sequences. Compared with Tamada’s birthmark, their birthmark showed better robustness and did not need source code. Another birthmark, based on control flow information, was proposed by Lim et al. [26]. A set of behaviors of flow paths was used as the birthmark. This birthmark was effective even when the software was aggressively modified. However, it was still susceptible to simple code obfuscation techniques such as opaque branch implantation, basic block splitting and garbage instruction implantation. Cop [27] was a binary-oriented, obfuscation-resilient prototype whose birthmark was based on the longest common subsequence of semantically equivalent basic blocks, which were modeled by a set of symbolic formulas representing the input–output relations of each block. It had stronger resiliency to code obfuscation techniques as well as other semantics-preserving transformations. However, the huge overhead introduced by symbolic execution and constraint solving made it difficult for Cop to handle large-scale software.

A dynamic birthmark is extracted during execution. Compared with a static birthmark, it reflects the actual behaviors of the program and is therefore more resilient to code obfuscation. Dynamic birthmarks can be divided into three categories: graph-based, sequence-based and set-based.

Graph-based: The concept of the dynamic software birthmark was also proposed by Myles and Collberg [28].
They presented a graph-based birthmark, whole program path (WPP), in the form of a directed acyclic graph that uniquely identifies a program based on a complete control flow trace of its execution. They verified that it could be used to identify program theft even when an embedded watermark was destroyed by program transformation. A similar graph-based software birthmark called system call dependence graph (SCDG) was proposed by Wang et al. [29]. The SCDG was defined by using system calls as vertices and the dependences between system calls as edges. They demonstrated that SCDG is capable of detecting component theft, where only partial code is stolen, and that it is resilient to various evasion techniques. However, birthmarks generated from control flow traces or system calls were still vulnerable to obfuscation attacks such as loop transformation and system call replacement. Another graph-based dynamic birthmark was proposed by Chan et al. [30]. They made use of the heap memory at runtime to construct an object reference graph (ORG) as the birthmark. They verified that their birthmark remained intact even after the tested software was obfuscated by the state-of-the-art Allatori obfuscator.³ However, graph similarity was usually based on a subgraph isomorphism algorithm, which is an NP-complete problem.

Sequence-based: In [31], Tamada et al. proposed a method for collecting the history of API function calls during execution. They extracted two types of birthmarks from the call log: EXESEQ (sequence of API function calls) and EXEFREQ (frequency of API function calls). Their analysis showed that these birthmarks have good tolerance against various kinds of program transformation attacks. DAAV [32] was also a dynamic birthmark based on APIs. The dynamic API call graph (DACG) was first constructed, and the birthmark was then extracted from the DACG by converting it into a vector with a random walk.
In [33], two system call-based software birthmarks, System Call Short Sequence Birthmark (SCSSB) and Input Dependant System Call Subsequence Birthmark (IDSCSB), were proposed by Wang et al. The SCSSB was extracted by removing commonly found short sequences from the system call short sequences. The IDSCSB was obtained from two system call sequences, produced with two different inputs, by excluding the part that appears in both sequences. SCSSB and IDSCSB were suitable for detecting partial code theft. In addition to APIs and system calls, the values that appear during execution could also be used to build birthmarks. In [34], Jhi et al. proposed an approach based on core values and implemented a prototype with a dynamic taint analyzer atop a generic processor emulator. They defined a core value as the output of a value-updating instruction that is closely related to program semantics. Their experiments showed that this birthmark was resilient to various control and data obfuscation techniques. LoPD [35] was another kind of sequence-based birthmark, which used strict formal logic analysis to capture program semantics. Although LoPD could effectively resist various semantics-preserving code obfuscations owing to strict semantic equivalence, it was constrained by the weaknesses of symbolic execution and constraint solving.

Set-based: In [11], a dynamic k-gram-based birthmark was proposed. It used the k-gram set of instruction sequences as the unique characteristics. This birthmark was shown to be more resilient to semantics-preserving transformations than the static k-gram birthmark. However, as the scale of software increases, the set of dynamic instruction sequences becomes larger and more difficult to handle. To alleviate this problem, Lee et al. [36] proposed a dynamic birthmark based on a categorization of instructions. It reduced the volume of the characteristics set without damaging the meaning of the sequences.
During execution, however, commonly used instructions accounted for the majority, which made it difficult to distinguish independent software. Besides, irrelevant instructions were introduced when the software was packed for self-protection. To find the sets of instruction sequences that were closely related only to the program function, Tian et al. [13] proposed DKISB, which could filter out the unrelated instructions. It used DDFT and recorded only key instructions that were related to the program input. This birthmark was resilient to various kinds of semantics-preserving code obfuscation techniques. However, when dealing with software packed by MidgetPack, it ran out of memory and was unable to extract the correct instructions.

3. METHOD

In this section, an overview of the software plagiarism detection framework based on WSB is first introduced along with some key definitions. The concepts of software birthmark and dynamic software birthmark are then given, followed by a detailed description of each step of WSB.

3.1. Overview

The framework consists of three main steps, as shown in Fig. 1. In the first step, DynamoRIO⁴ is used to record the first address of each basic block as well as the branch jumps, and it outputs a log trace. In the second step, we construct the weighted DCFG from the log. A node n in the graph is a dynamic basic block, and an edge n0→n1 indicates that node n1 is executed immediately after node n0. The weight of an edge represents its repetitions, that is, the number of times the edge appears in the log. The weight sequences are then extracted by traversing the graph and converted into the birthmark. These weight sequences can be considered as a set of n-grams denoted as ⟨W⃗,V⟩, where W⃗ = w⃗(w0, w1, …, wn) is the weight sequence and V is the accumulated frequency of the n-gram items with the same W⃗.
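The ⟨W⃗,V⟩ representation can be illustrated with a small sketch. It assumes a single weight sequence has already been obtained from one traversal of the graph; the function name `weight_ngrams` and the sample weights are hypothetical and chosen only for illustration.

```python
from collections import Counter

def weight_ngrams(weight_sequence, n):
    """Count every length-n window of a weight sequence.

    The result maps each n-gram W (a tuple of edge weights) to V, the number
    of times that window occurs in the sequence.
    """
    windows = (
        tuple(weight_sequence[i:i + n])
        for i in range(len(weight_sequence) - n + 1)
    )
    return Counter(windows)

# Hypothetical weights collected along one traversal of a weighted DCFG.
sequence = [3, 1, 3, 1, 3, 2]
grams = weight_ngrams(sequence, 2)
for w, v in sorted(grams.items()):
    print(w, v)
```

A sequence shorter than n yields an empty set, so very short traversals contribute nothing under this sketch.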
To simplify the representation, W⃗ can be converted into a polar coordinate K=(r,α) according to the following formula:

\[
\begin{cases}
r = |\vec{w}|, \\[4pt]
\cos(\alpha) = \dfrac{\vec{w}\cdot\vec{e}}{|\vec{w}|\,|\vec{e}|},
\end{cases}
\tag{1}
\]

where e⃗ is the unit vector. The Weighted DCFG and the Weight Sequences Birthmark, which are the key definitions in our method, are defined in this step.

Definition 3.1 (Weighted DCFG) The Weighted Dynamic Control Flow Graph of a program is based on the granularity of dynamic basic blocks and is a 3-tuple directed graph WDCFG=(N,E,W), where N, E and W satisfy the following conditions:

- N is a set of nodes, where a node n∈N represents a dynamic basic block.
- E is a set of edges, where an edge e(n0→n1)∈E indicates that n1 is executed immediately after n0 is executed.
- W is a set of weights, where a weight w∈W corresponds to one edge and represents the repetitions of this edge during the program execution.

Definition 3.2 (Weight Sequences Birthmark) The WSB is a dynamic software birthmark based on weight sequences of the graph defined in Definition 3.1. It is a set denoted as ⟨K,V⟩={⟨k0,v0⟩,⟨k1,v1⟩,…,⟨kn,vn⟩}, where V is the accumulated frequency of the n-gram items with the same K, and K is a polar coordinate point given by the Euclidean length of the item and its angle to the unit vector.

Figure 1. Software Plagiarism Detection Framework based on WSB.

In the third step, the similarity of birthmarks is measured with the extended cosine algorithm. According to the similarity value and a threshold ε, a decision is made on whether plagiarism exists.

3.2. Software birthmark

According to [7], the definition of software birthmark is:

Definition 3.3 (Software Birthmark) Let p, q be programs and ≡cp be a given copy relation. Let f(p) be a set of characteristics extracted from p by a certain method f. Then f(p) is called a birthmark of p under ≡cp iff both of the following conditions are satisfied.
Condition 1: f(p) is obtained only from p itself (without any extra information).
Condition 2: p ≡cp q ⇒ f(p)=f(q).

Condition 1 means that the birthmark is an inherent characteristic of the program and is not extra information. Regardless of how the birthmark is extracted, the input of the method f is only the program itself. Hence, it differs from a watermark, which requires extra data. Condition 2 states that the same birthmark should be obtained from copied programs. By contraposition, if the birthmarks f(p) and f(q) are different, then p ≡cp q does not hold, indicating that q is not a copy of p. Condition 2 reflects the property that a birthmark should hold up under all plagiarism scenarios. In practice, it is hard to find a birthmark strong enough to satisfy this property perfectly. On the other hand, if two programs are developed independently, their birthmarks should be different. A birthmark should not only be able to identify plagiarism but also be able to distinguish different software.

The concept of the dynamic software birthmark proposed by Myles and Collberg [28] is defined as follows:

Definition 3.4 (Dynamic Software Birthmark) Let p, q be two programs and i be an input to these programs. Let f be a method for extracting a set of characteristics from a program. Then f(p,i) is a dynamic birthmark of p iff:

Condition 1: f(p,i) is obtained only from p itself by executing p with the given input i.
Condition 2: q is a copy of p ⇒ f(p,i)=f(q,i).

Compared with the definition of software birthmark, the dynamic birthmark is closely related to the runtime environment.

3.3. The Proposed WSB

3.3.1. Dynamic instrumentation

DynamoRIO is used to generate the log trace. It allows arbitrary modifications to application instructions via a powerful library. Using this tool, we implement a plug-in program that records the first address of each basic block and the branch jump that occurs at the end of the basic block when a branch instruction such as jne or je is encountered.
With a specific input, the test program executes on DynamoRIO along with the plug-in program. For each thread, a separate log file records its branch jumps. Therefore, a multithreaded program generates several logs.

3.3.2. Weighted DCFG construction

After the log is generated, it is parsed to construct the Weighted DCFG. Figure 2 is a simplified schematic diagram of a Weighted DCFG. The nodes A∼H are basic blocks recorded during execution. The arrow from A to B is an edge of the Weighted DCFG, and w1 is the weight of this edge. There are eight nodes, nine edges and nine weights in total in the diagram. Although no cycle appears in this diagram, cycles do exist in practice because repetitions of dynamic basic blocks are very common.

Figure 2. Schematic Diagram of Weighted DCFG.

Nodes and edges are extracted from the log, and the weight of each edge is counted. As shown in Fig. 3, for a multithreaded program each log is parsed separately and the results are then aggregated. After extraction, the intermediate results are output as XML text, in which the Weighted DCFG is represented by an adjacency graph.

Figure 3. The WDCFG Construction.

3.3.3. WSB extraction

Although the graph itself could be considered a birthmark, subgraph isomorphism is an NP-complete problem, and a Weighted DCFG generally has thousands of nodes. Therefore, a graph-based similarity algorithm is not applicable to the Weighted DCFG. To achieve a specific function, however, some basic operations implemented by basic blocks cannot be arbitrarily dropped. Although additional operations may be added to obfuscate the original ones, the obfuscation operations make up only a small part of the total, and the majority are still basic operations.
The patterns of the two kinds of operations are different. To take advantage of these patterns, we focus on the weights of the Weighted DCFG and extract weight sequences to indirectly represent the characteristics of the software. To extract the weight sequences, a depth-first search is used to traverse the Weighted DCFG. An n-gram set is then generated by sliding a window of constant length N from the start of each sequence to its end [11]. Figure 4 shows the process of extracting the n-gram items from the schematic diagram in Fig. 2. In this case, N equals 4. The results are: (w1,w2,w3,w4), (w2,w3,w4,w5), (w3,w4,w5,w7), (w4,w5,w7,w9), (w2,w3,w4,w6), (w3,w4,w6,w9) and (w1,w8,w7,w9).

Figure 4. The Process of Extracting N-gram Items.

To improve the resiliency of the birthmark and to simplify the representation, the key of each item is expressed in polar coordinates and the items with the same key are merged. Finally, the merged set becomes the WSB.

3.3.4. Similarity measurement

In the literature on birthmark-based plagiarism detection, the decision is made according to the similarity and a threshold ε. The range of ε varies across birthmarks. To measure the similarity of WSBs, each WSB is treated as a vector with each key as a dimension. The extended cosine value is then calculated to reflect the similarity of two WSBs. It is defined as follows:

Definition 3.5 (The Extended Cosine Similarity of WSB) For two WSBs A={⟨k0,v0⟩,⟨k1,v1⟩,…,⟨kn,vn⟩} and B={⟨k0′,v0′⟩,⟨k1′,v1′⟩,…,⟨km′,vm′⟩}, let S=KeySet(A)∪KeySet(B). Then the vector a⃗=(a0,a1,…,ar) is constructed according to the following rule:

\[
a_i =
\begin{cases}
v_i, & s_i \in \mathrm{KeySet}(A), \\
0, & s_i \notin \mathrm{KeySet}(A),
\end{cases}
\tag{2}
\]

where s_i is the i-th key in S, 0≤i≤r, and v_i is the value of key s_i in A. Likewise, the vector b⃗=(b0,b1,…,br) can be constructed.
Then the extended cosine similarity of a⃗ and b⃗ can be calculated with the following formula:

\[
\mathrm{Ex\text{-}Cosine}(\vec{a},\vec{b}) = \frac{\vec{a}\cdot\vec{b}}{|\vec{a}|\,|\vec{b}|}\times\theta,
\qquad
\theta = \frac{\min\{|\vec{a}|,|\vec{b}|\}}{\max\{|\vec{a}|,|\vec{b}|\}}
\tag{3}
\]

The result of Ex-Cosine(a⃗,b⃗) is the similarity of the two WSBs.

3.3.5. Making decision

The primary purpose of extracting birthmarks and measuring their similarity is to judge plagiarism. To exclude the influence of random factors as much as possible, detecting a pair of programs requires multiple experiments conducted with different inputs. The average similarity of these experiments is compared with a pre-defined threshold ε to make the final decision according to the following rule:

\[
\mathrm{AvgSim}(P_A,P_B) =
\begin{cases}
\ge 1-\varepsilon, & \text{copied}, \\
\le \varepsilon, & \text{independent}, \\
\text{otherwise}, & \text{inconclusive}.
\end{cases}
\tag{4}
\]

When the average similarity is at least 1−ε, the pair is classified as ‘copied’. When it is at most ε, the pair is classified as ‘independent’. Otherwise, the pair is inconclusive. With ε in the range 0.05∼0.50 and N from 2 to 8, the extended precision, recall and F-Measure defined in [37] are tested to comprehensively evaluate the proposed birthmark in our experiments. When ε equals 0.5 and N equals 3, the F-Measure reaches its maximum. More detailed descriptions of the experiments are given in the next section.

4. EVALUATION AND EXPERIMENT

To evaluate the performance of the proposed birthmark, extensive experiments have been conducted to verify its credibility and resiliency. In this section, we first introduce the evaluation criteria of software birthmarks and the dataset used in our experiments. Experiments are then designed from two aspects to illustrate the advantages of the proposed birthmark. Finally, we discuss the parameters N and ε in our experiments.

4.1. Evaluation criteria and experiment dataset

A high-quality birthmark should not only be able to identify plagiarism but also be able to distinguish independent software.
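Since the evaluation rests on the similarity of Eq. (3) and the decision rule of Eq. (4), a minimal sketch of both may be useful here. It assumes each birthmark is stored as a dictionary from key to frequency; the function names `ex_cosine` and `decide`, the sample keys and the treatment of an empty birthmark are illustrative assumptions, not part of the original method description.

```python
import math

def ex_cosine(a, b):
    """Extended cosine similarity (Eq. 3) of two birthmarks given as {key: frequency}."""
    keys = list(set(a) | set(b))           # S = KeySet(A) ∪ KeySet(B)
    va = [a.get(k, 0) for k in keys]       # Eq. (2): missing keys contribute 0
    vb = [b.get(k, 0) for k in keys]
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(x * x for x in vb))
    if norm_a == 0 or norm_b == 0:
        return 0.0                         # assumption: an empty birthmark matches nothing
    cosine = sum(x * y for x, y in zip(va, vb)) / (norm_a * norm_b)
    theta = min(norm_a, norm_b) / max(norm_a, norm_b)
    return cosine * theta

def decide(similarities, eps):
    """Apply the rule of Eq. (4) to similarities measured with several inputs."""
    avg = sum(similarities) / len(similarities)
    if avg >= 1 - eps:
        return "copied"
    if avg <= eps:
        return "independent"
    return "inconclusive"

# Hypothetical birthmarks keyed by polar points (r, alpha).
wsb_a = {(5.0, 0.3): 4, (2.0, 0.1): 3}
wsb_b = {(5.0, 0.3): 8, (2.0, 0.1): 6}     # same direction, twice the magnitude
print(ex_cosine(wsb_a, wsb_a), ex_cosine(wsb_a, wsb_b))
print(decide([0.9, 0.8], eps=0.25))
```

The factor θ is what separates this measure from the plain cosine: two vectors pointing in the same direction still score below 1 when their magnitudes differ.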
In the literature [25], it is expressed as the following properties: Resiliency Let p be a program and p′ be a derivative version generated by applying semantic-preserving code transformations τ to p. Then we say the birthmark Bp is resilient to τ if sim(p,p′)≥1−ε. Credibility Let p and q be independently developed programs which may accomplish the same task. Then we say the birthmark Bp is credible if sim(p,q)≤ε. According to the criteria, extensive experiments are conducted to prove the performance of WSB. Most of the test software come from the DKISB [13] used. In addition to their obfuscated versions, we also test the versions that are packed by MidgetPack. Table 1 shows the dataset of our experiments without their plagiarism versions. Table 2 is the obfuscation strategies and tools used in experiments. Table 1. Dataset of test software. Category  Software  Version  Com&Dec  bzip2  1.0.6  zip  3.0  rar  5.0  lzip  1.13-rc2  tar  1.27  pigz  2.3  Image Process  qiv  2.2.4  feh  2.2  pho  0.9.8  sxiv  1  Enc&Dec  md5  8.13  OL  1.0.1  Java Program  JLex  1.4  JCup  0.1  Cal  0.1  Category  Software  Version  Com&Dec  bzip2  1.0.6  zip  3.0  rar  5.0  lzip  1.13-rc2  tar  1.27  pigz  2.3  Image Process  qiv  2.2.4  feh  2.2  pho  0.9.8  sxiv  1  Enc&Dec  md5  8.13  OL  1.0.1  Java Program  JLex  1.4  JCup  0.1  Cal  0.1  View Large Table 1. Dataset of test software. Category  Software  Version  Com&Dec  bzip2  1.0.6  zip  3.0  rar  5.0  lzip  1.13-rc2  tar  1.27  pigz  2.3  Image Process  qiv  2.2.4  feh  2.2  pho  0.9.8  sxiv  1  Enc&Dec  md5  8.13  OL  1.0.1  Java Program  JLex  1.4  JCup  0.1  Cal  0.1  Category  Software  Version  Com&Dec  bzip2  1.0.6  zip  3.0  rar  5.0  lzip  1.13-rc2  tar  1.27  pigz  2.3  Image Process  qiv  2.2.4  feh  2.2  pho  0.9.8  sxiv  1  Enc&Dec  md5  8.13  OL  1.0.1  Java Program  JLex  1.4  JCup  0.1  Cal  0.1  View Large Table 2. Obfuscation strategies and tools. 
Strategies  Tools  Compiler  llvm3.2 and gcc4.6.3  Obfuscator  Allatori,JShrink,ProGuard,KlassMaster  Packer  Upx,MidgetPack  Strategies  Tools  Compiler  llvm3.2 and gcc4.6.3  Obfuscator  Allatori,JShrink,ProGuard,KlassMaster  Packer  Upx,MidgetPack  View Large Table 2. Obfuscation strategies and tools. Strategies  Tools  Compiler  llvm3.2 and gcc4.6.3  Obfuscator  Allatori,JShrink,ProGuard,KlassMaster  Packer  Upx,MidgetPack  Strategies  Tools  Compiler  llvm3.2 and gcc4.6.3  Obfuscator  Allatori,JShrink,ProGuard,KlassMaster  Packer  Upx,MidgetPack  View Large 4.2. Credibility To evaluate the credibility of WSB, the similarities of all pairs in Table 1 are measured. There are in total 16 independent programs which can be divided into four categories: compression&decompression (Com&Dec), image process, encryption&decryption (Enc&Dec) and Java program. In Table 3, 14 of them are chosen to show the similarities of independent programs. It is observed that most of similarities are less than 0.5 except for several counterexamples. The low similarities prove that the proposed WSB is credible. The maximum similarity is 0.7 in this table. It is because that pho and qiv both belong to image process software and they share many image processing libraries. To compare with DKISB, the similarities are also calculated using DKISB in the same condition. And the results are illustrated in Table 4. The similarities of image processing software in DKISB are higher than that in WSB. Moreover, the similarity of Cal and JLex in DKISB is up to 0.998. It is very close to 1 which can make the decision exactly the opposite. However, it is only 0.547 in WSB, indicating that the performance on credibility of WSB is the better. Table 3. Credibility of WSB. 
| Software | gzip | bzip2 | zip | rar | lzip | feh | qiv | sxiv | pho | md5 | OL | Cal | JCup | JLex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gzip |  | 0.055 | 0.475 | 0.214 | 0.180 | 0.164 | 0.063 | 0.157 | 0.047 | 0.517 | 0.275 | 0.185 | 0.078 | 0.113 |
| bzip2 |  |  | 0.069 | 0.114 | 0.116 | 0.111 | 0.184 | 0.177 | 0.125 | 0.041 | 0.087 | 0.190 | 0.316 | 0.319 |
| zip |  |  |  | 0.341 | 0.287 | 0.251 | 0.095 | 0.240 | 0.081 | 0.312 | 0.376 | 0.282 | 0.117 | 0.176 |
| rar |  |  |  |  | 0.372 | 0.346 | 0.149 | 0.324 | 0.121 | 0.206 | 0.413 | 0.363 | 0.173 | 0.258 |
| lzip |  |  |  |  |  | 0.177 | 0.080 | 0.187 | 0.069 | 0.141 | 0.319 | 0.255 | 0.121 | 0.210 |
| feh |  |  |  |  |  |  | 0.258 | 0.555 | 0.174 | 0.162 | 0.355 | 0.565 | 0.254 | 0.383 |
| qiv |  |  |  |  |  |  |  | 0.327 | 0.700 | 0.062 | 0.150 | 0.253 | 0.570 | 0.361 |
| sxiv |  |  |  |  |  |  |  |  | 0.277 | 0.143 | 0.363 | 0.591 | 0.372 | 0.530 |
| pho |  |  |  |  |  |  |  |  |  | 0.046 | 0.142 | 0.226 | 0.508 | 0.345 |
| md5 |  |  |  |  |  |  |  |  |  |  | 0.235 | 0.171 | 0.069 | 0.102 |
| OL |  |  |  |  |  |  |  |  |  |  |  | 0.369 | 0.185 | 0.261 |
| Cal |  |  |  |  |  |  |  |  |  |  |  |  | 0.365 | 0.574 |
| JCup |  |  |  |  |  |  |  |  |  |  |  |  |  | 0.528 |
| JLex |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Table 4. Credibility of DKISB.

| Software | gzip | bzip2 | zip | rar | lzip | feh | qiv | sxiv | pho | md5 | OL | Cal | JCup | JLex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gzip |  | 0.000 | 0.789 | 0.076 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| bzip2 |  |  | 0.000 | 0.175 | 0.025 | 0.066 | 0.059 | 0.279 | 0.024 | 0.000 | 0.000 | 0.000 | 0.160 | 0.000 |
| zip |  |  |  | 0.121 | 0.000 | 0.009 | 0.004 | 0.011 | 0.002 | 0.000 | 0.000 | 0.000 | 0.002 | 0.000 |
| rar |  |  |  |  | 0.000 | 0.004 | 0.024 | 0.006 | 0.007 | 0.000 | 0.000 | 0.006 | 0.002 | 0.006 |
| lzip |  |  |  |  |  | 0.155 | 0.065 | 0.080 | 0.073 | 0.000 | 0.000 | 0.000 | 0.014 | 0.000 |
| feh |  |  |  |  |  |  | 0.497 | 0.439 | 0.493 | 0.001 | 0.001 | 0.051 | 0.050 | 0.051 |
| qiv |  |  |  |  |  |  |  | 0.717 | 0.892 | 0.053 | 0.024 | 0.816 | 0.035 | 0.813 |
| sxiv |  |  |  |  |  |  |  |  | 0.703 | 0.018 | 0.013 | 0.628 | 0.175 | 0.626 |
| pho |  |  |  |  |  |  |  |  |  | 0.023 | 0.017 | 0.810 | 0.041 | 0.808 |
| md5 |  |  |  |  |  |  |  |  |  |  | 0.246 | 0.028 | 0.000 | 0.027 |
| OL |  |  |  |  |  |  |  |  |  |  |  | 0.021 | 0.000 | 0.020 |
| Cal |  |  |  |  |  |  |  |  |  |  |  |  | 0.057 | 0.998 |
| JCup |  |  |  |  |  |  |  |  |  |  |  |  |  | 0.065 |
| JLex |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

4.3. Resiliency

The resiliency of WSB is evaluated from three aspects, divided according to obfuscation strategy: resilience to compilers, to obfuscators and to packers.

4.3.1. Resilient to compilers

For the same source code, different compilers and compiler optimization levels (Opt-Lv) generate different executable files, and many plagiarism cases take this simple form. Bzip2 and pigz, two Com&Dec programs, are chosen to simulate this plagiarism scenario. Each has 12 versions compiled with two compilers (llvm3.2 and gcc4.6.3) and multiple optimization parameters (debug (D), release (R) and o1∼o3). Table 5 shows the similarities between the bzip2 versions. All the similarities are higher than 0.9 and some of them even equal 1, indicating that WSB is resilient to different compilers and optimization parameters for bzip2. Table 6 shows the similarities between the versions of pigz, which is a multithreaded program. Although most of the similarities are not higher than 0.9, the lowest of them stay around 0.8. All the pairs in this scenario can be judged as plagiarism with an appropriate threshold ε. This indicates that WSB is resilient to different compilers and optimization parameters.

Table 5. bzip2: Resilient to compilers.
| Compiler | Opt-Lv | llvm3.2 D-o1 | llvm3.2 D-o2 | llvm3.2 D-o3 | llvm3.2 R-o1 | llvm3.2 R-o2 | llvm3.2 R-o3 | gcc4.6.3 D-o1 | gcc4.6.3 D-o2 | gcc4.6.3 D-o3 | gcc4.6.3 R-o1 | gcc4.6.3 R-o2 | gcc4.6.3 R-o3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llvm3.2 | D-o1 |  | 0.955 | 0.952 | 0.998 | 0.955 | 0.951 | 0.997 | 0.953 | 0.952 | 0.998 | 0.956 | 0.953 |
| llvm3.2 | D-o2 |  |  | 0.988 | 0.954 | 1.000 | 0.989 | 0.954 | 0.998 | 0.984 | 0.954 | 0.999 | 0.981 |
| llvm3.2 | D-o3 |  |  |  | 0.951 | 0.988 | 0.999 | 0.951 | 0.986 | 0.995 | 0.951 | 0.989 | 0.991 |
| llvm3.2 | R-o1 |  |  |  |  | 0.954 | 0.949 | 0.999 | 0.952 | 0.951 | 1.000 | 0.955 | 0.952 |
| llvm3.2 | R-o2 |  |  |  |  |  | 0.988 | 0.954 | 0.998 | 0.984 | 0.954 | 0.999 | 0.981 |
| llvm3.2 | R-o3 |  |  |  |  |  |  | 0.949 | 0.986 | 0.994 | 0.949 | 0.990 | 0.990 |
| gcc4.6.3 | D-o1 |  |  |  |  |  |  |  | 0.953 | 0.951 | 1.000 | 0.955 | 0.951 |
| gcc4.6.3 | D-o2 |  |  |  |  |  |  |  |  | 0.983 | 0.952 | 0.997 | 0.979 |
| gcc4.6.3 | D-o3 |  |  |  |  |  |  |  |  |  | 0.951 | 0.986 | 0.995 |
| gcc4.6.3 | R-o1 |  |  |  |  |  |  |  |  |  |  | 0.955 | 0.952 |
| gcc4.6.3 | R-o2 |  |  |  |  |  |  |  |  |  |  |  | 0.983 |
| gcc4.6.3 | R-o3 |  |  |  |  |  |  |  |  |  |  |  |  |

Table 6. pigz: Resilient to compilers.
| Compiler | Opt-Lv | llvm3.2 D-o1 | llvm3.2 D-o2 | llvm3.2 D-o3 | llvm3.2 R-o1 | llvm3.2 R-o2 | llvm3.2 R-o3 | gcc4.6.3 D-o1 | gcc4.6.3 D-o2 | gcc4.6.3 D-o3 | gcc4.6.3 R-o1 | gcc4.6.3 R-o2 | gcc4.6.3 R-o3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| llvm3.2 | D-o1 |  | 0.860 | 0.812 | 0.863 | 0.798 | 0.804 | 0.898 | 0.890 | 0.856 | 0.854 | 0.861 | 0.839 |
| llvm3.2 | D-o2 |  |  | 0.909 | 0.856 | 0.880 | 0.918 | 0.850 | 0.849 | 0.934 | 0.796 | 0.810 | 0.845 |
| llvm3.2 | D-o3 |  |  |  | 0.848 | 0.946 | 0.884 | 0.828 | 0.798 | 0.888 | 0.791 | 0.808 | 0.826 |
| llvm3.2 | R-o1 |  |  |  |  | 0.893 | 0.895 | 0.814 | 0.802 | 0.844 | 0.854 | 0.855 | 0.826 |
| llvm3.2 | R-o2 |  |  |  |  |  | 0.938 | 0.814 | 0.794 | 0.873 | 0.825 | 0.834 | 0.862 |
| llvm3.2 | R-o3 |  |  |  |  |  |  | 0.811 | 0.811 | 0.894 | 0.832 | 0.834 | 0.873 |
| gcc4.6.3 | D-o1 |  |  |  |  |  |  |  | 0.940 | 0.877 | 0.909 | 0.906 | 0.878 |
| gcc4.6.3 | D-o2 |  |  |  |  |  |  |  |  | 0.883 | 0.890 | 0.903 | 0.885 |
| gcc4.6.3 | D-o3 |  |  |  |  |  |  |  |  |  | 0.825 | 0.840 | 0.886 |
| gcc4.6.3 | R-o1 |  |  |  |  |  |  |  |  |  |  | 0.971 | 0.915 |
| gcc4.6.3 | R-o2 |  |  |  |  |  |  |  |  |  |  |  | 0.915 |
| gcc4.6.3 | R-o3 |  |  |  |  |  |  |  |  |  |  |  |  |

4.3.2. Resilient to obfuscators

Recompiling with a different compiler is the easiest way to obfuscate software, while a more sophisticated way is to apply advanced obfuscation techniques implemented in dedicated tools. To evaluate the resiliency in this scenario, JLex, JCup and Cal are chosen from the benchmark that DKISB used.
There are 32 versions of JLex, 34 versions of JCup and 36 versions of Cal, each generated by the obfuscation tools listed in Table 2. The similarity between each obfuscated version and its original version is calculated. In Figs 5–7, the x-axis shows the obfuscated versions of each program and the y-axis gives the similarity of each obfuscated version to the original version.

Figure 5. JLex: Resilient to obfuscators.

Figure 6. JCup: Resilient to obfuscators.

Figure 7. Cal: Resilient to obfuscators.

Figures 5–7 show that the similarities between the different versions of JLex and Cal are all higher than 0.8, and most of them are higher than 0.9. For JCup, only three pairs have similarities lower than 0.8. The results illustrate that most pairs in this kind of plagiarism can be detected clearly. This indicates that the proposed WSB is also resilient, to some extent, to the advanced obfuscation techniques implemented in the tools listed in Table 2.

4.3.3. Resilient to packers

To protect source code, much software is packed before it is released. Correspondingly, packing has also become an important form of plagiarism. Our experiments are conducted on Linux, and two commonly used Linux packers listed in Table 2 are selected. Upx is a compression packer, while MidgetPack is an encryption packer. They represent the two types of packer, respectively. In Table 7, the resiliency to packers of WSB and DKISB is compared. Each row corresponds to the original version of a program (S) and each column to a packed version of it. There are three packed versions for each software.
They are the versions packed using Upx, packed using MidgetPack with the parameter curve (Mi-c) and packed using MidgetPack with the parameter password (Mi-p), respectively. Each value in the table is the similarity between the original version and the corresponding packed version. A dash (–) in the table denotes that the similarity of this pair cannot be calculated.

Table 7. Comparing the resiliency to packers between WSB and DKISB.
| S | WSB Mi-c | WSB Mi-p | WSB Upx | DKISB Mi-c | DKISB Mi-p | DKISB Upx |
|---|---|---|---|---|---|---|
| bzip2 | 0.999 | 0.999 | 0.995 | – | – | 1.000 |
| gzip | 0.988 | 0.988 | 0.912 | – | – | 1.000 |
| pigz | 0.986 | 0.989 | 0.967 | – | – | 1.000 |
| lzip | 0.932 | 0.930 | 0.919 | – | – | 1.000 |
| feh | 0.920 | 0.927 | 0.900 | – | – | 1.000 |
| pho | 0.938 | 0.934 | 0.942 | – | – | 0.999 |
| qiv | 0.957 | 0.968 | 0.964 | – | – | 0.999 |
| sxiv | 0.940 | 0.923 | 0.874 | – | – | 0.999 |
| md5 | 0.919 | 0.882 | 0.937 | – | – | 1.000 |
| OL | 0.967 | 0.977 | 0.971 | – | – | 1.000 |
| Cal | 0.871 | 0.871 | 0.883 | – | – | 1.000 |

All the similarities are higher than 0.8 for WSB, and most of them are higher than 0.9. This indicates that WSB is resilient to the two packers. However, the similarities for the MidgetPack versions cannot be calculated with DKISB. The reason is that a version packed using MidgetPack needs a key or password from external input to decrypt the original program. When using DKISB, the DDFT module tracks the key or password, so the module obtains instructions that are not related to the original program. Furthermore, the analysis module runs out of memory because the decryption process is too complicated. Therefore, the key instruction sequences cannot be obtained. Of the two packers used in the experiments, DKISB is resilient only to Upx and does not work for MidgetPack, whereas the proposed WSB resists both Upx and MidgetPack.

4.4. Discussion of parameters N and ε

With different values of N, which is used in the n-gram method, the similarities of software pairs can differ slightly, and this may affect the final decision. Therefore, it is necessary to find a suitable value of N. The choice of the threshold ε is also important to the final decision. In this section, the choices of the parameters N and ε are discussed in order to find their most suitable values. To comprehensively evaluate the performance of our proposed birthmark, the extended definitions of precision, recall and F-Measure in [37] are adopted, because the detection result given by our method may be inconclusive according to formula (4). They are defined as follows:

$$\mathrm{Precision} = \frac{|EP \cap JP| + |EI \cap JI|}{|JP| + |JI|} \tag{5}$$

$$\mathrm{Recall} = \frac{|EP \cap JP| + |EI \cap JI|}{|EP| + |EI|} \tag{6}$$

$$\mathrm{FM} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7}$$

where EP represents the set of comparison pairs that are actual plagiarism and JP represents the set of pairs that are judged as plagiarism by our method. Similarly, EI represents the set of pairs that are independent and JI represents the set of pairs that are judged as independent by our method. As mentioned above, the detection result of our approach relies on both parameters N and ε. By increasing ε from 0.05 to 0.5 in increments of 0.05, we draw the precision, recall and FM curves with N varied from 2 to 8. All the test software comes from Table 1 and its plagiarized versions. There are in total 120 pairs of independent software and 2271 pairs of plagiarism software. Figures 8–10 show the precision, recall and FM curves of WSB without considering the imbalance of the data. It is observed that the most suitable value of N is 3 and ε is 0.5. In our experiments, the maximum of FM is 0.994 in this situation. However, the numbers of independent software pairs and plagiarism software pairs are imbalanced. Following [38], we use an undersampling method to divide the set of plagiarism software pairs into 19 subsets, so that each subset is comparable in size to the set of independent pairs (2271/19 ≈ 120).
Each subset is used together with the set of independent software pairs to calculate the precision, recall and FM, and the averages of these precisions, recalls and FMs are used as the final precision, recall and FM. Figures 11–13 show the precision, recall and FM curves of WSB when the data are balanced. The most suitable value of N is again 3 and ε is again 0.5. The maximum of FM is now 0.968.

Figure 8. Precision of WSB when data are imbalanced.

Figure 9. Recall of WSB when data are imbalanced.

Figure 10. F-Measure of WSB when data are imbalanced.

Figure 11. Precision of WSB when data are balanced.

Figure 12. Recall of WSB when data are balanced.

Figure 13. F-Measure of WSB when data are balanced.

Moreover, two or more of the 19 subsets can be merged into one subset before calculating the FM with the set of independent software pairs, so that the average of the maximum FMs can be calculated for different proportions of independent software pairs to plagiarism software pairs. Figure 14 shows the relationship between the maximum FM and the proportion. The x-axis represents the proportion and the y-axis shows the value of the maximum FM. In this experiment, we choose the proportions 1:1, 1:2, 1:3, 1:4, 1:5, 1:6, 1:7, 1:8, 1:9 and 1:19 and calculate the maximum FM for each. As the data become more balanced, the FM value gradually decreases.
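The evaluation procedure of this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the three-way rule in `judge` is assumed from the resiliency and credibility criteria (formula (4) is not reproduced here), the round-robin split into subsets is an assumption because the partitioning scheme is not specified, and all function names and the toy similarity values are hypothetical.

```python
from statistics import mean

PLAGIARISM, INDEPENDENT, INCONCLUSIVE = "plagiarism", "independent", "inconclusive"


def judge(sim, eps):
    """Three-way verdict for one pair (assumed rule: sim >= 1 - eps means
    plagiarism, sim <= eps means independent, otherwise inconclusive)."""
    if sim >= 1 - eps:
        return PLAGIARISM
    if sim <= eps:
        return INDEPENDENT
    return INCONCLUSIVE


def extended_scores(pairs, eps):
    """Extended precision, recall and FM following Eqs. (5)-(7).

    pairs: list of (similarity, truth), truth being PLAGIARISM or INDEPENDENT.
    """
    verdicts = [(judge(sim, eps), truth) for sim, truth in pairs]
    correct = sum(v == t for v, t in verdicts)            # |EP ∩ JP| + |EI ∩ JI|
    judged = sum(v != INCONCLUSIVE for v, _ in verdicts)  # |JP| + |JI|
    total = len(verdicts)                                 # |EP| + |EI|
    precision = correct / judged if judged else 0.0
    recall = correct / total if total else 0.0
    denom = precision + recall
    fm = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fm


def balanced_scores(plag_sims, indep_sims, eps, k=19):
    """Undersampling: split the plagiarism pairs into k subsets, score each
    subset together with all independent pairs, then average the scores.
    The round-robin split used here is an assumption."""
    indep = [(s, INDEPENDENT) for s in indep_sims]
    results = []
    for i in range(k):
        subset = [(s, PLAGIARISM) for s in plag_sims[i::k]]
        if subset:  # skip empty subsets when k exceeds the number of pairs
            results.append(extended_scores(subset + indep, eps))
    return tuple(mean(column) for column in zip(*results))


if __name__ == "__main__":
    # Toy similarities, for illustration only.
    plag = [0.999, 0.988, 0.871, 0.794]
    indep = [0.055, 0.475, 0.700]
    pairs = [(s, PLAGIARISM) for s in plag] + [(s, INDEPENDENT) for s in indep]
    for eps in (0.2, 0.5):
        print(eps, extended_scores(pairs, eps))
    print(balanced_scores(plag, indep, eps=0.2, k=2))
```

Note that at ε = 0.5 the two thresholds coincide, so no pair is inconclusive; in this sketch a similarity of exactly 0.5 is resolved as plagiarism because that test comes first, which is a design choice of the sketch rather than something stated in the text.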
In our experiments, the maximum value of FM is equal to 0.968 when the data are balanced. This indicates that WSB can satisfy both the credibility and the resiliency requirements of a birthmark.

Figure 14. The relationship between the maximum F-Measure and the proportion.

5. CONCLUSIONS

Although most existing dynamic birthmarks are highly resilient to obfuscation techniques, they are often not suitable for encryption packers. The decryption process executes alongside the original program, which causes noise that interferes with the dynamic birthmark. Our WSB takes advantage of the different patterns of dynamic basic block replications to reduce the effect of these disturbances. It is resilient not only to obfuscation techniques but also to encryption packers. Our experiments also show that WSB is a high-quality birthmark which maintains both credibility and resiliency. It even outperforms DKISB in some situations. However, as a dynamic software birthmark, WSB still has the following limitations:

(1) WSB can only be used for executable files; dynamic or static link library files are not suitable.

(2) This birthmark mainly considers the patterns of dynamic basic block call relationships and does not take program statements and semantics into consideration. Therefore, this method is not applicable to component plagiarism.

(3) This birthmark is a fine-grained birthmark based on dynamic basic blocks. When the software is very large, extracting the birthmark is time-consuming because a large log file must be parsed.

Our future research will focus on the above issues. For library files, a dynamic software birthmark does not work because library files cannot run independently. Component plagiarism is also hard to detect with a dynamic software birthmark because it is difficult to locate the copied parts.
Static software birthmarks may be more suitable for these situations, and symbolic execution may be used to enhance semantic analysis. To accommodate large-scale software, we can try to find an effective coarse-grained dynamic birthmark.

FUNDING

This work was supported by the National Key Research and Development Program (2016QY06X1205 and 2016YFB0800605), the National Natural Science Foundation of China (91338107) and Technology Research and Development Program of Sichuan, China (17ZDYF2583).

ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their valuable suggestions and comments, which helped us to improve the quality of this paper.

Footnotes

1 A binary packer for ELF binaries. https://github.com/arisada/midgetpack
2 It is a special case of pq-Gram in graph when p equals n−2 and q equals 1.
3 A second generation Java obfuscator, which offers a full spectrum of protection for your intellectual property. http://www.allatori.com.
4 DynamoRIO is a runtime code manipulation system that supports code transformations on any part of a program, while it executes. http://www.dynamorio.org/.

REFERENCES

1 BSA. Unlicensed software use still high globally despite costly cybersecurity threats. http://globalstudy.bsa.org/2016/.
2 Revulytics. Top 20 countries for software piracy and license misuse (2017). https://www.revulytics.com/blog/top-20-countries-software-piracy-2017.
3 Lancaster, T. and Culwin, F. (2004) A comparison of source code plagiarism detection engines. Comput. Sci. Educ., 14, 101–112.
4 Ceccato, M., Di Penta, M., Nagra, J., Falcarin, P., Ricca, F., Torchiano, M. and Tonella, P. (2009) The Effectiveness of Source Code Obfuscation: An Experimental Assessment. IEEE Int. Conf. Program Comprehension, Vancouver, British Columbia, Canada, 17–19 May, pp. 178–187. IEEE Computer Society, Los Alamitos, CA.
5 Collberg, C. and Thomborson, C. (1999) Software Watermarking: Models and Dynamic Embeddings. Proc. 26th ACM SIGPLAN-SIGACT Symp. Principles of Programming Languages, San Antonio, TX, USA, 20–22 January, pp. 311–324. ACM, New York, NY, USA.
6 Collberg, C., Carter, E., Debray, S., Huntwork, A., Kececioglu, J., Linn, C. and Stepp, M. (2004) Dynamic path-based software watermarking. ACM Sigplan Not., 39, 107–118.
7 NAIST-IS-TR2003014 (2003) Detecting the Theft of Programs Using Birthmarks. Springer, New York.
8 Kim, D., Han, Y., Cho, S.-j., Yoo, H., Woo, J., Nah, Y., Park, M. and Chung, L. (2013) Measuring Similarity of Windows Applications Using Static and Dynamic Birthmarks. Proc. 28th Annu. ACM Symp. Applied Computing, Coimbra, Portugal, 18–22 March, pp. 1628–1633. ACM, New York, NY, USA.
9 Roundy, K.A. and Miller, B.P. (2013) Binary-code obfuscations in prevalent packer tools. ACM Comput. Surv., 46, 1–32.
10 Ugarte-Pedrero, X., Santos, I., Sanz, B., Laorden, C. and Bringas, P.G. (2012) Countering Entropy Measure Attacks on Packed Software Detection. Consumer Communications and Networking Conference (CCNC), 2012 IEEE, Planet Hollywood, Las Vegas, NV, USA, 14–17 January, pp. 164–168. IEEE Computer Society, Los Alamitos, CA.
11 Bai, Y., Sun, X., Sun, G., Deng, X. and Zhou, X. (2008) Dynamic k-Gram based Software Birthmark. 19th Austral. Conf. Software Engineering, Perth, WA, Australia, 26–28 March, pp. 644–649. IEEE Computer Society, Los Alamitos, CA.
12 Yuanpeng, S. (2012) Software protection method based on shell technology. Softw. Eng. Appl., 01, 47–53.
13 Tian, Z., Zheng, Q., Liu, T. and Fan, M. (2013) DKISB: Dynamic Key Instruction Sequence Birthmark for Software Plagiarism Detection. IEEE Int. Conf. High Performance Computing and Communications & 2013 IEEE Int. Conf. Embedded and Ubiquitous Computing, Zhangjiajie, China, 13–15 November, pp. 619–627. IEEE Computer Society, Los Alamitos, CA.
14 Augsten, N., Böhlen, M. and Gamper, J. (2005) Approximate Matching of Hierarchical Data Using pq-Grams. Proc. 31st Int. Conf. Very Large Data Bases, Trondheim, Norway, 30 August–2 September, pp. 301–312. VLDB Endowment.
15 Grier, S. (1981) A tool that detects plagiarism in Pascal programs. ACM SIGCSE Bull., 13, 15–20.
16 Aiken, A. (1994) A system for detecting software plagiarism. Retrieved April, 1, 2010.
17 Wise, M.J. (1996) Yap3: improved detection of similarities in computer program and other texts. ACM SIGCSE Bull., 28, 130–134.
18 Prechelt, L., Malpohl, G. and Philippsen, M. (2002) Finding plagiarisms among a set of programs with jplag. J. UCS, 8, 1016.
19 Flores, E., Barrón-Cedeño, A., Rosso, P. and Moreno, L. (2012) Desocore: Detecting Source Code Re-use Across Programming Languages. Proc. 2012 Conf. North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstration Session, Montreal, Canada, 3–8 June, pp. 1–4. Association for Computational Linguistics, Stroudsburg, PA, USA.
20 Flores, E., Barrón-Cedeño, A., Moreno, L. and Rosso, P. (2015) Uncovering source code reuse in large-scale academic environments. Comput. Appl. Eng. Educ., 23, 383–390.
21 Flores, E., Barrón-Cedeño, A., Moreno, L. and Rosso, P. (2015) Cross-language source code re-use detection using latent semantic analysis. J. Universal Comput. Sci., 21, 1708–1725.
22 Flores, E., Rosso, P., Moreno, L. and Villatoro, E. (2014) Pan@fire: Overview of SOCO Track on the Detection of Source Code Re-use. Notebook Papers of FIRE 2014, FIRE-2014, Bangalore, India, 5–7 December, pp. 1–11. ACM, New York, NY, USA.
23 Flores, E., Rosso, P., Villatoro, E., Moreno, L., Alcover, R. and Chirivella, V. (2015) Pan@fire: Overview of CL-SOCO Track on the Detection of Cross-Language Source Code Re-use. Notebook Papers of FIRE 2015, FIRE-2015, Gandhinagar, India, 4–6 December, pp. 1–5. CEUR-WS.org.
24 Tamada, H., Nakamura, M., Monden, A. and Matsumoto, K.-I. (2004) Design and Evaluation of Birthmarks for Detecting Theft of Java Programs. IASTED Conf. Software Engineering, Innsbruck, Austria, 17–19 February, pp. 569–574. ACTA Press, Calgary, AB, Canada.
25 Myles, G. and Collberg, C. (2005) K-Gram Based Software Birthmarks. Proc. 2005 ACM Symp. Applied Computing, Santa Fe, New Mexico, 13–17 March, pp. 314–318. ACM, New York, NY, USA.
26 Lim, H.-i., Park, H., Choi, S. and Han, T. (2009) A method for detecting the theft of java programs through analysis of the control flow information. Inf. Softw. Technol., 51, 1338–1350.
27 Luo, L., Ming, J., Wu, D., Liu, P. and Zhu, S. (2014) Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison with Applications to Software Plagiarism Detection. Proc. 22nd ACM SIGSOFT Int. Symp. Foundations of Software Engineering, Hong Kong, China, 16–22 November, pp. 389–400. ACM, New York, NY, USA.
28 Myles, G. and Collberg, C. (2004) Detecting Software Theft via Whole Program Path Birthmarks. Lecture Notes in Computer Science, 3225, 404–415.
29 Wang, X., Jhi, Y.-C., Zhu, S. and Liu, P. (2009) Behavior Based Software Theft Detection. Proc. 16th ACM Conf. Computer and Communications Security, Chicago, IL, USA, 09–13 November, pp. 280–290. ACM, New York, NY, USA.
30 Chan, P.P., Hui, L.C. and Yiu, S.-M. (2011) Dynamic Software Birthmark for Java Based on Heap Memory Analysis. IFIP Int. Conf. Communications and Multimedia Security, Ghent, Belgium, 19–21 October, pp. 94–107. Springer-Verlag, Berlin, Heidelberg.
31 Tamada, H., Okamoto, K., Nakamura, M., Monden, A. and Matsumoto, K.-i. (2004) Dynamic Software Birthmarks to Detect the Theft of Windows Applications. In Proc. Int. Symp. Future Software Technology 2004, Xi’an, China, 20–22 October, pp. 1–6. ISFST, Wuhan, China.
32 Chae, D.-K., Kim, S.-W., Cho, S.-J. and Kim, Y. (2015) Daav: Dynamic API Authority Vectors for Detecting Software Theft. Proc. 24th ACM Int. Conf. Information and Knowledge Management, Melbourne, VIC, Australia, 19–23 October, pp. 1819–1822. ACM, New York, NY, USA.
33 Wang, X., Jhi, Y.-C., Zhu, S. and Liu, P. (2009) Detecting Software Theft via System Call Based Birthmarks. Computer Security Applications Conference, 2009. ACSAC’09. Annual, Honolulu, HI, 7–11 December, pp. 149–158. IEEE Computer Society, Los Alamitos, CA.
34 Jhi, Y.-C., Wang, X., Jia, X., Zhu, S., Liu, P. and Wu, D. (2011) Value-Based Program Characterization and Its Application to Software Plagiarism Detection. 2011 33rd Int. Conf. Software Engineering (ICSE), Waikiki, Honolulu, Hawaii, 21–28 May, pp. 756–765. IEEE Computer Society, Los Alamitos, CA.
35 Zhang, F., Wu, D., Liu, P. and Zhu, S. (2014) Program Logic Based Software Plagiarism Detection. 2014 IEEE 25th Int. Symp. Software Reliability Engineering (ISSRE), Naples, Italy, 3–6 November, pp. 66–77. IEEE Computer Society, Los Alamitos, CA.
36 Lee, D., Choi, Y., Jung, J., Kim, J. and Won, D. (2015) An efficient categorization of the instructions based on binary executables for dynamic software birthmark. Int. J. Inf. Educ. Technol., 5, 571.
37 Tian, Z., Zheng, Q., Liu, T., Fan, M., Zhang, X. and Yang, Z. (2014) Plagiarism Detection for Multithreaded Software Based on Thread-Aware Software Birthmarks. Proc. 22nd Int. Conf. Program Comprehension, Hyderabad, India, 31 May–07 June, pp. 304–313. ACM, New York, NY, USA.
38 He, H. and Garcia, E.A. (2009) Learning from imbalanced data. IEEE Trans. Knowl. Data Eng., 21, 1263–1284.

Author notes

Handling editor: Albert Levi

© The British Computer Society 2018. All rights reserved.
For Permissions, please email: journals.permissions@oup.com This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/about_us/legal/notices)

Journal

The Computer Journal, Oxford University Press

Published: Jun 1, 2018
