TY - JOUR
AU - Martínez-Graullera, O.
AB - Initially developed for tasks related to computer graphics, graphics processing units are increasingly being used for general-purpose processing, including scientific and engineering applications. In this contribution, we have compared the performance of two graphics cards that belong to the parallel computing Compute Unified Device Architecture (CUDA) platform with two C++ and Java multi-threading implementations, using as an example of computation a brute-force attack on Hitag2, a well-known remote keyless entry application. The results allow us to provide valuable information regarding the compared capabilities of the tested platforms and to confirm that such a weak encryption system could be broken in less than a day with medium-cost equipment.

1 Introduction

In symmetric encryption algorithms, brute-force attacks consist of checking all possible keys until the correct one is found. In the worst case, all the keys in the entire key space are tested, while on average it is necessary to check only half the number of possible keys. Modern encryption algorithms are designed so that this kind of attack is infeasible, as the search for the key would take millions of years. Legacy algorithms were also designed with that goal in mind, with the difference that their designers could not anticipate the spectacular increase in computing capability that could be employed by organized groups or even individuals in such a task. For that reason, we consider it of interest to compare the computing capability provided by several platforms when using a brute-force approach on legacy algorithms such as Hitag2, which was introduced in 1996 but can still be found in millions of devices [1]. Moreover, the simplicity of Hitag2 makes it an ideal candidate for implementation using Compute Unified Device Architecture (CUDA), the parallel computing platform based on graphics processing units (GPUs) created by NVIDIA. In addition, we decided to implement other versions of our brute-force application using Java and C++, so we could test their multi-threading capabilities; in the case of the C++ implementation we have used Open Multi-Processing (OpenMP), a well-known multi-threading library. In this way, we have been able to check the latest improvements in Java regarding its performance and analyse how fast it is compared to a native C++ application.

The results permit us to state that these implementations allow attackers to complete a brute-force attack against this kind of algorithm in less than a day using publicly available hardware (in our case, a medium-tier CUDA card). It must be clarified that, given the design of Hitag2, this stream cipher has been considered insecure for some years [2–5], and as such it can be attacked using expensive devices such as COPACOBANA [6]. Thus, our goal is not to show that Hitag2 is insecure but to compare low and medium cost technologies that can be used for obtaining the encryption key with a single computer in the scope of the protocol used by Hitag2. In recent years there have been some studies by other researchers about implementing or even breaking cryptographic algorithms using CUDA technology (see e.g. [7–10]). However, none of those studies is focused on Hitag2, and they do not compare CUDA implementations with Java and C++/OpenMP versions of the algorithm being tested.
This contribution is an extension of the work presented in [11], which studied the legacy encryption algorithm KeeLoq instead of Hitag2. In addition, in this study two CUDA cards (GTX 950 and GTX 1070) have been used instead of only one, in order to extend the research to medium-cost hardware. The rest of this paper is organized as follows: in Section 2, we present a brief overview of the Hitag2 algorithm. Section 3 describes the CUDA, C++/OpenMP and Java platforms used in the comparison. In Section 4, we offer a description of our two implementations for those platforms, including relevant code of the optimized CUDA version. The results obtained are presented in Section 5. Finally, our conclusions are presented in Section 6.

2 Hitag2

2.1 Algorithm

Hitag2 is a stream cipher that consists of an internal 48-bit Linear Feedback Shift Register (LFSR) and a nonlinear filter function $f$, as can be observed in Figures 1 and 2. Hitag2 is the successor of Crypto1, another proprietary encryption algorithm created by NXP Semiconductors specifically for Mifare Radio Frequency Identification tags.

Figure 1. Hitag2 initialization phase.

Figure 2. Hitag2 operation phase.

In addition to the 48-bit key, this cipher uses a 32-bit serial number and a 32-bit Initialization Vector (IV). After a set-up phase of 32 cycles, the cipher works in an autonomous mode where the content of the register defines both the next encryption bit and how the register is updated. Thus, the total number of cycles is defined by the length of the bitstream that needs to be encrypted.

The filter function $f$ consists of three different functions: $f_{a}$, $f_{b}$ and $f_{c}$. While $f_{a}$ and $f_{b}$ take four bits as input and produce one bit as output, $f_{c}$ uses five bits in order to generate the final result in the form of a single bit. The three functions, which are used both in the initialization phase and the operation phase, can be modelled as Boolean tables allowing easy implementations, so the output of each function for the input $i$ is the $i$-th bit of the values given below:
\begin{align*}
f_{a}(i) &= (0\textrm{x}2C79)_{i},\\
f_{b}(i) &= (0\textrm{x}6671)_{i},\\
f_{c}(i) &= (0\textrm{x}7907287B)_{i}.
\end{align*}

In the initialization phase (see Figure 1), the register is initially filled with the 32 bits of the serial number and the first 16 bits of the key. If the serial number is expressed as $id_i$ $(0 \leq i \leq 31)$ and the key is expressed as $k_i$ $(0 \leq i \leq 47)$, the register bits $r_i$ $(0 \leq i \leq 47)$ adopt the following initial state:
\begin{align*}
r_{i} &= id_i \quad (0 \leq i \leq 31),\\
r_{32+i} &= k_i \quad (0 \leq i \leq 15).
\end{align*}
In each cycle, the bit generated by $f_{c}$ is XORed with the corresponding bits of the IV and the key, generating a bit that is inserted into the register at position 47, shifting the register one bit to the left in the process. The new bit is computed according to the expression $f_c \oplus iv_{i} \oplus k_{i+16}$, where $iv_i$ denotes the $i$-th bit of the IV and $0 \leq i \leq 31$.
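Because the three functions are plain table look-ups, the filter and one initialization round can be expressed in a few lines. The following C++ fragment is only an illustrative sketch, not the authors' code: the constants are the tables listed above, but the positions of the register bits feeding $f_a$ and $f_b$, and the ordering of the bits inside each look-up index, are assumptions taken from publicly available Hitag2 descriptions and should be checked against Figure 1.

#include <cstdint>

// Boolean tables of the three filter functions, exactly as listed above.
static const uint32_t FA = 0x2C79;      // f_a: 4 input bits -> 1 output bit
static const uint32_t FB = 0x6671;      // f_b: 4 input bits -> 1 output bit
static const uint32_t FC = 0x7907287B;  // f_c: 5 input bits -> 1 output bit

// Bit i of the 48-bit register (bit 0 corresponds to r_0).
static inline uint32_t bit(uint64_t r, int i) { return (uint32_t)(r >> i) & 1u; }

// Look-up index built from four register taps; the ordering of the taps
// inside the index is an assumption of this sketch.
static inline uint32_t idx4(uint64_t r, int a, int b, int c, int d) {
    return bit(r, a) | (bit(r, b) << 1) | (bit(r, c) << 2) | (bit(r, d) << 3);
}

// Nonlinear filter f: two f_a and three f_b look-ups feed f_c. The tap
// positions follow publicly available Hitag2 descriptions.
static uint32_t f(uint64_t r) {
    uint32_t i5 = ((FA >> idx4(r,  2,  3,  5,  6)) & 1u)
                | (((FB >> idx4(r,  8, 12, 14, 15)) & 1u) << 1)
                | (((FB >> idx4(r, 17, 21, 23, 26)) & 1u) << 2)
                | (((FB >> idx4(r, 28, 29, 31, 33)) & 1u) << 3)
                | (((FA >> idx4(r, 34, 43, 44, 46)) & 1u) << 4);
    return (FC >> i5) & 1u;
}

// One initialization round: the register shifts so that r_i becomes r_{i-1}
// and the new bit f(r) XOR iv_i XOR k_{i+16} enters at position 47.
static uint64_t init_round(uint64_t r, uint32_t iv_i, uint32_t k_i_plus_16) {
    uint64_t newbit = (uint64_t)(f(r) ^ iv_i ^ k_i_plus_16) & 1u;
    return (r >> 1) | (newbit << 47);
}

With these two helpers, the 32 cycles of the set-up phase reduce to a simple loop over $i$ from 0 to 31.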
In the operation phase (see Figure 2), the new keystream bit is directly the output of $f_c$, while the bit inserted into the register at position 47 in each cycle is the result of the concatenated XOR operations $r_{0} \oplus r_{2} \oplus r_{3} \oplus r_{6} \oplus r_{7} \oplus r_{8} \oplus r_{16} \oplus r_{22} \oplus r_{23} \oplus r_{26} \oplus r_{30} \oplus r_{41} \oplus r_{42} \oplus r_{43} \oplus r_{46} \oplus r_{47}$.

In order to decrypt data protected with this algorithm, the receiver needs to know both the key and the IV. Alternatively, decryption can be achieved by using the key and the result of XORing the IV with the bits produced by $f_c$ during the initialization phase, as can be deduced from Figure 1. This feature will be used in our fine-tuned implementation.

2.2 Protocol

Hitag2 has been widely used in automotive Remote Keyless Entry (RKE) and Passive Keyless Entry (PKE) systems. An RKE system consists of a radio frequency transmitter embedded into a car key that sends a short burst of digital data to a receiver in the vehicle, where it is decoded. In this context, users have to actively initiate the authentication process by pressing a button on their car key. The frequency used by RKE systems is 315 MHz in the USA and Japan and 433 MHz in Europe. In comparison, PKE systems allow users to unlock their cars automatically when they approach the vehicle, without having to press any button, as a bidirectional communication takes place between the car key and the vehicle when the transmitter is within the system's range. PKE systems typically operate at a frequency of 125 kHz.

In the PKE protocol analysed in this contribution, which was reverse engineered and published online in 2008 [12], the communication between a reader (the vehicle) and a transponder (typically embedded in the car key) starts with the reader, which sends an authenticate command to the transponder, as illustrated in Figure 3.

Figure 3. Hitag2 protocol.

Upon reception of this command, the transponder replies with a 32-bit message containing its serial number. Then, the reader generates a 32-bit IV and uses that value, together with the 48-bit key belonging to the transponder, to encrypt the hexadecimal value FFFFFFFF, which will be represented as $\overline{\mbox{FFFFFFFF}}$. Next, the reader sends that encrypted element together with the encrypted IV to the transponder (the encrypted IV, denoted as $\overline{\mbox{IV}}$, is the result of XORing the IV and the first 32 bits provided by the $f_c$ function during the initialization phase). If the transponder is able to recover the FFFFFFFF value from the two elements sent by the reader, then it replies to the reader with some configuration bytes, known only to both of them, in encrypted form [1, 13]. In this way, both the reader and the transponder are authenticated by the other participant in the communication.

This protocol provides an easy attack scheme, as any eavesdropper is able to obtain both the plaintext and the ciphertext from the protocol's operation. Having access to the element $\overline{\mbox{IV}}$ is not a problem for the attacker, as that value can be used directly in the decryption. As the number of keys is larger than the number of possible ciphertexts (48 bits versus 32 bits), an attacker will find many keys that convert the same plaintext into the same ciphertext (on average about $2^{16}$ candidates for a single 32-bit pair).
Thus, a brute-force attack such as the one described in this contribution needs an additional step in order to correlate the keys obtained from several encryption pairs.

3 Programming platforms

3.1 CUDA

In past years, one of the dominant trends in microprocessor architectures has been the continuous increase in chip-level parallelism and, as a result, multi-core central processing units (CPUs) providing 8–16 scalar cores are now commonplace. However, GPUs have been at the leading edge of this drive towards increased chip-level parallelism, General-Purpose computing on Graphics Processing Units (GPGPU) being the term that refers to the use of a GPU card to perform computations in applications traditionally managed by the CPU. Because of their particular hardware architecture, GPUs are able to compute certain types of parallel tasks faster than multi-core CPUs, which has motivated their usage in scientific and engineering applications [14]. The disadvantage of using GPUs in those scenarios is their higher power consumption compared to that of traditional CPUs [15].

CUDA, created by NVIDIA, is the best-known GPU-based parallel computing platform and programming model. CUDA is designed to work with C, C++, Fortran and programming frameworks such as OpenACC or OpenCL, though with some limitations. CUDA organizes applications as a sequential host program that may execute parallel programs, referred to as kernels, on a CUDA-capable device. The compute capability of a device specifies characteristics such as the maximum number of resident threads or the amount of shared memory per multi-processor, which can vary significantly from one version to another (and, consequently, from one graphics card to another) [16]. In order to work with CUDA applications, the programmer needs to copy data from host memory to device memory, invoke kernels and then copy data back from device memory to host memory.

3.2 C++ and OpenMP

C++ is a programming language designed by Bjarne Stroustrup in 1983 and standardized since 1998 by the International Organization for Standardization (ISO). The latest version is known as C++17 [17]. OpenMP is an Application Programming Interface that supports shared-memory parallel programming in C, C++ and Fortran on several platforms, including GNU/Linux, OS X and Windows. The latest stable version is 4.5, released in November 2015 [18]. When using OpenMP, the section of code that is intended to run in parallel is marked with a preprocessor directive that causes a team of threads to be formed before the section is executed; a short illustrative example is given at the end of this section. By default, each thread executes the parallelized section of code independently. The runtime environment allocates threads to processors depending on usage, machine load and other factors.

3.3 Java

The Java programming language originated in 1990, when a team at Sun Microsystems was working first on the design and development of software for small electronic devices and later on the emerging market of Internet browsing. Once the first official version of Java was launched in 1996, its popularity increased rapidly. Currently there are more than 10 million Java developers and, according to [19], more than 15 billion devices (mainly personal computers, mobile phones and smart cards) run Java. In January 2010, Oracle Corporation completed the acquisition of Sun Microsystems [20], so at this moment the Java technology is managed by Oracle. The latest version, known as Java 9, was launched in September 2017. Between November 2006 and May 2007, Sun Microsystems released most of the Java components under the GNU (GNU's Not Unix!) General Public License model through the OpenJDK project [21], so virtually all the pieces of the Java language are currently free open source software.
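Before moving on to the implementations, the fragment below makes the OpenMP usage model of Section 3.2 concrete. It is only a hedged sketch of how a key-search loop can be parallelized with a single directive; check_key and the other names are placeholders of this example and do not correspond to the code described in Section 4.

#include <cstdint>
#include <cstdio>

// Placeholder for the per-key test (the real test is described in Section 4).
static bool check_key(uint64_t key) { return key == 0xC0FFEEULL; }  // dummy stand-in

int main() {
    const long long total = 1LL << 34;   // number of candidate keys to test

    // The directive below is the only parallel construct needed: OpenMP forms
    // a team of threads and splits the iteration space among them.
    #pragma omp parallel for schedule(static)
    for (long long k = 0; k < total; ++k) {
        if (check_key((uint64_t)k)) {
            #pragma omp critical
            std::printf("candidate key: %012llx\n", (unsigned long long)k);
        }
    }
    return 0;
}

Compiled with the compiler's OpenMP switch (e.g. /openmp in Visual Studio or -fopenmp in GCC), the loop is split among threads; without the switch the same code simply runs sequentially, which is one of the attractions of the directive-based model.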
4 Implementations

4.1 First implementation

The first implementation of the brute-force attack is a direct implementation, in the sense that it completes all the steps that would be performed by the attacker. This means that, taking the encrypted data and the IV as input, the implementation performs the 32 steps of the initialization phase and the 32 steps of the operation phase for all the keys that are tested. It is important to point out that this implementation has been used primarily as a comparison element with regard to the second implementation.

4.2 Second implementation

The second implementation takes advantage of some peculiarities of the encryption algorithm and the protocol that make it possible to increase the performance of the brute-force attack. The first improvement consists in directly using the $\overline{\mbox{IV}}$ element (instead of the original IV), as that data can be easily obtained by the attacker during one of the steps of the protocol. By using $\overline{\mbox{IV}}$, the attacker does not need to complete the initialization phase for each key, replacing that operation with the XOR of $\overline{\mbox{IV}}$ and a 32-bit portion of the key (bits 16 to 47) in order to produce the 32 bits located in the right-hand side of the register at the start of the operation phase. Using this improvement, the running time is practically halved.

The second improvement derives from the fact that the plaintext is known to the attacker and its value is the aforementioned hexadecimal value FFFFFFFF. As each round of the operation phase generates one bit of the keystream, it is possible to discard a candidate key as soon as the keystream bit generated by a round does not produce a 1 when XORed with the corresponding bit of the encrypted value. This means that, in most cases, only a few rounds of the operation phase are completed, in comparison with the full 32 rounds completed for each key in the first implementation. By introducing this feature, we have been able to roughly divide the running time by 16.

The code displayed in Listing 1.1 contains the details of the second version of the CUDA kernel, where one key is tested by each thread (an illustrative reconstruction of such a kernel is sketched after the equipment description below).

Listing 1.1. Portion of code belonging to the CUDA implementation.

5 Results

The tests whose results are presented in this section were completed using the following equipment [22]:
— A PC with an Intel Core i7-3770 processor at 3.40 GHz;
— A GeForce GTX 950 card, a low-tier GPU with 768 processor cores, a base clock of 1024 MHz, a memory bandwidth of 105.6 GB/s and a floating-point performance of 1.85 TeraFLOPS [23];
— A GeForce GTX 1070 card, a medium-tier GPU with 1920 processor cores, a base clock of 1506 MHz, a memory bandwidth of 256.3 GB/s and a floating-point performance of 6.46 TeraFLOPS [24].
While the CUDA and C++/OpenMP applications have been compiled with Visual Studio 2010 and 2017 (the application for the older GeForce GTX 950 was compiled with Visual Studio 2010), the Java application has been compiled with NetBeans 8.0 using the JDK (Java Development Kit) version 1.8.0-141.
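Since Listing 1.1 is reproduced above only by its caption, the following CUDA fragment reconstructs the idea behind the second implementation of Section 4.2; it is an illustrative sketch, not the authors' kernel. Every identifier, the launch configuration, the placeholder value of iv_bar and the bit ordering of the ciphertext are assumptions of this example, and the filter taps are, as before, taken from publicly available Hitag2 descriptions.

#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

__constant__ uint32_t FA = 0x2C79, FB = 0x6671, FC = 0x7907287B;

__device__ inline uint32_t bit(uint64_t r, int i) { return (uint32_t)(r >> i) & 1u; }

__device__ inline uint32_t idx4(uint64_t r, int a, int b, int c, int d) {
    return bit(r, a) | (bit(r, b) << 1) | (bit(r, c) << 2) | (bit(r, d) << 3);
}

// Nonlinear filter f (tap positions as in the sketch of Section 2.1).
__device__ uint32_t f(uint64_t r) {
    uint32_t i5 = ((FA >> idx4(r,  2,  3,  5,  6)) & 1u)
                | (((FB >> idx4(r,  8, 12, 14, 15)) & 1u) << 1)
                | (((FB >> idx4(r, 17, 21, 23, 26)) & 1u) << 2)
                | (((FB >> idx4(r, 28, 29, 31, 33)) & 1u) << 3)
                | (((FA >> idx4(r, 34, 43, 44, 46)) & 1u) << 4);
    return (FC >> i5) & 1u;
}

// One candidate key per thread, with the two improvements of Section 4.2.
__global__ void hitag2_bruteforce(uint64_t base, uint32_t iv_bar, uint32_t ct,
                                  unsigned long long *found)
{
    uint64_t key = base + blockIdx.x * (uint64_t)blockDim.x + threadIdx.x;

    // Improvement 1: skip the initialization phase. Key bits 0..15 occupy
    // r_0..r_15 and (iv_bar XOR key bits 16..47) occupies r_16..r_47.
    uint64_t r = (key & 0xFFFFULL)
               | ((uint64_t)(iv_bar ^ (uint32_t)(key >> 16)) << 16);

    for (int i = 0; i < 32; ++i) {
        // Improvement 2: early abort. Since the plaintext is 0xFFFFFFFF, each
        // keystream bit XORed with the matching ciphertext bit must give 1
        // (the ciphertext bit ordering used here is an assumption).
        if ((f(r) ^ ((ct >> i) & 1u)) == 0u) return;

        // LFSR feedback: XOR of the 16 taps listed in Section 2.1.
        uint32_t fb = bit(r,0) ^ bit(r,2) ^ bit(r,3) ^ bit(r,6) ^ bit(r,7) ^ bit(r,8)
                    ^ bit(r,16) ^ bit(r,22) ^ bit(r,23) ^ bit(r,26) ^ bit(r,30)
                    ^ bit(r,41) ^ bit(r,42) ^ bit(r,43) ^ bit(r,46) ^ bit(r,47);
        r = (r >> 1) | ((uint64_t)fb << 47);
    }
    *found = key;   // all 32 bits matched (simplified: last writer wins)
}

int main() {
    const uint32_t iv_bar = 0;            // encrypted IV captured from the protocol (placeholder)
    const uint32_t ct     = 0x1CE18551;   // encrypted 0xFFFFFFFF from Section 5

    unsigned long long *d_found = nullptr, h_found = 0;
    cudaMalloc(&d_found, sizeof(*d_found));
    cudaMemset(d_found, 0, sizeof(*d_found));

    // Host side: launch the kernel repeatedly over slices of the key space
    // (error checking omitted; grid/block follow the best setting of Table 10).
    const uint64_t keys_per_launch = 16384ULL * 1024ULL;
    for (uint64_t base = 0; base < (1ULL << 34); base += keys_per_launch)
        hitag2_bruteforce<<<16384, 1024>>>(base, iv_bar, ct, d_found);

    cudaMemcpy(&h_found, d_found, sizeof(h_found), cudaMemcpyDeviceToHost);
    std::printf("candidate key: %012llx\n", h_found);
    cudaFree(d_found);
    return 0;
}

Note that, as discussed in Section 2.2, several candidate keys will normally satisfy a single 32-bit pair, so a real attack would collect all survivors and confirm them against a second captured pair.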
In all the tests that have been performed, each application had to check the first $2^{34}$ possible keys (an arbitrary value large enough in order to obtain valid conclusions) using an encryption/decryption pair generated with the following values:
— Serial number: 0x87654321.
— IV: 0x75B5DE65.
— Plaintext: 0xFFFFFFFF.
— Ciphertext: 0x1CE18551.

5.1 Results of the first implementation

Table 1 shows the running time in seconds of the C++/OpenMP and Java implementations when using a different number of concurrent threads. Tables 2 and 3 show the running time of the first CUDA application when executed on the GeForce GTX 950 with different grid sizes and a constant block size of 512 and 1024, respectively. Finally, Tables 4 and 5 include the running time of the first CUDA application when executed on the GeForce GTX 1070 with different grid sizes and a constant block size of 512 and 1024, respectively.

Table 1. Running time in seconds using the C++ and Java multi-threaded first implementation
Number of threads      1         2        4        8        16       32
C++                    18126.60  9084.68  4625.80  3749.45  3748.61  3747.32
Java                   17548.88  8461.70  4496.55  3744.72  3694.46  3817.03

Table 2. Running time in seconds using the first CUDA implementation with a block size of 512 on the GeForce GTX 950 card
Grid size       512     1024    2048    4096    8192    16384
Running time    197.62  196.72  194.92  193.97  193.65  193.43

Table 3. Running time in seconds using the first CUDA implementation with a block size of 1024 on the GeForce GTX 950 card
Grid size       512     1024    2048    4096    8192    16384
Running time    194.93  193.68  192.72  192.22  191.96  191.87

Table 4. Running time in seconds using the first CUDA implementation with a block size of 512 on the GeForce GTX 1070 card
Grid size       512     1024    2048    4096    8192    16384
Running time    53.33   50.68   49.39   48.63   48.30   48.12
Table 5. Running time in seconds using the first CUDA implementation with a block size of 1024 on the GeForce GTX 1070 card
Grid size       512     1024    2048    4096    8192    16384
Running time    51.38   49.60   48.90   48.51   48.32   48.21

5.2 Results of the second implementation

Table 6 shows the running time in seconds of the C++/OpenMP and Java implementations when using a different number of concurrent threads in the second implementation. Tables 7 and 8 show the running time of the second CUDA application when executed on the GeForce GTX 950 with different grid sizes and a constant block size of 512 and 1024, respectively. Finally, Tables 9 and 10 include the running time of the second CUDA application when executed on the GeForce GTX 1070 with different grid sizes and a constant block size of 512 and 1024, respectively.

Table 6. Running time in seconds using the C++ and Java multi-threaded second implementation
Number of threads      1        2        4        8        16       32
C++                    543.08   277.27   167.64   127.85   123.57   119.97
Java                   525.89   268.76   161.10   132.67   123.96   121.29

Table 7. Running time in seconds using the second CUDA implementation with a block size of 512 on the GeForce GTX 950 card
Grid size       512    1024   2048   4096   8192   16384
Running time    11.20  8.96   8.31   7.68   7.40   7.25

Table 8. Running time in seconds using the second CUDA implementation with a block size of 1024 on the GeForce GTX 950 card
Grid size       512    1024   2048   4096   8192   16384
Running time    9.29   8.37   7.75   7.45   7.34   7.25
Table 9. Running time in seconds using the second CUDA implementation with a block size of 512 on the GeForce GTX 1070 card
Grid size       512    1024   2048   4096   8192   16384
Running time    5.16   3.81   3.03   2.73   2.54   2.43

Table 10. Running time in seconds using the second CUDA implementation with a block size of 1024 on the GeForce GTX 1070 card
Grid size       512    1024   2048   4096   8192   16384
Running time    3.90   3.32   2.77   2.58   2.45   2.38

Figure 4 shows a graphic representation of the best running time obtained with the second implementation by each platform.

Figure 4. Running time comparison.

6 Conclusions

In this contribution we have compared the computing capability of several hardware and software technologies, using as an example a cryptographic brute-force attack on the legacy algorithm Hitag2. More specifically, we have compared two versions of a CUDA application, a C++ implementation using the OpenMP library and a Java application that uses the multi-threading capabilities provided by the language.

These tests have shown that, with the best configuration in each case, the native C++/OpenMP application provides a performance only slightly better than that of the interpreted Java code. Given that the code in each language was very similar, the most probable explanation is the use of basic data types in both cases, which allowed us to avoid slow-performing Java classes such as BigInteger. In addition, it is important to take into account that Java's Just-In-Time (JIT) compiler improves the performance by compiling Java bytecodes into native machine code at run time [25], which also helps to explain the similarity of its performance to that obtained with the C++/OpenMP code. As multi-threading capabilities are available in Java by default, without having to add any third-party library, it can be stated that Java is a viable alternative to C++ for this kind of development. Regarding the increase in performance when using more threads, the tests show that the improvement is tightly related to the number of physical cores, not to the number of logical cores (the processor used in the tests has four physical cores and eight logical cores).

Considering all the results, the superiority of CUDA cards with respect to advanced CPUs for certain intensive computing tasks is clear. The best result obtained with the GeForce GTX 1070 provides a performance almost 50 times better than that of the C++/OpenMP implementation when using the full capacity of the i7-3770 processor (2.38 seconds versus 119.97 seconds for the same $2^{34}$ keys).
Even when using some of the most powerful CPUs available today, such as Intel's i9-7980XE (18 physical cores, priced around $2,000) or AMD's Ryzen Threadripper 1950X (16 physical cores, priced around $1,000), the performance would still fall far short of the levels obtained with a comparatively cheaper CUDA card. Regarding the comparison between the two CUDA devices, the results are aligned with the technical capabilities of the cards, such as the number of cores, the memory bandwidth and the floating-point performance. In both cases, the best results are obtained when using a bigger grid size, which allows fewer kernel launches to be performed and avoids some latency issues.

Using the best result obtained with the CUDA versions, it can be extrapolated that the whole set of $2^{48}$ keys could be tested in less than half a day (scaling the best time of 2.38 seconds for $2^{34}$ keys by a factor of $2^{14}$ gives approximately 39000 seconds, i.e. slightly less than 11 hours). This result could be vastly improved by using other CUDA cards such as the Tesla P100 (3584 processor cores and a floating-point performance of 10.6 TeraFLOPS), the GeForce GTX 1080 Ti (3584 processor cores and 11.3 TeraFLOPS) or the TITAN V (5120 processor cores and 14.9 TeraFLOPS) [26, 27].

Acknowledgements

This work has been partially supported by Ministerio de Economía, Industria y Competitividad (MINECO), Agencia Estatal de Investigación (AEI) and Fondo Europeo de Desarrollo Regional (FEDER, UE) under project COPCIS (TIN2017-84844-C2-1-R), and by Comunidad de Madrid (Spain) under project CIBERDINE (S2013/ICE-3095-CM), also co-financed by European Union FEDER funds.

References

[1] C. Li, H. Wu, S. Chen, X. Li and D. Guo. Efficient implementation for MD5-RC4 encryption using GPU with CUDA. In 2009 3rd International Conference on Anti-counterfeiting, Security, and Identification in Communication, IEEE, pp. 167-170, 2009.
[2] F. D. Garcia, D. Oswald, T. Kasper and P. Pavlidès. Lock it and still lose it: on the (in)security of automotive remote keyless entry systems. In 25th USENIX Security Symposium (USENIX Security 2016), The USENIX Association, pp. 929-944, 2016.
[3] G. Agosta, A. Barenghi and G. Pelosi. High speed cipher cracking: the case of Keeloq on CUDA. In 3rd Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures (PARMA 2012), Gesellschaft für Informatik e.V., pp. 1-7, 2012.
[4] I. Wiener. Philips/NXP Hitag2 PCF7936/46/47/52 stream cipher reference implementation. WayBackMachine web site, 2008.
[5] International Organization for Standardization. ISO/IEC 14882:2017. ISO web site, 2017.
[6] K. Scharfglass, D. Weng, J. White and C. Lupo. Breaking weak 1024-bit RSA keys with CUDA. In 2012 13th International Conference on Parallel and Distributed Computing, Applications and Technologies, pp. 207-212, 2012.
[7] N. T. Courtois, S. O'Neil and J. J. Quisquater. Practical algebraic attacks on the Hitag2 stream cipher. In Information Security: 12th International Conference (ISC 2009), P. Samarati, M. Young, F. Martinelli and C. A. Ardagna, eds. Springer Berlin Heidelberg, pp. 167-176, 2009.
[8] N. T. Courtois, S. O'Neil and J. J. Quisquater. Cube cryptanalysis of Hitag2 stream cipher. In International Conference on Cryptology and Network Security (CANS 2011), D. Lin, G. Tsudik and X. Wang, eds. Springer Berlin Heidelberg, pp. 15-25, 2011.
[9] NVIDIA Corp. Leader in AI computing, 2016.
[10] NVIDIA Corp. Programming Guide. NVIDIA web site, 2016.
[11] NVIDIA Corp. CUDA Legacy GPUs.
NVIDIA web site, 2018.
[12] NVIDIA Corp. GeForce GTX 950 Specifications. NVIDIA web site, 2017.
[13] NVIDIA Corp. GeForce GTX 1070 Specifications. NVIDIA web site, 2017.
[14] NVIDIA Corp. The world's most powerful PC GPU. NVIDIA web site, 2018.
[15] OpenMP. The OpenMP API specification for parallel programming, 2016.
[16] Oracle Corp. Go Java homepage, 2018.
[18] Oracle Corp. OpenJDK homepage, 2018.
[19] Oracle Corp. Understanding just-in-time compilation and optimization. Oracle web site, 2011.
[20] P. Stembera and M. Novotny. Breaking Hitag2 with reconfigurable hardware. In 14th Euromicro Conference on Digital System Design (DSD 2011), pp. 558-563, 2011.
[21] Q. Li, C. Zhong, K. Zhao, X. Mei and X. Chu. Implementation and analysis of AES encryption on GPU. In IEEE 14th International Conference on High Performance Computing and Communication and IEEE 9th International Conference on Embedded Software and Systems, pp. 843-848, 2012.
[22] R. Verdult, F. D. Garcia and J. Balasch. Gone in 360 seconds: hijacking with Hitag2. In 21st USENIX Security Symposium (USENIX Security 2012), pp. 237-252, 2012.
[23] R. Verdult. The (In)security of Proprietary Cryptography. PhD Thesis, Radboud University Nijmegen, The Netherlands, 2015.
[24] S. Mittal and J. S. Vetter. A survey of methods for analyzing and improving GPU energy efficiency. ACM Computing Surveys, 47, 1-23, 2014.
[25] T. Guneysu, T. Kasper, M. Novotny, C. Paar and A. Rupp. Cryptanalysis with COPACOBANA. IEEE Transactions on Computers, 57, 1498-1513, 2008.
[26] TechPowerUp. NVIDIA TITAN V. TechPowerUp web site, 2018.
[27] V. Gayoso Martínez, L. Hernández Encinas, A. Martín Muñoz, O. Martínez-Graullera and J. Villazón-Terrazas. A comparison of computer-based technologies suitable for cryptographic attacks. In International Joint Conference SOCO'16-CISIS'16-ICEUTE'16, pp. 621-630, 2016.

© The Author(s) 2018. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permission@oup.com. This article is published and distributed under the terms of the Oxford University Press Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model).

TI - Comparing low and medium cost computer-based technologies suitable for cryptographic attacks
JF - Logic Journal of the IGPL
DO - 10.1093/jigpal/jzy031
DA - 2019-03-25
UR - https://www.deepdyve.com/lp/oxford-university-press/comparing-low-and-medium-cost-computer-based-technologies-suitable-for-9XyoyiOAVr
SP - 177
VL - 27
IS - 2
DP - DeepDyve
ER -