Base41: A proposal for printable encoding of bit strings

INTRODUCTION

In the field of computer science there are many applications that need a representation of data in printable form, that is, encoded with a set of characters that can be printed and read by a human. Examples of such software and hardware are some electronic mail clients and servers or QR-code readers. To cope with these systems many frameworks have been developed to translate (i.e., encode and decode) data from binary to printable form and vice versa.

In the following we will introduce such an encoding system and compare it with existing ones.

In this context encoding means representing a number written as a numeral in a certain system with a numeral written in another system (e.g., the number of legs of a dog is represented with, among others, the numerals 4 (Arabic numerals), IV (Roman numerals), four (English language)). Among these systems, positional systems are particularly relevant. A positional system uses a base B (with $$ B > 1 $$, $$ B \in \mathbb{N} $$) having B symbols to represent numbers as a weighted sum of integer powers of B, that is, a number M is decomposed as $$ A_n B^n + A_{n-1} B^{n-1} + \dots + A_0 B^0 $$ and written as $$ A_n A_{n-1} A_{n-2} \dots A_0 $$.

In the following the term Base Y refers to a method that uses a base of Y symbols.

The objective of this article is to present an encoding format of binary data by means of printable symbol sequences using an alphabet of 41 symbols.

The proposal in this article recalls the Base45 encoding [1] and the Base41 proposal [2]. As in Reference 1, this proposal encodes two octets with three symbols; it also allows for encoding one octet or a shorter bit string in three symbols.

Unlike Reference 3, where the ten digits and the uppercase and lowercase letters of the English alphabet (totaling 62 symbols) are used in an encoding system, called UTF-62, for multilingual identifiers, the present proposal employs 41 letters only. The motivation for the use of 41 symbols is that 41 is the minimum base that may be used to encode 2 octets ($$ {2}^{16}=\mathrm{65,536} $$ configurations) with 3 symbols, that is, $$ {40}^3=\mathrm{64,000}<{2}^{16}<{41}^3=\mathrm{68,921} $$. Moreover, part of the excess configurations allows for the representation of bit strings with length up to 8 bits. The alphabet is composed of the 20 uppercase letters “ABCDFGHJKLMNQRSTUVXZ” and the 21 lowercase letters “abcdefhikmnopqrstuvxz” from the English alphabet, excluding “EIOPWYgjlwy”: this has the effect of using only URL-safe characters whose graphical representation does not give rise to ambiguities in the visual interpretation of the glyphs by a human. In fact, as done in the Base58 encoding [4] for some letters, the proposed Base41 encoding avoids the possible visual uncertainty in the printed string between:

- “Q” (capital q), “O” (capital o), and “0” (digit zero),
- “B” and “E”,
- “E” and “F”,
- “P” and “R”,
- “g” (lowercase G) and “q” (lowercase Q),
- “l” (lowercase L), “I” (capital i), and “1” (digit one),
- “i” (lowercase I) and “j” (lowercase J),
- “vv” and “w”, both lowercase and uppercase,
- “v” and “y”, both lowercase and uppercase.

The main differences with the Base45 encoding [1] are:

- a smaller set of symbols;
- the use of uppercase and lowercase letters from the English alphabet;
- the use of URL-safe characters only;
- a smaller number of unused encoded sequences (considered to be rejected in Reference 1);
- limiting the similar glyphs in the encoding alphabet;
- the mode of encoding of octet strings of odd octet length;
- the encoding of bit strings of any length (not necessarily an integer number of octets).

The presented Base41 encoding differs from the one in Reference 2 in the advantages of the alphabet chosen to represent the 41 symbols, which uses URL-safe characters only (that is, characters that have no special meaning or function in coding URLs and thus do not need to be escaped with the % sign), and in the possibility to uniformly encode bit strings of any length.

In addition to the cited works on the bases 41, 45, 58, and 62, it is important to note that in previous years many works have been published on the theme of encoding binary strings using a printable format.

Base 62 is also used in Reference 5 to build a printable encoding from a bit stream that is read 6 bits at a time. Care is taken to apply the proper padding when the input stream has a bit length that is not a multiple of 6.

The Base64 encoding [6] represents three octets with four symbols, leading to a compact representation that uses the lowercase and uppercase letters of the English alphabet, the ten digits and two special characters. The same document [6] defines two more encodings, one based on 16 symbols and the other using 32 symbols.

An encoding that uses 85 symbols (the ASCII characters from code 33 to code 117) to represent 4 octets with 5 characters has been proposed in Reference 7, where the encoding is called Ascii85.

A readable representation of IPv6 addresses is also obtained with a base of 85 symbols: the alphabet used and the encoding/decoding are defined in Reference 8.

Two works [9, 10] use base 91. Reference 9 represents blocks of 13 bits with pairs of signs from an alphabet of 91 characters that is a subset of printable ASCII symbols; some pairs are used to indicate how many bits to discard in the last block in case it has a length different from 13 bits.
Unlike Reference 9, which has some unused Base91 pairs, Reference 10 makes use of all the pairs to encode blocks of 13 bits: in some cases, blocks of 14 bits are encoded, saturating all the available $$ {91}^2 $$ configurations.

Base 36 is also frequently used in many programming languages (e.g., Python [11], PHP [12], JavaScript [13]) that have routines for its conversions: in general, the alphabet is composed of the 26 letters of the English alphabet and the 10 digits.

The structure of the article is the following: first, we introduce some notation used throughout the article. Section 2 presents the proposed Base41 alphabet. Section 3 illustrates the encoding and decoding procedures, detailing them with pseudo-code algorithms; Section 4 shows some encoding and decoding examples. In Section 5 some considerations about the protocol and security issues are discussed, and Section 6 presents some details regarding the implementation and provides experimental results. Finally, Section 7 draws some conclusions.

Notation

In the rest of the article the following variables will be used:

- C1, C2, C3 represent Base41 symbols; an instance of a Base41 symbol is written with a different font, like “x” or “z”;
- M is a number to be converted from binary to Base41 and vice versa;
- N1, N2 are nibbles, that is, sequences of 4 bits;
- O1, O2 are octets, that is, sequences of 8 bits;
- P1, P2 represent numbers of 5 bits;
- V1, V2, V3 represent numeric values of Base41 symbols ranging from 0 to 40.

BASE41 ALPHABET

The Base41 alphabet is composed of the following 41 letters (note that no digits are used):

ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz

Each letter is associated with a numerical value according to Table 1.
To convert a number to base 41 any method developed for such purpose may be used: for example, using successive divisions by 41 the obtained sequence of remainders will be made of numbers with values between 0 and 40; each value is used as an index in the sequence of letters of the proposed Base41 alphabet and the extracted letters are concatenated to get the Base41 representation of the number (in the present proposed alphabet).

TABLE 1 Proposed Base41 symbol table

Value  Base41 symbol   Value  Base41 symbol   Value  Base41 symbol   Value  Base41 symbol
0      A               10     M               20     a               30     n
1      B               11     N               21     b               31     o
2      C               12     Q               22     c               32     p
3      D               13     R               23     d               33     q
4      F               14     S               24     e               34     r
5      G               15     T               25     f               35     s
6      H               16     U               26     h               36     t
7      J               17     V               27     i               37     u
8      K               18     X               28     k               38     v
9      L               19     Z               29     m               39     x
-      -               -      -               -      -               40     z

Let us see an example: given the decimal numeral 2,357,293, to convert it to Base41 we start by dividing it by 41, obtaining 57,494 and remainder 39; the 39th letter (starting from 0) in the proposed Base41 alphabet is x. Then we divide 57,494 by 41, obtaining 1402 and remainder 12; the 12th letter in the proposed Base41 alphabet is Q. Continuing, dividing 1402 by 41 gives 34 and remainder 8, which is associated with the letter K. Dividing 34 by 41 results in 0 (stopping the iteration) with remainder 34, which is coded with r. Writing from left to right the letters obtained from last to first we have rKQx, which is the Base41 numeral for the decimal numeral 2,357,293.
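The successive-division method just described can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `to_base41` and `from_base41` are hypothetical, and the alphabet string is the one given in the text.

```python
# Proposed Base41 alphabet; the index of each letter is its numeric value.
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def to_base41(number: int) -> str:
    """Convert a non-negative integer to its Base41 numeral."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    symbols = []
    while True:
        # Each division by 41 yields the next least significant symbol.
        number, remainder = divmod(number, 41)
        symbols.append(ALPHABET[remainder])
        if number == 0:
            break
    # Remainders come out last-to-first, so reverse them.
    return "".join(reversed(symbols))


def from_base41(numeral: str) -> int:
    """Convert a Base41 numeral back to an integer (Horner's scheme)."""
    value = 0
    for symbol in numeral:
        value = value * 41 + ALPHABET.index(symbol)
    return value


print(to_base41(2357293))   # rKQx
print(from_base41("rKQx"))  # 2357293
```

The two calls at the end reproduce the worked example above in both directions.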
In case the numeral must be written with more symbols it may be left-padded with the Base41 symbol A, which has a corresponding value of 0: rKQx, ArKQx, AArKQx, AAArKQx, … all represent the same number.

BASE41 ENCODING AND DECODING

Binary data encoding may be performed according to the data size. Base41 encoding allows the representation of pairs of octets (16 bits), single octets (8 bits) and groups of 1 to 7 bits. All these kinds of data are encoded in 3 Base41 symbols: thus, for sequences of a large number of bits the size of the resulting data is approximately 1.5 times the size of the original binary data. A detailed analysis of this data expansion factor is reported in Appendix A.

A pair of octets O1, O2 is interpreted as a 16-bit number M = O1 * 256 + O2 (first octet most significant). Then, M is converted to base 41 and represented by the three Base41 symbols C1, C2, C3, that is, M = C1 * 41 * 41 + C2 * 41 + C3 (first Base41 symbol most significant). The minimum 16-bit value, 0, is represented in Base41 as AAA. The maximum 16-bit value, 65,535, is represented in Base41 as vzV. Algorithm 1 presents the encoding procedure pseudo-code for 2 octets (see flow chart in Figure B1).

Note that neither “x” nor “z” may occur as first symbol. These characters will be used as prefixes of sequences of three symbols that encode, respectively, bit strings and single octets.

A single octet O1 is considered as composed of two nibbles N1 and N2, N1 most significant, that is, O1 = N1 * 16 + N2. The decimal values of N1 and N2 are used as indexes in the Base41 table to obtain two symbols C1 and C2: these symbols are appended to the Base41 symbol “z”, producing a string of three characters z C1 C2 that is the Base41 representation of the single octet.
The minimum 8-bit value, 0, is represented in Base41 as zAA. The maximum 8-bit value, 255, is represented in Base41 as zTT. Algorithm 2 presents the encoding procedure pseudo-code for a single octet (see flow chart in Figure B2).

ALGORITHM 1 Encoding procedure pseudo-code for two octets O1, O2
begin
    M = O1 * 256 + O2
    Convert M to base 41: M = V1 * 41^2 + V2 * 41 + V3
    Map the values V1, V2, V3 to the symbols C1, C2, C3 using Table 1
    Output C1, C2, C3
end

ALGORITHM 2 Encoding procedure pseudo-code for one octet O1
begin
    Extract the two nibbles N1, N2 of O1, that is, O1 = N1 * 16 + N2
    Use Table 1 to convert the values N1 and N2 to base 41, obtaining symbols C1 and C2
    Output “z”, C1, C2
end

A bit string composed of 1 bit (b1), 2 bits (b1 b2), … or 7 bits (b1 b2 b3 b4 b5 b6 b7), leftmost bit most significant, is first represented with ten bits, the leading three bits encoding the length as in Table 2 (note the zero-valued bits for filling the ten bits).

TABLE 2 Representation of bit strings for the proposed Base41 encoding using ten bits: the first two columns encode five bits that will be represented by a decimal number P1, the last column encodes the remaining five bits represented by a decimal number P2

Length   Bit string (zero filled)
001      b1 0     0  0  0  0  0
010      b1 b2    0  0  0  0  0
011      b1 b2    b3 0  0  0  0
100      b1 b2    b3 b4 0  0  0
101      b1 b2    b3 b4 b5 0  0
110      b1 b2    b3 b4 b5 b6 0
111      b1 b2    b3 b4 b5 b6 b7

The decimal values P1 (representing the five bits in the first two columns of Table 2) and P2 (representing the five bits in the last column of Table 2) of the two groups of five bits are used as indexes in the Base41 table to obtain two symbols C1 and C2: these symbols are appended to the Base41 symbol “x” to obtain the Base41 representation of the bit string. The minimum-value single bit string, 0, is represented in Base41 as xFA. The maximum-value seven-bit string, 1111111, is represented in Base41 as xoo.
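The three encoding modes described so far (two octets, one octet, and a bit string of 1 to 7 bits) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' C code: the function names `encode_pair`, `encode_octet` and `encode_bits` are hypothetical, and short bit strings are passed as text made of the characters 0 and 1.

```python
# Proposed Base41 alphabet; the index of each letter is its numeric value.
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def encode_pair(o1: int, o2: int) -> str:
    """Two octets -> three symbols (no prefix), first octet most significant."""
    m = o1 * 256 + o2
    v1, rest = divmod(m, 41 * 41)
    v2, v3 = divmod(rest, 41)
    return ALPHABET[v1] + ALPHABET[v2] + ALPHABET[v3]


def encode_octet(o1: int) -> str:
    """One octet -> "z" followed by one symbol per nibble."""
    n1, n2 = divmod(o1, 16)
    return "z" + ALPHABET[n1] + ALPHABET[n2]


def encode_bits(bits: str) -> str:
    """Bit string of 1 to 7 bits -> "x" followed by two 5-bit symbols."""
    if not 1 <= len(bits) <= 7 or set(bits) - {"0", "1"}:
        raise ValueError("expected 1 to 7 characters, each '0' or '1'")
    # Three length bits, then the data bits, zero-filled to ten bits in total.
    ten = format(len(bits), "03b") + bits.ljust(7, "0")
    p1, p2 = int(ten[:5], 2), int(ten[5:], 2)
    return "x" + ALPHABET[p1] + ALPHABET[p2]


print(encode_pair(255, 255))   # vzV
print(encode_octet(255))       # zTT
print(encode_bits("0"))        # xFA
print(encode_bits("1111111"))  # xoo
```

The printed values match the minimum and maximum encodings quoted in the text.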
Note that not all the intermediate values are possible: for example, a bit length of 2 with the trailing five bits valued 1 cannot occur. Moreover, it is possible to extend this code to represent the empty string with a 0 bit length: the Base41 representation following the previous encoding mode will be xAA. Algorithm 3 presents the encoding procedure pseudo-code for a bit string of maximum length 7 bits (see flow chart in Figure B3).

Encoding of a bit stream should be performed by first considering it as a sequence of contiguous pairs of octets, each one coded separately, then encoding the trailing octet, if any, and then encoding the possibly remaining 1 to 7 bits. Nonetheless, an application may divide the input at its convenience and encode every part as a bit stream on its own.

ALGORITHM 3 Encoding procedure pseudo-code for bit string b1, b2, …, bn, with 0 < n < 8
begin
    Represent the bit string and its length according to Table 2
    Halve the ten bits into P1 and P2 and map their values to base 41 using Table 1, obtaining symbols C1 and C2
    Output “x”, C1, C2
end

Appendix B reports flow charts and examples of executions of these algorithms.

One side effect of this encoding is that by inspecting the first character in each group of three symbols it is possible to immediately infer which kind of data is encoded (two octets, one octet or a bit string).
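As a rough illustration of the stream-level procedure described in this section (pairs of octets first, then a possible trailing octet, then any remaining 1 to 7 bits), the following Python sketch encodes a whole bit stream. The function name `encode_stream` and the choice of passing the stream as text made of the characters 0 and 1 are assumptions made for this example; an application is free to split its input differently.

```python
# Proposed Base41 alphabet; the index of each letter is its numeric value.
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def encode_stream(bits: str) -> str:
    """Encode a text string of '0'/'1' characters into Base41 symbols."""
    if set(bits) - {"0", "1"}:
        raise ValueError("input must contain only '0' and '1'")
    out = []
    pos = 0
    # 1) Contiguous pairs of octets: 16 bits -> three symbols.
    while len(bits) - pos >= 16:
        m = int(bits[pos:pos + 16], 2)
        out.append(ALPHABET[m // 1681] + ALPHABET[m // 41 % 41] + ALPHABET[m % 41])
        pos += 16
    # 2) A trailing single octet, if present: prefix "z" and two nibbles.
    if len(bits) - pos >= 8:
        octet = int(bits[pos:pos + 8], 2)
        out.append("z" + ALPHABET[octet >> 4] + ALPHABET[octet & 15])
        pos += 8
    # 3) The remaining 1 to 7 bits, if any: prefix "x", length, zero filling.
    rest = bits[pos:]
    if rest:
        ten = format(len(rest), "03b") + rest.ljust(7, "0")
        out.append("x" + ALPHABET[int(ten[:5], 2)] + ALPHABET[int(ten[5:], 2)])
    return "".join(out)


# Five octets: 240, 90, 1, 128, 255
print(encode_stream("1111000001011010" "0000000110000000" "11111111"))  # tenALTzTT
```

With this splitting rule, at most one group starting with “z” and at most one starting with “x” appear, and only at the end of the output.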
In fact, decoding may be performed by splitting the stream of Base41 symbols into groups of 3 letters, then:

- if the first symbol of the group is “z” then it encodes a single octet; the values of the two nibbles of this octet are obtained from Table 1 using as indexes the second and third symbols of the group;
- if the first symbol of the group is “x” then it represents a bit string; two groups of five bits each are obtained from Table 1 using as indexes the second and third symbols of the group, then the resulting ten bits are decoded according to Table 2 to get the number of bits and their values to form the encoded bit string;
- otherwise, the three symbols encode two octets; first, the three symbols are transformed into numbers V1, V2, V3 according to Table 1 and then converted from base 41 as M = V1 * 41 * 41 + V2 * 41 + V3: if M < 65,536 then M is the representation of the two octets (M ≥ 65,536 is an encoding error).

Algorithm 4 shows the decoding procedure pseudo-code.
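The three decoding branches listed above can also be sketched in Python. This is only an illustrative sketch: the names `decode_group` and `decode_stream`, the use of text strings of 0 and 1 as output, and the decision to reject unknown symbols, inputs whose length is not a multiple of three, and non-zero filling bits are assumptions of this example (the proposal leaves several of these behaviors to the application).

```python
# Proposed Base41 alphabet and its inverse mapping (symbol -> value).
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"
VALUE = {symbol: value for value, symbol in enumerate(ALPHABET)}


def decode_group(group: str) -> str:
    """Decode one group of three Base41 symbols into a string of bits."""
    if len(group) != 3 or any(symbol not in VALUE for symbol in group):
        raise ValueError(f"invalid Base41 group: {group!r}")
    v1, v2, v3 = (VALUE[symbol] for symbol in group)
    if group[0] == "z":  # single octet: two nibbles
        if v2 > 15 or v3 > 15:
            raise ValueError("nibble value out of range")
        return format(v2 * 16 + v3, "08b")
    if group[0] == "x":  # short bit string: 3 length bits + data + filling
        if v2 > 31 or v3 > 31:
            raise ValueError("5-bit value out of range")
        ten = format(v2, "05b") + format(v3, "05b")
        length = int(ten[:3], 2)
        bits, filling = ten[3:3 + length], ten[3 + length:]
        if "1" in filling:
            raise ValueError("non-zero filling bits")  # assumed policy
        return bits
    m = v1 * 41 * 41 + v2 * 41 + v3  # two octets
    if m >= 65536:
        raise ValueError("value does not fit in 16 bits")
    return format(m, "016b")


def decode_stream(text: str) -> str:
    """Decode a sequence of three-symbol groups and concatenate the bits."""
    if len(text) % 3:
        raise ValueError("length must be a multiple of 3")  # assumed policy
    return "".join(decode_group(text[i:i + 3]) for i in range(0, len(text), 3))


print(decode_group("zTT"))  # 11111111
print(decode_group("xFA"))  # 0
print(decode_group("vzV"))  # 1111111111111111
```

Note that in this sketch the group xAA decodes to the empty string; whether a zero-length bit string should instead be rejected is a choice left to the application.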
Please note that an error is issued in case of trailing ones when decoding a bit string, but this is only the suggested behavior.

ALGORITHM 4 Decoding procedure pseudo-code of three symbols C1, C2, C3
begin
    if C1 = “z” then
        Use Table 1 to assign to N1, N2 the values corresponding to symbols C2, C3
        if N1 > 15 or N2 > 15 then Output error; return
        Output O1 = N1 * 16 + N2, Length = 8 bits
        return
    if C1 = “x” then
        Use Table 1 to assign to P1, P2 the values corresponding to symbols C2, C3
        if P1 > 31 or P2 > 31 then Output error; return
        Write P1 and P2 in 5 bits and concatenate them in P, P = P1 | P2
        Length = leftmost 3 bits of P
        Bitstring = extract Length bits from P starting from the 4-th bit
        if remaining bits in P are all 0 then
            Output Bitstring, Length
        else
            Output error
        return
    Use Table 1 to assign to V1, V2, V3 the values corresponding to symbols C1, C2, C3
    M = V1 * 41^2 + V2 * 41 + V3
    if M < 65,536 then
        Output O1 = M / 256 (integer division), O2 = M mod 256, Length = 16 bits
    else
        Output error
end

In Appendix C the reversibility of the encoding/decoding process is proven.

EXAMPLES

In this section some encoding and decoding examples are reported to show all the possibilities of binary data representation.

Encoding examples:

- consider the bit string 1111 0000 0101 1010  0000 0001 1000 0000  1111 1111 that written in decimal is 240, 90, 1, 128, 255; grouping in pairs leads to the values 61,530, 384, 255, which results in the Base41 encoding tenALTzTT
- consider the bit string 0000 0000 1000 0001  1111 0000 0101 1010  1111 that written in decimal is 0, 129, 240, 90, 15 (4 bits); grouping in pairs leads to the values 129, 61,530, 15 (4 bits), resulting in the Base41 encoding ADHtenxZe

Decoding examples:

- consider the Base41 sequence Dav ide Mar coB: it represents the numbers 5901, 46,354, 17,664, 38,254, which in binary result in 0001 0111 0000 1101  1011 0101 0001 0010  0100 0101 0000 0000  1001 0101 0110 1110
- the Base41 sequence xTA ide coB represents the bit string 110 followed by the numbers 46,354, 38,254 (in binary 1011 0101 0001 0010  1001 0101 0110 1110): the interpretation of a bit string prefixing four octets is left to the application (it may raise an error or consider the bit string as a whole).

PROTOCOL AND SECURITY CONSIDERATIONS

This document only defines an alphabet and an encoding mode. How a real binary string is encoded is left to the application defining the communication/representation protocol.
In particular, this document:

- does not specify if a sequence of octets must be encoded as pairs (except for the possibly single final octet) or as single octets, or as a mix of pairs and single octets;
- does not specify the behavior of the application if a zero-length bit string is found when decoding;
- does not specify the behavior of the application if one-valued bits are found in the filling of a decoded bit string, for example, 010 11 00110 (length 2, data bits 11, filling 00110 containing two one-valued bits);
- does not specify the behavior of the application if a symbol not contained in the alphabet is found in an encoded string: care must be taken to avoid security attacks.

Obviously, space efficiency suggests encoding data as pairs of octets, leaving, if necessary, a single octet and then a 1 to 7 bit string (if present) only as the final data to encode.

Implementations involving Base41 encoding must prevent attacks leveraging the symbol representation of the different kinds of data (single octets, octet pairs, and bit strings).

IMPLEMENTATION CONSIDERATIONS AND EXPERIMENTAL RESULTS

We built two C programs implementing a Base41 encoder/decoder (available from the authors upon request): these programs may be used as a starting point for other software or hardware implementations of Base41.

The first program uses a table that keeps the encodings of

- the 65,536 two-octet bit strings,
- the 256 single-octet strings,
- the 254 bit strings of 1 to 7 bits,
- the empty (0 bit) string.

The encoder may use this table sorted according to the input bit string; the decoder may sort this table according to the three-character Base41 encodings.

The second implementation does not keep any table and performs the divisions by 41 for encoding the two-octet bit strings.
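The table-based strategy described above can be illustrated with a small Python sketch. The authors' programs are written in C and are not reproduced here; the names `encode_unit`, `build_table`, `ENCODE_TABLE` and `DECODE_TABLE` are hypothetical, and dictionaries stand in for the sorted tables mentioned in the text.

```python
from itertools import product

# Proposed Base41 alphabet; the index of each letter is its numeric value.
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def encode_unit(bits: str) -> str:
    """Encode a bit string of length 16, 8, or 0 to 7 into three symbols."""
    if len(bits) == 16:
        m = int(bits, 2)
        return ALPHABET[m // 1681] + ALPHABET[m // 41 % 41] + ALPHABET[m % 41]
    if len(bits) == 8:
        octet = int(bits, 2)
        return "z" + ALPHABET[octet >> 4] + ALPHABET[octet & 15]
    if len(bits) <= 7:
        ten = format(len(bits), "03b") + bits.ljust(7, "0")
        return "x" + ALPHABET[int(ten[:5], 2)] + ALPHABET[int(ten[5:], 2)]
    raise ValueError("unsupported bit length")


def build_table() -> dict[str, str]:
    """Precompute the encoding of every input the table-based program stores."""
    table = {}
    for length in (16, 8, 7, 6, 5, 4, 3, 2, 1, 0):
        for combination in product("01", repeat=length):
            bits = "".join(combination)
            table[bits] = encode_unit(bits)
    return table


ENCODE_TABLE = build_table()
# Inverting the table gives the decoder's lookup structure.
DECODE_TABLE = {code: bits for bits, code in ENCODE_TABLE.items()}

print(len(ENCODE_TABLE))  # 66047
print(len(DECODE_TABLE))  # 66047
```

The table holds 65,536 + 256 + 254 + 1 = 66,047 entries, and the inverted table has the same size, which means that no two inputs share an encoding.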
In this case it is possible to reduce the complexity due to a division by applying the division by multiplication (DBM) algorithm presented in Reference 14: the divisor 41 has no critical dividends for 32 and 64 bit architectures, thus all divisions may be performed with multiplications and right shifts (for completeness, the J values for the DBM algorithm to be used in 32 and 64 bit architectures are 3,352,169,597 and 14,397,458,789,236,723,213 respectively; the result of the integer division of V by 41 is $$ \left(V\ J\right)\gg \left(s+5\right) $$, where $$ \gg $$ is the binary right shift operation and s is the number of bits of the architecture employed).

We ran the software on some files, obtaining almost the same performance for both kinds of programs: on an Intel® Core™ i7-1165G7 at 2.80 GHz encoding was performed at more than 80 Mo/s (megaoctets per second) and decoding at more than 120 Mo/s.

CONCLUSIONS

This article has presented a format for representing binary data using URL-safe printable symbol sequences by means of a specific alphabet of 41 symbols taken from the uppercase and lowercase letters of the English alphabet (Figure D1 in Appendix D reports the logo we made for the proposed method).

The number “41” is specific because it is the minimal number of symbols that allows two octets to be encoded in a sequence of three symbols.

The main characteristics and advantages of the proposed method with respect to the pertinent works on Base45 [1] and Base41 [2] are:

- use of a set of glyphs that does not leave ambiguities in interpretation from a human point of view;
- use of a minimum set of symbols (41) required for encoding pairs of octets;
- use of URL-safe characters;
- ability to represent bit strings of any length (not only an integer number of octets).

The presented encoding is suggested in contexts that require representing binary data of any length in printable form, for example, in the encoding of data to be represented in a QR code.

The prefix of each group of three output symbols immediately reveals the kind of data encoded (one or two octets, or a shorter bit string): in this way the decoding may be performed in a very simple and efficient manner. Moreover, the printable symbols are chosen from an alphabet of URL-safe characters whose representations also avoid confusion by humans in reading the encoding.

ACKNOWLEDGMENTS

This research has been supported by the Italian Ministero dell'Università e della Ricerca. The authors wish to thank the anonymous reviewers whose comments and observations helped in improving the clarity and quality of this work.

CONFLICT OF INTEREST

The author declares no potential conflict of interest.

DATA AVAILABILITY STATEMENT

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

DISCLAIMER

The information provided in this document is given as is. The authors of this publication cannot be considered liable for the consequences of the use of the information contained herein.

REFERENCES

1. Fältström P, Ljunggren F, van Gulik DW. The Base45 data encoding. RFC 9285, RFC Editor; 2022. Accessed November 18, 2022. doi:10.17487/RFC9285
2. Veljkovic S. Base41; 2014. Accessed November 18, 2022. https://github.com/sveljko/base41
3. Wu P-C. A Base62 transformation format of ISO 10646 for multilingual identifiers. Softw Pract Exp. 2001;31(12):1125-1130. doi:10.1002/spe.408
4. Nakamoto S, Sporny M. The Base58 encoding scheme. Internet Draft, 2021; IETF. Accessed November 18, 2022. https://datatracker.ietf.org/doc/html/draft-msporny-base58-03
5. He K, Xu X, Yue Q. A secure, lossless, and compressed Base62 encoding. Proceedings of the 2008 11th IEEE Singapore International Conference on Communication Systems; 19–21 November 2008:761-765; Guangzhou, China. doi:10.1109/ICCS.2008.4737287
6. Josefsson S. The Base16, Base32, and Base64 data encodings. RFC 4648, RFC Editor; 2006. Accessed November 18, 2022. doi:10.17487/RFC4648, https://rfc-editor.org/rfc/rfc4648.txt
7. Adobe Systems Incorporated. PostScript® Language Reference. 3rd ed. Addison-Wesley Publishing Company; 1999.
8. Elz R. A compact representation of IPv6 addresses. RFC 1924, RFC Editor, 1996. Accessed November 18, 2022. doi:10.17487/RFC1924, https://rfc-editor.org/rfc/rfc1924.txt
9. He D, Sun Y, Jia Z, et al. A proposal of substitute for Base85/64–Base91. Proceedings of the SUMMER 8th International Conference on Computing, Communications and Control Technologies: CCCT 2010; 2010; Orlando, FL.
10. Henke J. Base91 encoding; 2006. Accessed November 18, 2022. http://base91.sourceforge.net/
11. Python 3.11.0 documentation, int() class. Accessed November 21, 2022. https://docs.python.org/3/library/functions.html
12. PHP math functions, base_convert() function. Accessed November 21, 2022. https://www.php.net/manual/en/ref.math.php
13. JavaScript reference, number constructor, toString() method. Accessed November 21, 2022. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number
14. Cavagnino D, Werbrouck AE. Efficient algorithms for integer division by constants using multiplication. Comput J. 2008;51(4):470-480. doi:10.1093/comjnl/bxm082

APPENDIX A

The following discussion presents the computation of the data expansion factor as a function of the number of octets and of any trailing bits making up the original binary data.

Let us suppose that the original data is composed of m pairs of octets, $$ m\ge 0 $$, then a possible single octet (in case the data length is an odd number of octets) indicated by $$ q\in \left\{0,1\right\} $$, possibly followed by n bits, with $$ 0\le n\le 7 $$.

The number of bits in the original data is

$$ {t}_o=16\ m+8\ q+n. $$ (A1)

According to the proposed Base41 encoding the number of bits in the resulting bit string is

$$ {t}_r=24\ m+24\ q+24\ \phi (n) $$ (A2)

where

$$ \phi (n)=\left\{\begin{array}{cc}1& \mathrm{if}\ n>0 \\ 0& \mathrm{otherwise}\end{array}\right. $$ (A3)

The data expansion factor is defined as $$ \frac{t_r}{t_o} $$ and it is easy to see that

$$ \underset{m\to \infty }{\lim}\frac{t_r}{t_o}=1.5 $$ (A4)

that is, for a large number of octets the increase in size is 50% because the contribution of any trailing octet and bits is negligible.

Table A1 reports the values of the data expansion factor for several values of m, both when the trailing single octet is not present and when it is present. These values are computed for n = 1 because this is the worst case, which requires three octets to encode a single bit.

TABLE A1 Data expansion factor computed for different values of m and q; for n the worst case in increasing the data size (value 1) is assumed

m       q = 0   q = 1
0       24      5.33
1       2.82    2.88
10      1.64    1.70
50      1.53    1.54
1024    1.50    1.50

APPENDIX B

We provide flow charts of the encoding process for the three kinds of data, namely two octets, one octet, and a bit string. An example of the corresponding encoding is reported next to each flow chart (Figures B1-B3).

FIGURE B1 Flow chart and example of two octets encoding

FIGURE B2 Flow chart and example of one octet encoding

FIGURE B3 Flow chart and example of bit string encoding

APPENDIX C

The reversibility of encoding to and decoding from Base41 may be proved by examining three cases separately.

In the case of encoding two octets the process is reversible, like any other conversion between different bases (e.g., decimal and binary).
Note that the maximum possible value, 65,535, has neither x nor z as its most significant digit, and the same holds for every possible two-octet value.

For the same reason one-octet encoding/decoding is reversible, producing two Base41 symbols, each one associated with a nibble of the octet: prefixing the two Base41 symbols with z distinguishes this case from the two-octet one, making the process reversible.

Encoding of a bit string (of 1 up to 7 bits) starts by formatting it with a three-bit prefix specifying its length, then filling it with trailing zeroes to reach a length of 10 bits. Halving the obtained string into two parts of 5 bits each and encoding every part with a Base41 symbol (note that this is possible because 5 bits allow for a maximum of 32 configurations, which is smaller than 41) makes it possible to reversibly obtain the 5 bits from the Base41 symbol; prefixing these two Base41 symbols with x distinguishes this case from the previous two, leading to a reversible process.

Having shown the reversibility of the encoding/decoding process for the three cases, which are always distinguishable, proves the reversibility of the proposed Base41 conversion.

APPENDIX D

This appendix reports the logo for the proposed Base41 encoding/decoding (Figure D1).

FIGURE D1 Logo for the proposed Base41 encoding/decoding

Engineering Reports, Wiley

Engineering Reports, Volume 5 (5) – May 1, 2023


Publisher: Wiley
Copyright: © 2023 John Wiley & Sons, Ltd.
eISSN: 2577-8196
DOI: 10.1002/eng2.12606
Unlike Reference 9, which has some unused Base91 pairs, Reference 10 makes use of all the pairs to encode blocks of 13 bits: in some cases, blocks of 14 bits are encoded, saturating all the available $$ 91^2 $$ configurations.

Base 36 is also frequently used in many programming languages (e.g., Python [11], PHP [12], JavaScript [13]) that have routines for its conversions: in general, the alphabet is composed of the 26 letters of the English alphabet and the 10 digits.

The structure of the article is the following: first, we introduce some notation used throughout the article. Section 2 presents the proposed Base41 alphabet. Section 3 illustrates the encoding and decoding procedures, detailing them with pseudo-code algorithms; Section 4 shows some encoding and decoding examples. In Section 5 some considerations about the protocol and security issues are discussed, and Section 6 presents some details regarding the implementation and provides experimental results. Finally, Section 7 draws some conclusions.

Notation

In the rest of the article the following variables will be used:

- C1, C2, C3 represent Base41 symbols; an instantiation of a Base41 symbol is written with a different font like “x” or “z”;
- M is a number to be converted from binary to Base41 and vice versa;
- N1, N2 are nibbles, that is, sequences of 4 bits;
- O1, O2 are octets, that is, sequences of 8 bits;
- P1, P2 represent numbers of 5 bits;
- V1, V2, V3 represent numeric values of Base41 symbols ranging from 0 to 40.

BASE41 ALPHABET

The Base41 alphabet is composed of the following 41 letters (note that no digits are used):

ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz

Each letter is associated with a numerical value according to Table 1.
To convert a number to base 41 any method developed for such purpose may be used: for example, with successive divisions by 41 the sequence of remainders consists of values between 0 and 40; each value is used as an index into the sequence of letters of the proposed Base41 alphabet, and the extracted letters are concatenated to get the Base41 representation of the number (in the present proposed alphabet).

TABLE 1 Proposed Base41 symbol table

| Value | Base41 symbol | Value | Base41 symbol | Value | Base41 symbol | Value | Base41 symbol |
|---|---|---|---|---|---|---|---|
| 0 | A | 10 | M | 20 | a | 30 | n |
| 1 | B | 11 | N | 21 | b | 31 | o |
| 2 | C | 12 | Q | 22 | c | 32 | p |
| 3 | D | 13 | R | 23 | d | 33 | q |
| 4 | F | 14 | S | 24 | e | 34 | r |
| 5 | G | 15 | T | 25 | f | 35 | s |
| 6 | H | 16 | U | 26 | h | 36 | t |
| 7 | J | 17 | V | 27 | i | 37 | u |
| 8 | K | 18 | X | 28 | k | 38 | v |
| 9 | L | 19 | Z | 29 | m | 39 | x |
| - | - | - | - | - | - | 40 | z |

Let us see an example: to convert the decimal numeral 2,357,293 to Base41, we start by dividing it by 41, obtaining 57,494 with remainder 39; the letter with index 39 (counting from 0) in the proposed Base41 alphabet is x. Then we divide 57,494 by 41, obtaining 1402 with remainder 12; the letter with index 12 is Q. Dividing 1402 by 41 gives 34 with remainder 8, which is associated with the letter K. Dividing 34 by 41 results in 0 (stopping the iteration) with remainder 34, which is coded with r. Writing from left to right the letters obtained, from last to first, we have rKQx, which is the Base41 numeral for the decimal numeral 2,357,293.
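The successive-division procedure of this worked example can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `to_base41` and `from_base41` are hypothetical and not part of the proposal; the only element taken from the text is the alphabet order of Table 1.

```python
# Table 1: the position of each letter in this string is its numeric value.
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"


def to_base41(number: int) -> str:
    """Return the Base41 numeral of a non-negative integer (hypothetical helper)."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    symbols = []
    while True:
        number, remainder = divmod(number, 41)
        symbols.append(ALPHABET[remainder])  # least significant symbol first
        if number == 0:
            break
    # The remainders were produced from last to first, so reverse them.
    return "".join(reversed(symbols))


def from_base41(numeral: str) -> int:
    """Inverse conversion: weighted sum of powers of 41 (hypothetical helper)."""
    value = 0
    for symbol in numeral:
        value = value * 41 + ALPHABET.index(symbol)  # ValueError if not in alphabet
    return value


print(to_base41(2_357_293))   # rKQx
print(from_base41("rKQx"))    # 2357293
```

The remainders 39, 12, 8, 34 computed in the text appear here in the same order inside the loop, before the final reversal.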
If the numeral must be written with more symbols, it may be left padded with the Base41 symbol A, which has a corresponding value of 0: rKQx, ArKQx, AArKQx, AAArKQx, … all represent the same number.

BASE41 ENCODING AND DECODING

Binary data encoding is performed according to the data size. Base41 encoding allows the representation of pairs of octets (16 bits), single octets (8 bits) and groups of 1 to 7 bits. All these kinds of data are encoded in 3 Base41 symbols: thus, for sequences of a large number of bits the size of the resulting data is approximately 1.5 times the size of the original binary data. A detailed analysis of this data expansion factor is reported in Appendix A.

A pair of octets O1, O2 is interpreted as a 16-bit number M = O1 * 256 + O2 (first octet most significant). Then, M is converted to base 41 and represented by the three Base41 symbols C1, C2, C3 whose values V1, V2, V3 satisfy M = V1 * 41 * 41 + V2 * 41 + V3 (first Base41 symbol most significant). The minimum 16-bit value, 0, is represented in Base41 as AAA. The maximum 16-bit value, 65,535, is represented in Base41 as vzV. Algorithm 1 presents the encoding procedure pseudo-code for 2 octets (see flow chart in Figure B1).

Note that neither “x” nor “z” may occur as first symbol, because the largest possible value of the first symbol is 38 (“v”). These characters will be used as prefixes of sequences of three symbols that encode, respectively, bit strings and single octets.

A single octet O1 is considered as composed of two nibbles N1 and N2, N1 most significant, that is, O1 = N1 * 16 + N2. The decimal values of N1 and N2 are used as indexes in the Base41 table to obtain two symbols C1 and C2: these symbols are appended to the Base41 symbol “z”, producing the string of three characters z C1 C2 that is the Base41 representation of the single octet.
The minimum 8-bit value, 0, is represented in Base41 as zAA. The maximum 8-bit value, 255, is represented in Base41 as zTT. Algorithm 2 presents the encoding procedure pseudo-code for a single octet (see flow chart in Figure B2).

A bit string composed of 1 bit (b1), 2 bits (b1 b2), … or 7 bits (b1 b2 b3 b4 b5 b6 b7), leftmost bit most significant, is first represented with ten bits, the leading three bits encoding the length as in Table 2 (note the zero-valued bits that fill the ten bits).

ALGORITHM 1 Encoding procedure pseudo-code for two octets O1, O2

begin
    M = O1 * 256 + O2
    Convert M to base 41: M = V1 * $$ 41^2 $$ + V2 * 41 + V3
    Map the values V1, V2, V3 to the symbols C1, C2, C3 using Table 1
    Output C1, C2, C3
end

ALGORITHM 2 Encoding procedure pseudo-code for one octet O1

begin
    Extract the two nibbles N1, N2 of O1, that is, O1 = N1 * 16 + N2
    Use Table 1 to convert the values N1 and N2 to base 41, obtaining symbols C1 and C2
    Output “z”, C1, C2
end

TABLE 2 Representation of bit strings for the proposed Base41 encoding using ten bits: the first two columns encode five bits that will be represented by a decimal number P1, the last column encodes the remaining five bits represented by a decimal number P2

| Length | Bit string (zero filled), first two bits | Bit string (zero filled), last five bits |
|---|---|---|
| 001 | b1 0 | 0 0 0 0 0 |
| 010 | b1 b2 | 0 0 0 0 0 |
| 011 | b1 b2 | b3 0 0 0 0 |
| 100 | b1 b2 | b3 b4 0 0 0 |
| 101 | b1 b2 | b3 b4 b5 0 0 |
| 110 | b1 b2 | b3 b4 b5 b6 0 |
| 111 | b1 b2 | b3 b4 b5 b6 b7 |

The decimal values P1 (the first five bits of a row of Table 2) and P2 (the last five bits of the same row) are used as indexes in the Base41 table to obtain two symbols C1 and C2: these symbols are appended to the Base41 symbol “x” to obtain the Base41 representation of the bit string. The minimum value single bit string, 0, is represented in Base41 as xFA. The maximum value seven-bit string, 1111111, is represented in Base41 as xoo.
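The three kinds of three-symbol groups described so far can be sketched as three small Python functions. This is a minimal sketch under stated assumptions: the function names are hypothetical, and representing a bit string as a Python string of '0' and '1' characters is a choice made here for readability, not something the proposal prescribes.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"  # Table 1


def encode_pair(o1: int, o2: int) -> str:
    """Two octets -> three symbols, first octet most significant."""
    if not (0 <= o1 <= 255 and 0 <= o2 <= 255):
        raise ValueError("octets must be in the range 0..255")
    m = o1 * 256 + o2
    v1, rest = divmod(m, 41 * 41)
    v2, v3 = divmod(rest, 41)
    return ALPHABET[v1] + ALPHABET[v2] + ALPHABET[v3]


def encode_octet(o1: int) -> str:
    """One octet -> prefix "z" plus one symbol per nibble."""
    if not 0 <= o1 <= 255:
        raise ValueError("octet must be in the range 0..255")
    n1, n2 = divmod(o1, 16)
    return "z" + ALPHABET[n1] + ALPHABET[n2]


def encode_bits(bits: str) -> str:
    """Bit string of 1 to 7 bits -> prefix "x" plus two symbols."""
    if not 1 <= len(bits) <= 7 or set(bits) - {"0", "1"}:
        raise ValueError("expected 1 to 7 characters, each '0' or '1'")
    # Table 2 layout: 3 bits of length, then the bits, then zero filling.
    ten_bits = format(len(bits), "03b") + bits.ljust(7, "0")
    p1, p2 = int(ten_bits[:5], 2), int(ten_bits[5:], 2)
    return "x" + ALPHABET[p1] + ALPHABET[p2]


print(encode_pair(0, 0), encode_pair(255, 255))    # AAA vzV
print(encode_octet(0), encode_octet(255))          # zAA zTT
print(encode_bits("0"), encode_bits("1111111"))    # xFA xoo
```

The printed values are the minimum and maximum representations quoted in the text for each kind of data.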
Note that not all the intermediate values are possible: for example, a bit length of 2 with the five trailing bits all valued 1 cannot occur. Moreover, it is possible to extend this code to represent the empty string with a 0 bit length: following the previous encoding mode, its Base41 representation is xAA. Algorithm 3 presents the encoding procedure pseudo-code for a bit string of maximum length 7 bits (see flow chart in Figure B3).

Encoding of a bit stream should be performed by first considering it as a sequence of contiguous pairs of octets, each one coded separately, then encoding the possible trailing octet, and then encoding the possibly remaining 1 to 7 bits. Nonetheless, an application may divide the input at its convenience and encode every part as a bit stream on its own.

ALGORITHM 3 Encoding procedure pseudo-code for bit string b1, b2, …, bn, with $$ 0 < n < 8 $$

begin
    Represent the bit string and its length according to Table 2
    Split the ten bits into two halves P1 and P2 and map their values to base 41 using Table 1, obtaining symbols C1 and C2
    Output “x”, C1, C2
end

Appendix B reports flow charts and examples of executions of these algorithms.

One side effect of this encoding is that by inspecting the first character in each group of three symbols it is possible to immediately infer which kind of data is encoded (two octets, one octet or a bit string).
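The recommended splitting of a bit stream (pairs of octets first, then a possible single octet, then the remaining 1 to 7 bits) can be sketched as follows. The function names `encode_stream` and `group_kind` are hypothetical, and the input is assumed to be a Python string of '0' and '1' characters; an application is free to split its input differently, as noted above.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"  # Table 1


def encode_stream(bits: str) -> str:
    """Encode a string of '0'/'1' characters of any length (illustrative sketch)."""
    if set(bits) - {"0", "1"}:
        raise ValueError("expected only '0' and '1' characters")
    groups = []
    pos = 0
    while len(bits) - pos >= 16:          # contiguous pairs of octets
        m = int(bits[pos:pos + 16], 2)
        groups.append(ALPHABET[m // 1681] + ALPHABET[m // 41 % 41] + ALPHABET[m % 41])
        pos += 16
    if len(bits) - pos >= 8:              # possible trailing single octet
        octet = int(bits[pos:pos + 8], 2)
        groups.append("z" + ALPHABET[octet >> 4] + ALPHABET[octet & 15])
        pos += 8
    tail = bits[pos:]
    if tail:                              # remaining 1 to 7 bits
        ten_bits = format(len(tail), "03b") + tail.ljust(7, "0")
        groups.append("x" + ALPHABET[int(ten_bits[:5], 2)] + ALPHABET[int(ten_bits[5:], 2)])
    return "".join(groups)


def group_kind(group: str) -> str:
    """Infer the kind of data from the first symbol of a three-symbol group."""
    return {"z": "one octet", "x": "bit string"}.get(group[0], "two octets")


# The octets 240, 90, 1, 128, 255 (five octets, so the last one is encoded alone).
encoded = encode_stream("1111000001011010" "0000000110000000" "11111111")
print(encoded)                                                 # tenALTzTT
print([group_kind(encoded[i:i + 3]) for i in range(0, 9, 3)])  # two octets, two octets, one octet
```

The first call reproduces the first encoding example of the Examples section; only the first symbol of each group is needed to tell the three kinds of data apart.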
In fact, decoding may be performed by splitting the stream of Base41 symbols into groups of 3 letters, then:

- if the first symbol of the group is “z”, then the group encodes a single octet; the values of the two nibbles of this octet are obtained from Table 1 using as indexes the second and third symbols of the group;
- if the first symbol of the group is “x”, then the group represents a bit string; two groups of five bits each are obtained from Table 1 using as indexes the second and third symbols of the group, then the resulting ten bits are decoded according to Table 2 to get the number of bits and their values, which form the encoded bit string;
- otherwise, the three symbols encode two octets; first, the three symbols are transformed into the numbers V1, V2, V3 according to Table 1 and then converted from base 41 as M = V1 * 41 * 41 + V2 * 41 + V3: if $$ M < 65{,}536 $$ then M is the representation of the two octets ($$ M \ge 65{,}536 $$ is an encoding error).

Algorithm 4 shows the decoding procedure pseudo-code.
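The three decoding rules listed above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, decoded data is returned as a string of '0' and '1' characters, and the choices of raising `ValueError` on invalid input and of accepting the zero-length group xAA as an empty string are assumptions (the text leaves the latter to the application).

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"  # Table 1
VALUE = {symbol: value for value, symbol in enumerate(ALPHABET)}


def decode_group(group: str) -> str:
    """Decode three Base41 symbols into a string of '0'/'1' characters."""
    if len(group) != 3 or any(symbol not in VALUE for symbol in group):
        raise ValueError("expected three symbols of the Base41 alphabet")
    v1, v2, v3 = (VALUE[symbol] for symbol in group)
    if group[0] == "z":                      # single octet: two nibbles
        if v2 > 15 or v3 > 15:
            raise ValueError("nibble value out of range")
        return format(v2 * 16 + v3, "08b")
    if group[0] == "x":                      # bit string, length in the first 3 bits
        if v2 > 31 or v3 > 31:
            raise ValueError("five-bit value out of range")
        ten_bits = format(v2, "05b") + format(v3, "05b")
        length = int(ten_bits[:3], 2)
        if "1" in ten_bits[3 + length:]:     # assumption: reject non-zero filling
            raise ValueError("non-zero filling bits")
        return ten_bits[3:3 + length]        # empty string for xAA (assumption)
    m = v1 * 41 * 41 + v2 * 41 + v3          # two octets
    if m >= 65536:
        raise ValueError("value does not fit in two octets")
    return format(m, "016b")


def decode_stream(text: str) -> str:
    """Split the symbols into groups of three and decode each group."""
    if len(text) % 3:
        raise ValueError("length must be a multiple of three symbols")
    return "".join(decode_group(text[i:i + 3]) for i in range(0, len(text), 3))


print(decode_group("zTT"))   # 11111111
print(decode_group("vzV"))   # 1111111111111111
print(decode_group("xFA"))   # 0
```

Note that the two-octet branch recovers the octets as the high and low 8 bits of M, that is, M divided by 256 and M modulo 256.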
Please note that an error is issued in case of trailing ones when decoding a bit string, but this is only the suggested behavior.

ALGORITHM 4 Decoding procedure pseudo-code of three symbols C1, C2, C3

begin
    if C1 = “z” then
        Use Table 1 to assign to N1, N2 the values corresponding to symbols C2, C3
        if N1 > 15 or N2 > 15 then Output error; return
        Output O1 = N1 * 16 + N2, Length = 8 bits
        return
    if C1 = “x” then
        Use Table 1 to assign to P1, P2 the values corresponding to symbols C2, C3
        if P1 > 31 or P2 > 31 then Output error; return
        Write P1 and P2 in 5 bits and concatenate them in P, P = P1 | P2
        Length = leftmost 3 bits of P
        Bitstring = extract Length bits from P starting from the 4-th bit
        if remaining bits in P are all 0 then
            Output Bitstring, Length
        else
            Output error
        return
    Use Table 1 to assign to V1, V2, V3 the values corresponding to symbols C1, C2, C3
    M = V1 * $$ 41^2 $$ + V2 * 41 + V3
    if M < 65,536 then
        Output O1 = M / 256 (integer division), O2 = M mod 256, Length = 16 bits
    else
        Output error
end

In Appendix C the reversibility of the encoding/decoding process is proven.

EXAMPLES

In this section some encoding and decoding examples are reported to show all the possibilities of binary data representation.

Encoding examples:

- consider the bit string 1111 0000 0101 1010 0000 0001 1000 0000 1111 1111 that written in decimal is 240, 90, 1, 128, 255; grouping in pairs leads to the values 61,530, 384, 255, which result in the Base41 encoding tenALTzTT
- consider the bit string 0000 0000 1000 0001 1111 0000 0101 1010 1111 that written in decimal is 0, 129, 240, 90, 15 (4 bits); grouping in pairs leads to the values 129, 61,530, 15 (4 bits), resulting in the Base41 encoding ADHtenxZe

Decoding examples:

- consider the Base41 sequence Dav ide Mar coB: it represents the numbers 5901, 46,354, 17,664, 38,254 that in binary result in 0001 0111 0000 1101 1011 0101 0001 0010 0100 0101 0000 0000 1001 0101 0110 1110
- the Base41 sequence xTA ide coB represents the bit string 110 followed by the numbers 46,354, 38,254 (in binary 1011 0101 0001 0010 1001 0101 0110 1110): the interpretation of a bit string prefixing four octets is left to the application (it may raise an error or consider the bit string as a whole).

PROTOCOL AND SECURITY CONSIDERATIONS

This document only defines an alphabet and an encoding mode. How a real binary string is encoded is left to the application defining the communication/representation protocol.
In particular, this document:

- does not specify if a sequence of octets must be encoded as pairs (except for the possible single final octet), as single octets, or as a mix of pairs and single octets;
- does not specify the behavior of the application if a zero-length bit string is found when decoding;
- does not specify the behavior of the application if one-valued bits are found in the filling of a decoded bit string, for example, $$ 010\ 11\ 00\mathbf{11}0 $$;
- does not specify the behavior of the application if a symbol not contained in the alphabet is found in an encoded string: care must be taken to avoid security attacks.

Obviously, space efficiency suggests encoding data as pairs of octets, leaving, if necessary, a single octet and then a 1 to 7 bit string (if present) only as final data to encode.

Implementations involving Base41 encoding must prevent attacks leveraging the symbol representation of the different kinds of data (single octets, octet pairs, and bit strings).

IMPLEMENTATION CONSIDERATIONS AND EXPERIMENTAL RESULTS

We built two C programs of a Base41 encoder/decoder (available from the authors upon request): these functions may be used as a starting point for other software or hardware implementations of Base41.

The first program uses a table that keeps the encodings of

- the 65,536 two-octet bit strings,
- the 256 octet strings,
- the 254 bit strings of 1 to 7 bits,
- the empty (0 bit) string.

The encoder may use this table sorted according to the input bit string; the decoder may sort this table according to the three-character Base41 encodings.

The second implementation does not keep any table and performs the divisions by 41 for encoding the two-octet bit strings.
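Both variants can be illustrated with a short Python sketch, which is not the authors' C code. The first part builds a lookup table with the entries listed above, keyed here by the hypothetical pair (bit length, value). The second part checks the multiply-and-shift replacement for the division by 41 that the text discusses next, using the 32-bit constant J quoted there; the helper name `div41` is an assumption.

```python
ALPHABET = "ABCDFGHJKLMNQRSTUVXZabcdefhikmnopqrstuvxz"  # Table 1

# Part 1: table-based variant, keyed by (bit length, value).
table = {}
for m in range(65536):                      # two-octet bit strings
    table[(16, m)] = ALPHABET[m // 1681] + ALPHABET[m // 41 % 41] + ALPHABET[m % 41]
for octet in range(256):                    # single octets
    table[(8, octet)] = "z" + ALPHABET[octet >> 4] + ALPHABET[octet & 15]
for n in range(0, 8):                       # bit strings of 0 to 7 bits
    for v in range(2 ** n):
        ten_bits = (n << 7) | (v << (7 - n))   # length, bits, zero filling
        table[(n, v)] = "x" + ALPHABET[ten_bits >> 5] + ALPHABET[ten_bits & 31]

# The decoder can use the inverse mapping; equal sizes mean no code is reused.
inverse = {code: key for key, code in table.items()}
print(len(table), len(inverse))             # 66047 66047

# Part 2: table-free variant, division by 41 as multiplication and right shift.
J32 = 3_352_169_597                         # constant quoted in the text for s = 32


def div41(v: int) -> int:
    """Integer division by 41 as (v * J) >> (s + 5) with s = 32."""
    return (v * J32) >> (32 + 5)


# Python integers are unbounded; in C the product needs a 64-bit intermediate.
assert all(div41(v) == v // 41 for v in range(65536))
```

The table size is 65,536 + 256 + 254 + 1 = 66,047 entries, and the final assertion confirms the multiply-and-shift form for every 16-bit dividend, which is the range needed for encoding pairs of octets.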
In this second implementation it is possible to reduce the complexity due to a division by applying the division by multiplication (DBM) algorithm presented in Reference 14: the divisor 41 has no critical dividends for 32- and 64-bit architectures, thus all divisions may be performed with multiplications and right shifts (for completeness, the J values for the DBM algorithm to be used in 32- and 64-bit architectures are 3,352,169,597 and 14,397,458,789,236,723,213 respectively; the result of the integer division of V by 41 is $$ (V\, J) \gg (s+5) $$, where $$ \gg $$ is the binary right shift operation and s is the number of bits of the architecture employed).

We ran the software on some files, obtaining almost the same performance for both kinds of programs: on an Intel® Core™ i7-1165G7 at 2.80 GHz, encoding was performed at more than 80 Mo/s (megaoctets per second) and decoding at more than 120 Mo/s.

CONCLUSIONS

This article has presented a format for representing binary data using URL-safe printable symbol sequences by means of a specific alphabet of 41 symbols taken from the uppercase and lowercase letters of the English alphabet (Figure D1 in Appendix D reports the logo we made for the proposed method).

The number 41 is chosen because it is the minimal number of symbols that allows two octets to be encoded in a sequence of three symbols.

The main characteristics and advantages of the proposed method with respect to the pertinent works on Base45 [1] and base 41 [2] are:

- use of a set of glyphs that does not leave ambiguities in interpretation from a human point of view;
- use of a minimum set of symbols (41) required for encoding pairs of octets;
- use of URL-safe characters;
- ability to represent bit strings of any length (not only an integer number of octets).

The presented encoding is suggested in the contexts of representing binary data of any length in printable form, for example, in the encoding of data to be represented in a QR code.

The first symbol of every group of three output symbols immediately reveals the kind of data encoded (one or two octets, or a shorter bit string): in this way the decoding may be performed in a very simple and efficient manner. Moreover, the printable symbols are chosen from an alphabet of URL-safe characters whose representations also avoid confusion by humans in reading the encoding.

ACKNOWLEDGMENTS

This research has been supported by the Italian Ministero dell'Università e della Ricerca. The authors wish to thank the anonymous reviewers whose comments and observations helped in improving the clarity and quality of this work.

CONFLICT OF INTEREST

The author declares no potential conflict of interest.

DATA AVAILABILITY STATEMENT

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

DISCLAIMER

The information provided in this document is given as is. The authors of this publication cannot be considered liable for the consequences of the use of the information contained herein.

REFERENCES

1. Fältström P, Ljunggren F, van Gulik DW. The Base45 data encoding. RFC 9285, RFC Editor; 2022. Accessed November 18, 2022. doi:10.17487/RFC9285
2. Veljkovic S. Base41; 2014. Accessed November 18, 2022. https://github.com/sveljko/base41
3. Wu P-C. A Base62 transformation format of ISO 10646 for multilingual identifiers. Softw Pract Exp. 2001;31(12):1125-1130. doi:10.1002/spe.408
4. Nakamoto S, Sporny M. The Base58 encoding scheme. Internet Draft, 2021; IETF. Accessed November 18, 2022. https://datatracker.ietf.org/doc/html/draft-msporny-base58-03
5. He K, Xu X, Yue Q. A secure, lossless, and compressed Base62 encoding. Proceedings of the 2008 11th IEEE Singapore International Conference on Communication Systems; 19-21 November 2008:761-765; Guangzhou, China. doi:10.1109/ICCS.2008.4737287
6. Josefsson S. The Base16, Base32, and Base64 data encodings. RFC 4648, RFC Editor; 2006. Accessed November 18, 2022. doi:10.17487/RFC4648, https://rfc-editor.org/rfc/rfc4648.txt
7. Adobe Systems Incorporated. PostScript® Language Reference. 3rd ed. Addison-Wesley Publishing Company; 1999.
8. Elz R. A compact representation of IPv6 addresses. RFC 1924, RFC Editor; 1996. Accessed November 18, 2022. doi:10.17487/RFC1924, https://rfc-editor.org/rfc/rfc1924.txt
9. He D, Sun Y, Jia Z, et al. A proposal of substitute for Base85/64–Base91. Proceedings of the SUMMER 8th International Conference on Computing, Communications and Control Technologies: CCCT 2010; 2010; Orlando, FL.
10. Henke J. Base91 encoding; 2006. Accessed November 18, 2022. http://base91.sourceforge.net/
11. Python 3.11.0 documentation, int() class. Accessed November 21, 2022. https://docs.python.org/3/library/functions.html
12. PHP math functions, base_convert() function. Accessed November 21, 2022. https://www.php.net/manual/en/ref.math.php
13. JavaScript reference, number constructor, toString() method. Accessed November 21, 2022. https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Number
14. Cavagnino D, Werbrouck AE. Efficient algorithms for integer division by constants using multiplication. Comput J. 2008;51(4):470-480. doi:10.1093/comjnl/bxm082

APPENDIX A

The following discussion presents the computation of the data expansion factor as a function of the number of octets and possible trailing bits making up the original binary data.

Let us suppose that the original data is composed of m pairs of octets, $$ m \ge 0 $$, then a possible single octet (in case the data length is an odd number of octets) indicated by $$ q \in \{0, 1\} $$, possibly followed by n bits, with $$ 0 \le n \le 7 $$.

The number of bits in the original data is

$$ t_o = 16\, m + 8\, q + n. $$ (A1)

According to the proposed Base41 encoding the number of bits in the resulting bit string is

$$ t_r = 24\, m + 24\, q + 24\, \phi(n) $$ (A2)

where

$$ \phi(n) = \begin{cases} 1 & \text{if } n > 0 \\ 0 & \text{otherwise} \end{cases} $$ (A3)

The data expansion factor is defined as $$ \frac{t_r}{t_o} $$ and it is easy to see that

$$ \lim_{m \to \infty} \frac{t_r}{t_o} = 1.5 $$ (A4)

that is, for a large number of octets the increase in size is 50% because the possible trailing octet and bits are negligible.

Table A1 reports the values of the data expansion factor for several values of m when the trailing single octet is not present and when it is present. These values are computed for n = 1 because this is the worst case, which requires three octets to encode a single bit.

TABLE A1 Data expansion factor computed for different values of m and q; for n the worst case in increasing the data size (value 1) is assumed

| m | q = 0 | q = 1 |
|---|---|---|
| 0 | 24 | 5.33 |
| 1 | 2.82 | 2.88 |
| 10 | 1.64 | 1.70 |
| 50 | 1.53 | 1.54 |
| 1024 | 1.50 | 1.50 |

APPENDIX B

We provide flow charts of the encoding process for the three kinds of data, namely two octets, one octet, and a bit string. Next to every flow chart an example of the corresponding coding is reported (Figures B1-B3).

FIGURE B1 Flow chart and example of two octets encoding

FIGURE B2 Flow chart and example of one octet encoding

FIGURE B3 Flow chart and example of bit string encoding

APPENDIX C

The reversibility of encoding to and decoding from Base41 may be proved by examining three cases separately.

In case of encoding two octets the process is reversible, as is any other conversion between different bases (e.g., decimal and binary).
Note that the maximum possible value, 65,535, does not have x or z as its most significant digit, and neither does any other two-octet value.

For the same reason one-octet encoding/decoding is reversible: it produces two Base41 symbols, each one associated with a nibble of the octet, and prefixing these two symbols with z distinguishes this case from the two-octet one, making the process reversible.

Encoding of a bit string (of 1 up to 7 bits) starts by formatting it with a three-bit prefix specifying its length, then filling it with trailing zeroes to reach a length of 10 bits. Splitting the obtained string into two parts of 5 bits each and encoding every part with a Base41 symbol (note that this is possible because 5 bits allow for a maximum of 32 configurations, which is smaller than 41) makes it possible to reversibly obtain the 5 bits from the Base41 symbol; prefixing these two Base41 symbols with x distinguishes this case from the previous two, leading to a reversible process.

Having shown the reversibility of the encoding/decoding process for the three cases, which are always distinguishable, proves the reversibility of the proposed Base41 conversion.

APPENDIX D

This appendix reports the logo for the proposed Base41 encoding/decoding (Figure D1).

FIGURE D1 Logo for the proposed Base41 encoding/decoding
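As a supplement to Appendix A, the entries of Table A1 can be reproduced directly from Equations (A1) to (A3). The sketch below is illustrative only; the function name `expansion_factor` is hypothetical, and the factor 24 reflects three output symbols of 8 bits each, as in Equation (A2).

```python
def expansion_factor(m: int, q: int, n: int) -> float:
    """Ratio t_r / t_o for m octet pairs, q in {0, 1} single octets, n trailing bits."""
    if m < 0 or q not in (0, 1) or not 0 <= n <= 7:
        raise ValueError("expected m >= 0, q in {0, 1}, 0 <= n <= 7")
    original_bits = 16 * m + 8 * q + n            # Equation (A1)
    if original_bits == 0:
        raise ValueError("the expansion factor is undefined for empty data")
    groups = m + q + (1 if n > 0 else 0)          # phi(n) from Equation (A3)
    encoded_bits = 24 * groups                    # Equation (A2)
    return encoded_bits / original_bits


# Worst case n = 1, as in Table A1: one row per m, columns q = 0 and q = 1.
for m in (0, 1, 10, 50, 1024):
    print(m, f"{expansion_factor(m, 0, 1):.2f}", f"{expansion_factor(m, 1, 1):.2f}")
```

For increasing m the two columns approach the limit 1.5 of Equation (A4).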

Journal

Engineering Reports, Wiley

Published: May 1, 2023

Keywords: alphabet definition; data coding; English alphabet; printable encoding
