#### How does basic substitution work?

Substitution ciphers are based on taking each character in the plain text and exchanging it for another character, according to some rule, to form the cipher text. In fact, substitution remains a basic building block of most encryption systems used today.

The simplest type of substitution system is called a monoalphabetic cipher, in which a single mapping table is used both to encode and to decode. For example, A becomes G, B becomes X, and so on.

Of all the monoalphabetic cipher systems, the most basic type is called the Caesar cipher, perhaps because Julius Caesar first used it to send secret messages. It is based on moving each letter in the plain text by a fixed amount to form the cipher text.

For example, the message CAN YOU READ THIS becomes JHU FVB YLHK AOPZ when each letter in the plain text is shifted seven places along the alphabet. Notice that T, the 20th letter, would become the 20+7=27th letter; the alphabet wraps around, so the 27th letter is the same as the 1st letter, A.
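The shift-and-wrap rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `caesar_shift` is made up for this example, and it assumes only the letters A to Z are shifted while spaces pass through unchanged.

```python
import string

ALPHABET = string.ascii_uppercase  # 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

def caesar_shift(text, shift):
    """Shift each letter A-Z by `shift` places, wrapping around past Z."""
    result = []
    for ch in text.upper():
        if ch in ALPHABET:
            # Position 0..25, moved by `shift` and wrapped with modulo 26
            result.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
        else:
            result.append(ch)  # spaces and punctuation are left as they are
    return "".join(result)

print(caesar_shift("CAN YOU READ THIS", 7))   # JHU FVB YLHK AOPZ
print(caesar_shift("JHU FVB YLHK AOPZ", -7))  # CAN YOU READ THIS
```

The modulo operation is what makes T (position 20) wrap around to A instead of running off the end of the alphabet.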

At first glance this seems like a secure system, until we consider that repeatedly shifting every letter of the cipher text by one place will eventually reveal the secret message.

| Shift | Text |
|---|---|
| Original cipher text | JHU FVB YLHK AOPZ |
| +1 | KIV GWC ZMIL BPQA |
| +2 | LJW HXD ANJM CQRB |
| +3 | MKX IYE BOKN DRSC |
| +4 | NLY JZF CPLO ESTD |
| +5 | OMZ KAG DQMP FTUE |
| +6 | PNA LBH ERNQ GUVF |
| +7 | QOB MCI FSOR HVWG |
| +8 | RPC NDJ GTPS IWXH |
| +9 | SQD OEK HUQT JXYI |
| +10 | TRE PFL IVRU KYZJ |
| +11 | USF QGM JWSV LZAK |
| +12 | VTG RHN KXTW MABL |
| +13 | WUH SIO LYUX NBCM |
| +14 | XVI TJP MZVY OCDN |
| +15 | YWJ UKQ NAWZ PDEO |
| +16 | ZXK VLR OBXA QEFP |
| +17 | AYL WMS PCYB RFGQ |
| +18 | BZM XNT QDZC SGHR |
| +19 | CAN YOU READ THIS (got it!) |
| +20 | DBO ZPV SFBE UIJT |
| +21 | ECP AQW TGCF VJKU |
| +22 | FDQ BRX UHDG WKLV |
| +23 | GER CSY VIEH XLMW |
| +24 | HFS DTZ WJFI YMNX |
| +25 | IGT EUA XKGJ ZNOY |
| +26 | JHU FVB YLHK AOPZ |

The secret message was discovered after 19 shifts, which is 26 − 7: shifting forward 19 places undoes the original shift of 7. It follows that no more than 25 attempts are ever needed to break the Caesar method, as 26 shifts return you to the original cipher text.
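The exhaustive search described above is easy to automate. The sketch below simply generates all 25 candidate texts and leaves it to the reader to spot the one that makes sense; the helper names `shift_letters` and `all_shifts` are illustrative only, and the input is assumed to be in upper case.

```python
def shift_letters(text, shift):
    """Shift each upper-case letter forward by `shift` places, wrapping past Z."""
    return "".join(
        chr((ord(ch) - ord("A") + shift) % 26 + ord("A")) if "A" <= ch <= "Z" else ch
        for ch in text
    )

def all_shifts(cipher_text):
    """Return a (shift, candidate text) pair for every shift from 1 to 25."""
    return [(n, shift_letters(cipher_text, n)) for n in range(1, 26)]

for n, candidate in all_shifts("JHU FVB YLHK AOPZ"):
    print(f"+{n:<2} {candidate}")
```

Among the 25 printed lines, the one labelled +19 reads CAN YOU READ THIS.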

An improvement on the Caesar method is to use a pseudo-random method of scrambling the alphabet, so that there exists no logical reason why A becomes G, B becomes X, et cetera. An encryption table would allow for easy encoding and it would seem that the only way to decode the message would be using the same table. Unlike the Caesar method, there are 26×25×24×...×2×1 ≈ 4×10²⁶ different random substitution tables that can be used. However, even this system has weaknesses...
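As a rough sketch of such a scrambled table, the Python below shuffles the alphabet, encodes with the resulting table and decodes with its inverse. The names `make_table` and `substitute` are invented for this example, and Python's `random` module is used purely for illustration; it is pseudo-random and not suitable for real secrecy.

```python
import math
import random
import string

def make_table(seed=None):
    """Build a random one-to-one mapping from A-Z onto a shuffled alphabet."""
    letters = list(string.ascii_uppercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def substitute(text, table):
    """Replace each character found in the table; leave everything else alone."""
    return "".join(table.get(ch, ch) for ch in text)

encode_table = make_table(seed=1)  # the seed only makes the example repeatable
decode_table = {cipher: plain for plain, cipher in encode_table.items()}

cipher_text = substitute("CAN YOU READ THIS", encode_table)
print(cipher_text)
print(substitute(cipher_text, decode_table))  # CAN YOU READ THIS

# Number of possible tables: 26! = 403291461126605635584000000
print(math.factorial(26))
```

Decoding is just the same lookup with the table reversed, which is why anyone holding the table can read the message.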

A very short cipher text such as NRQ cannot be decoded by analysis alone, as it could stand for any 3-letter English word made of three different letters (assuming the plain text was in English). But human insight and knowledge of the context can crack even this type of system. For example, if we knew that the cipher text was a reply to the question 'Do you have the documents?' sent earlier, it would be sensible to assume the reply was YES, although we could never be certain.

Surprisingly, by applying logic we can break most substitution codes, as long as we have a sufficiently long piece of cipher text. The longer the cipher text, the more clues we obtain. The following guidelines are used by code breakers in cracking substitution codes:

• Look for the letters that appear most often, as E and T together account for around 20% of the letters in English text. Every code breaker should memorise ETOANIRSH, as this represents the order of the most commonly used letters in English.
• Use the high frequency of E and T to locate words like THE and ?ET; THE is the most common 3-letter word.
• Identify 1-letter words, as they are almost certainly going to be A or I.
• Many of the most commonly used 2-letter words start with I, as in IF, IN, IS and IT; in fact, IN, IT and IS are among the five most commonly used 2-letter words. This is another clue as to which single letter is I and which is A.
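The counting behind these guidelines can be sketched as follows. The functions `letter_frequencies` and `short_words` are hypothetical helpers written for this illustration; they only gather the statistics, and matching them against ETOANIRSH or common short words is still left to the code breaker.

```python
from collections import Counter

def letter_frequencies(cipher_text):
    """Return (letter, count) pairs for A-Z, most frequent first."""
    letters = [ch for ch in cipher_text.upper() if "A" <= ch <= "Z"]
    return Counter(letters).most_common()

def short_words(cipher_text, length):
    """Count words of the given length; only useful if spaces were kept."""
    words = [w for w in cipher_text.upper().split()
             if len(w) == length and w.isalpha()]
    return Counter(words).most_common()

sample = "JHU FVB YLHK AOPZ"  # far too short for real analysis, shown for shape only
print(letter_frequencies(sample)[:3])
print(short_words(sample, 3))
```

On a cipher text of a few hundred letters, the top entries of `letter_frequencies` are the natural candidates for E and T, and the top 3-letter word is a natural candidate for THE.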

Of course, the last two points depend on spaces being left in the cipher text; this is called informal substitution. If spaces are removed, we call it formal substitution. Sometimes this is compounded by splitting the cipher string (with spaces removed) into groups of regular size, say five letters, to confuse the code breaker. For a more detailed account of the distribution of letter/word frequency in the English language, see Frequency lists.

In addition, a substitution system does not necessarily need to exchange the Latin alphabet, ABCD..., for Latin characters. You could use the Greek alphabet, numbers or even symbols/pictures. But ultimately the same principles mentioned above can be used to decode any type of monoalphabetic substitution code.
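The regrouping into blocks of five mentioned above can be illustrated with a short Python sketch. The function name `group_in_fives` is made up for this example.

```python
def group_in_fives(cipher_text, size=5):
    """Drop spaces and punctuation, then split the letters into fixed-size groups."""
    letters = "".join(ch for ch in cipher_text if ch.isalpha())
    return " ".join(letters[i:i + size] for i in range(0, len(letters), size))

print(group_in_fives("JHU FVB YLHK AOPZ"))  # JHUFV BYLHK AOPZ
```

The word boundaries are gone, so the clues from 1-letter and 2-letter words no longer apply, although the letter frequencies are unchanged.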

The use of polyalphabetic substitution allows for much more security.