Cipher Decipher

Frequency Analysis Explained: How to Break Any Substitution Cipher

Al-Kindi discovered frequency analysis in 9th-century Baghdad. The technique still breaks CTF substitution ciphers today. Here is how it works and how to apply it.

July 1, 2025 · 9 min read

Al-Kindi solved encrypted Arabic manuscripts in Baghdad around 850 CE using a technique that still breaks CTF challenges today. He documented it in Risalah fi Istikhraj al-Mu'amma (A Manuscript on Deciphering Cryptographic Messages), the earliest known text on frequency analysis and possibly the earliest known work on systematic cryptanalysis.

The technique does not require algebra, computing, or advanced mathematics. It requires counting. Paste any monoalphabetic substitution ciphertext into the Letter Frequency Analyzer and, for a reasonably long message, the distribution will point to the likely letter mapping (or, for a Caesar cipher, the shift) within seconds.

What Monoalphabetic Substitution Means

A monoalphabetic substitution cipher replaces each letter with exactly one other letter, consistently throughout the message. The Caesar cipher is the simplest case: every A becomes D, every B becomes E, with a fixed shift. A more complex substitution cipher uses a random permutation of the alphabet: A might become Q, B might become T, Z might become F — but each letter always maps to the same replacement.
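Both variants can be sketched in a few lines of Python. This is a minimal illustration, not the code behind any tool mentioned here; the helper names (caesar_key, random_key, encrypt) and the seed value are assumptions made for the example.

```python
import random
import string

ALPHABET = string.ascii_uppercase


def caesar_key(shift: int) -> dict[int, int]:
    """Translation table that moves every letter forward by `shift` places."""
    s = shift % 26
    return str.maketrans(ALPHABET, ALPHABET[s:] + ALPHABET[:s])


def random_key(seed: int) -> dict[int, int]:
    """Translation table built from a seeded random permutation of the alphabet."""
    letters = list(ALPHABET)
    random.Random(seed).shuffle(letters)
    return str.maketrans(ALPHABET, "".join(letters))


def encrypt(plaintext: str, key: dict[int, int]) -> str:
    """Apply the same letter-for-letter mapping everywhere; other characters pass through."""
    return plaintext.upper().translate(key)


print(encrypt("Attack at dawn", caesar_key(3)))   # DWWDFN DW GDZQ
print(encrypt("Attack at dawn", random_key(42)))  # depends on the shuffle
```

In both cases the key is a fixed table, so a given plaintext letter always produces the same ciphertext letter.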

That consistency is the fundamental weakness. Natural language is not random. In English text of any reasonable length, the letter E appears approximately 12.7% of the time. T appears around 9.1%, A around 8.2%, O around 7.5%. These proportions are stable across most written English, regardless of the topic.

When you apply a monoalphabetic substitution, you shift or permute the alphabet, but you do not change the frequency distribution; you just relabel it. In a sufficiently long message, the most common ciphertext letter most likely stands for E, the second most common for T, and so on. These are strong first guesses, not guarantees.
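The relabeling is easy to check by counting. The sketch below assumes a Caesar shift of 3 and a short sample sentence chosen for illustration; letter_counts is a hypothetical helper name.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase


def letter_counts(text: str) -> Counter:
    """Count only the letters A-Z, ignoring case, spaces, and punctuation."""
    return Counter(ch for ch in text.upper() if ch in ALPHABET)


plaintext = "THE MOST COMMON LETTER HERE IS E"
# Caesar shift of 3, written out as a translation table.
key = str.maketrans(ALPHABET, ALPHABET[3:] + ALPHABET[:3])
ciphertext = plaintext.translate(key)

plain_counts = letter_counts(plaintext)
cipher_counts = letter_counts(ciphertext)

# The labels differ, but the sorted list of counts is identical.
print(sorted(plain_counts.values(), reverse=True))
print(sorted(cipher_counts.values(), reverse=True))
print(plain_counts.most_common(1), cipher_counts.most_common(1))  # [('E', 6)] [('H', 6)]
```

The same check holds for any permutation key, because a substitution only renames the letters being counted.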

The Wikipedia letter frequency table and Peter Norvig's corpus analysis at norvig.com/mayzner.html both confirm this distribution across large English corpora.

How Frequency Analysis Works

Letter frequencies in English

The standard frequency order for English is approximately: E, T, A, O, I, N, S, H, R, D, L, C, U, M, W, F, G, Y, P, B, V, K, J, X, Q, Z. The mnemonic ETAOIN SHRDLU covers roughly the twelve most common letters; exact ranks vary slightly between corpora. This order was so well known in the typesetting industry that Linotype machines arranged their keys by frequency: ETAOIN and SHRDLU are the first two vertical columns on the left side of the keyboard.

Bigrams and trigrams

Single-letter frequency is a first-pass attack. Bigrams (two-letter pairs) and trigrams (three-letter sequences) sharpen the analysis. The most common English bigrams are TH, HE, IN, ER, AN, RE, ON, EN, AT. The most common trigrams are THE, AND, ING, HER, HAT, HIS, THA, ERE. If a ciphertext trigram appears far more often than any other in a long English message, it is very likely THE.
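Counting bigrams and trigrams works the same way as counting letters. The sketch below assumes that n-grams should not span word boundaries, which suits ciphertexts that keep their spaces; ngram_counts and the sample string are illustrative, not taken from any tool mentioned here.

```python
from collections import Counter
import re


def ngram_counts(text: str, n: int) -> Counter:
    """Count n-letter sequences inside each word (n-grams never span a space)."""
    counts = Counter()
    for word in re.findall(r"[A-Z]+", text.upper()):
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts


# Hypothetical Caesar-shifted sample, used only to exercise the function.
sample = "WKH PRVW FRPPRQ OHWWHU KHUH LV H. WKH SDWWHUQ RI OHWWHUV."
print(ngram_counts(sample, 2).most_common(3))
print(ngram_counts(sample, 3).most_common(3))
```

For ciphertexts with the spaces removed, the same loop can run over the whole letter stream instead of word by word.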

The attack procedure

1. Count the frequency of every letter in the ciphertext.
2. Rank them from most to least frequent.
3. Map the most frequent ciphertext letter to E, the second to T, and so on down the ETAOIN SHRDLU order.
4. Partially decrypt the ciphertext with these guesses.
5. Look for nearly complete words: a three-letter sequence where two letters are known often reveals the third.
6. Adjust mappings where partial words do not make sense.
7. Repeat until full plaintext is recovered.
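Steps 1 to 4 and the adjustment in step 6 can be sketched as follows. This is a rough illustration under stated assumptions: the function names are hypothetical, the frequency order is the approximate one listed earlier, and ties between equally frequent letters are broken arbitrarily.

```python
from collections import Counter
import string

# Approximate English frequency order, as listed earlier in this article.
ENGLISH_ORDER = "ETAOINSHRDLCUMWFGYPBVKJXQZ"


def initial_mapping(ciphertext: str) -> dict[str, str]:
    """Steps 1-3: count letters, rank them, and pair the ranks with ENGLISH_ORDER."""
    counts = Counter(ch for ch in ciphertext.upper() if ch in string.ascii_uppercase)
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, ENGLISH_ORDER))


def partial_decrypt(ciphertext: str, mapping: dict[str, str]) -> str:
    """Step 4: show guessed letters in lowercase and unmapped letters as '_'."""
    out = []
    for ch in ciphertext.upper():
        if ch in string.ascii_uppercase:
            out.append(mapping.get(ch, "_").lower())
        else:
            out.append(ch)
    return "".join(out)


def swap(mapping: dict[str, str], cipher_a: str, cipher_b: str) -> None:
    """Step 6: exchange the plaintext guesses of two ciphertext letters."""
    mapping[cipher_a], mapping[cipher_b] = mapping[cipher_b], mapping[cipher_a]


ciphertext = "WKH PRVW FRPPRQ OHWWHU KHUH LV H."
guess = initial_mapping(ciphertext)
print(partial_decrypt(ciphertext, guess))
```

On a text this short only the top few guesses are likely to be right, which is why steps 5 to 7 rely on word patterns and repeated swaps rather than on the raw ranking.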

The Substitution Cipher Helper lets you work through this interactively, swapping letter assignments as you go.

Worked Example

Consider this short ciphertext (115 letters, about 140 characters with spaces and punctuation):

WKHUH LV D WUDFH RI WKH RULJLQDO PHVVDJH KLGGHQ LQ WKH SDWWHUQ RI OHWWHUV. IUHTXHQFB DQDOBVLV ZLOO UHYHDO LW. WKH PRVW FRPPRQ OHWWHU KHUH LV H.

Step 1: count frequencies. H appears 21 times (highest), followed by W with 13. A, C, E, M, and N do not appear at all.

Step 2: map H → E (the most common English letter).

Step 3: look for single-letter words. D appears alone: D → A or D → I. Try D → A first. The lone H at the end is already assigned to E by Step 2.

Step 4: the word WKH appears three times. If H → E, then WKH is _ _ E. The most common English three-letter word ending in E is THE. Try W → T, K → H.

Step 5: the two-letter word LV appears twice. In LV D it reads _ _ A, and in KHUH LV H it reads HE_E _ _ E. IS fits both, giving IS A and HE_E IS E. Try L → I, V → S; HE_E then suggests HERE, so U → R.

At this point, substituting T, H, E, A, I, S reveals enough of the text that the remaining letters become obvious from word patterns. This is a Caesar cipher with a shift of 3 (the same A → D shift used in the Caesar example earlier), so every mapping found here moves three places along the alphabet. The frequency analysis method works equally well on random permutation ciphers, where brute force over 26! (about 4 × 10^26) possible keys is not practical.
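For a Caesar cipher, the single guess "most common letter = E" is enough to recover the shift. The sketch below applies that idea to the last two sentences of the ciphertext above; the function names are illustrative, and the guess can fail on texts where E is not the most frequent letter.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase


def guess_caesar_shift(ciphertext: str) -> int:
    """Assume the most frequent ciphertext letter stands for E and return the shift."""
    counts = Counter(ch for ch in ciphertext.upper() if ch in ALPHABET)
    top_letter = counts.most_common(1)[0][0]
    return (ord(top_letter) - ord("E")) % 26


def caesar_decrypt(ciphertext: str, shift: int) -> str:
    """Shift every letter back by `shift`; leave spaces and punctuation alone."""
    back = (-shift) % 26
    table = str.maketrans(ALPHABET, ALPHABET[back:] + ALPHABET[:back])
    return ciphertext.upper().translate(table)


tail = "IUHTXHQFB DQDOBVLV ZLOO UHYHDO LW. WKH PRVW FRPPRQ OHWWHU KHUH LV H."
shift = guess_caesar_shift(tail)
print(shift)                        # 3
print(caesar_decrypt(tail, shift))  # FREQUENCY ANALYSIS WILL REVEAL IT. THE MOST COMMON LETTER HERE IS E.
```

With a random permutation key there is no single shift to recover, so each letter has to be placed individually using the ranking and word-pattern steps.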

The Cryptogram Solver automates the statistical mapping, completing most cryptograms without manual iteration.

Practical Applications

CTF cryptogram challenges: The "Crypto" category on platforms like picoCTF and CryptoHack routinely includes monoalphabetic substitution puzzles. The expected solver chain is: paste ciphertext → frequency analysis → letter mapping → word pattern verification.

Newspaper cryptograms: Daily newspaper cryptograms and the American Cryptogram Association's Aristocrat puzzles are monoalphabetic substitutions. The solving technique is the same as the CTF approach, done mentally or with pattern matching.

Cipher type detection: Frequency analysis tells you more than just the plaintext; it also hints at what kind of cipher you are dealing with. The Index of Coincidence (IC) is the probability that two letters picked at random from the text are the same. If the IC is around 0.065 to 0.067 (English-like), the text is plaintext, a monoalphabetic substitution, or a transposition cipher, since all three leave the letter counts unchanged. If it is near 0.038 (random-like, about 1/26), it is likely a polyalphabetic cipher. The Letter Frequency Analyzer shows the full distribution chart.
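The Index of Coincidence can be computed directly from letter counts: sum n(n - 1) over the 26 letters and divide by N(N - 1), where n is each letter's count and N the total number of letters. A minimal sketch of that formula follows; the function name is illustrative.

```python
from collections import Counter
import string


def index_of_coincidence(text: str) -> float:
    """Probability that two letters drawn at random from the text are the same."""
    counts = Counter(ch for ch in text.upper() if ch in string.ascii_uppercase)
    total = sum(counts.values())
    if total < 2:
        raise ValueError("need at least two letters")
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))


english = "THE MOST COMMON LETTER HERE IS E"
shifted = "WKH PRVW FRPPRQ OHWWHU KHUH LV H"  # same text, Caesar shift of 3

# A substitution only relabels the counts, so the IC does not change.
print(index_of_coincidence(english) == index_of_coincidence(shifted))  # True
```

On samples this short the value itself is noisy; the thresholds quoted above only become meaningful on longer texts.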

Limitations

Frequency analysis fails on short ciphertexts. A 30-character sample does not produce a stable frequency distribution — the letter E may not appear at all, or may appear three times by chance. The technique becomes reliable at around 100 characters and highly reliable above 300.

Polyalphabetic ciphers — like the Vigenère cipher — defeat naive frequency analysis because the same plaintext letter maps to different ciphertext letters at different positions. The frequency distribution flattens toward uniform. Frequency analysis must be combined with the Kasiski examination and Index of Coincidence to work on polyalphabetic ciphertexts.
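A small Vigenère sketch shows the flattening effect. The key "KEY" and the function name are assumptions chosen for illustration; the point is only that one plaintext letter no longer has one ciphertext letter.

```python
import string

ALPHABET = string.ascii_uppercase


def vigenere_encrypt(plaintext: str, key: str) -> str:
    """Shift each letter by the matching key letter, cycling through the key."""
    out = []
    i = 0  # counts letters only, so spaces do not consume key letters
    for ch in plaintext.upper():
        if ch in ALPHABET:
            shift = ALPHABET.index(key[i % len(key)].upper())
            out.append(ALPHABET[(ALPHABET.index(ch) + shift) % 26])
            i += 1
        else:
            out.append(ch)
    return "".join(out)


# Six identical plaintext letters become three different ciphertext letters.
print(vigenere_encrypt("EEEEEE", "KEY"))  # OICOIC
```

The output repeats with the period of the key, which is the regularity that the Kasiski examination and the Index of Coincidence exploit to find the key length.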

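Modern block ciphers (AES in a secure mode such as CBC, CTR, or GCM, used with a fresh IV or nonce) produce ciphertext that is computationally indistinguishable from random, so frequency analysis reveals nothing. ECB mode is the exception to indistinguishability: identical plaintext blocks produce identical ciphertext blocks, although that leaks block-level patterns rather than letter frequencies.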

Frequently asked questions