Using one-way hash functions as the encryption method

Question

Suppose two parties want to communicate securely with each other (Bob and Alice) using a simple messaging system in English. There are approximately 180,000 currently used words in the English language.

The protocol

Bob creates a random private key of 512 bits in length e.g. 2fba5541fda58c643524cb629cb310674d029d7dd688e974f9c0d95299c228fa3531d06a29a69b6715ad4ec074d2bb50393fe5b4e7d2de5bc83b10ac7d3114ff
Bob also chooses a random hash function from a group of hash functions that all output 512 bits. These could be SHA-2, Keccak, Grøstl, BLAKE, JH, Whirlpool and Skein hash functions.
Bob gives the key and chosen hashing method to Alice (either in person or using some other key exchange method).
Using a program and the dictionary of 180,000 words, Bob and Alice compute HMAC(key, word) for each word in the dictionary using the chosen hash function. Each person calculates this on their machine.
The values are stored in a lookup table with columns (id, word, hmac) in an indexed database on each person's machine. With an HMAC generation time of 23 milliseconds per word on a single core (reference machine: Core i5 @2.67GHz), it would take 69 minutes to construct a database of values for 180,000 words. Splitting the work across all 4 cores with a multithreaded program could drop this to about 17 minutes to generate the full database on each machine. This is a one-time generation, so no big deal.
Bob writes a message to Alice.
For encryption, the program selects each plaintext word in the message, then looks up the corresponding HMAC hash value in the database then replaces the plaintext word with that hash value. A database lookup for each word and corresponding hash will be very fast in a modern, indexed database.
The hashes for each word are concatenated together so it is one full string of unintelligible text. This makes up the ciphertext.
A message authentication code is computed on the text using HMAC(key, ciphertext) and sent with the message e.g. ciphertext || MAC (concatenated).
Alice receives the encrypted message and takes the last 512 bits, which are the MAC. She then validates the MAC with her copy of the key and the hash algorithm by computing HMAC(key, ciphertext). A message with an incorrect MAC is discarded immediately.
Alice's program breaks up the message into the hashed words as it knows each word was hashed into a fixed length 512 bit hash.
Alice's program looks up each hashed word in her database's dictionary of words to retrieve the corresponding plaintext word. The program assembles the message back to plaintext for her. This database lookup for each word and corresponding hash should also be very fast in a modern, indexed database.
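
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: SHA-512 stands in for the secretly chosen hash function, in-memory dictionaries stand in for the indexed database, a five-word list stands in for the 180,000-word dictionary, and the function names are made up for the example.

```python
import hashlib
import hmac
import secrets

HASH = hashlib.sha512        # assumption: stand-in for the secretly chosen hash
BLOCK = HASH().digest_size   # 64 bytes = 512 bits

def build_tables(key, words):
    """Precompute word -> HMAC and HMAC -> word lookup tables (the 'database')."""
    enc = {w: hmac.new(key, w.encode(), HASH).digest() for w in words}
    dec = {tag: w for w, tag in enc.items()}
    return enc, dec

def encrypt(key, enc, message):
    """Replace each word with its HMAC, then append a MAC over the ciphertext."""
    body = b"".join(enc[w] for w in message.split())
    return body + hmac.new(key, body, HASH).digest()

def decrypt(key, dec, blob):
    """Verify the trailing MAC, then map each 64-byte block back to a word."""
    body, mac = blob[:-BLOCK], blob[-BLOCK:]
    expected = hmac.new(key, body, HASH).digest()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("bad MAC, message discarded")
    return " ".join(dec[body[i:i + BLOCK]] for i in range(0, len(body), BLOCK))

key = secrets.token_bytes(64)                      # 512-bit shared key
words = ["leave", "at", "midnight", "the", "two"]  # toy stand-in for the word list
enc_table, dec_table = build_tables(key, words)
blob = encrypt(key, enc_table, "leave at midnight")
print(decrypt(key, dec_table, blob))               # leave at midnight
```

A word that is not in the table raises a KeyError in this sketch, which mirrors the protocol's restriction to dictionary words.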

Security features:

The strength of the protocol relies on the difficulty in computing a pre-image of the HMAC hash of one of the hashed words to find the key. This would be 2⁵¹² for a standard computer using a brute force search or 2²⁵⁶ using Grover's algorithm.
There is also security in that the attacker does not know which hash function was used by either party. They will have to try brute forcing using all available 512 bit hash algorithms.
All hashed words compute to the same length output e.g. 512 bit. This means shorter length words are indistinguishable from longer ones.
If the hashed output and MAC are concatenated as a single string and sent, then the output is indistinguishable from a random string.

Potential disadvantages:

Frequency analysis of the output might help determine simple words in the ciphertext such as 'the' etc if that word is repeated and sent multiple times. This isn't necessarily a problem as it's only a simple word and doesn't convey much meaning to the message. You can also avoid that by not using simple words at all in the message.
The message doesn't account for numbers or punctuation. A solution would be to send simple messages such as 'LEAVE MIDNIGHT'; any further sentences could then be sent in another message. Numbers could also be expanded to their word equivalent e.g. '2' becomes 'TWO'.
Generating the dictionary of words/hashes can be time consuming. As mentioned, it may take only 17 minutes using a fast computer and a multithreaded program to create the dictionary on each computer. With a Core i7, or as computers keep getting faster, it would be quicker still. From there the same dictionary could be transferred to any other devices needing to communicate.
The encrypted output is much longer than the message itself. Not really a big deal considering today's networking and storage capabilities.
Assuming an attacker knew the protocol, they may be able to discern how many words were in the message based on the length of the ciphertext. This could be fixed by using padding words, and setting a fixed length message size.
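
The last two points can be made concrete with a little arithmetic. The sketch below assumes 64-byte (512-bit) blocks, one block per word plus one for the MAC; the filler word 'PAD' and the function names are hypothetical, chosen only for illustration.

```python
BLOCK = 64  # bytes per hashed word (512 bits)

def ciphertext_len(message):
    """One block per word, plus one block for the trailing MAC."""
    return (len(message.split()) + 1) * BLOCK

def word_count_from_len(n_bytes):
    """What an eavesdropper who knows the protocol can read off the length."""
    return n_bytes // BLOCK - 1

def pad_words(message, total, filler="PAD"):
    """Pad a message with filler words up to a fixed word count."""
    words = message.split()
    if len(words) > total:
        raise ValueError("message longer than the fixed size")
    return " ".join(words + [filler] * (total - len(words)))

msg = "LEAVE MIDNIGHT"
print(len(msg), ciphertext_len(msg))               # 14 192
print(word_count_from_len(ciphertext_len(msg)))    # 2
print(ciphertext_len(pad_words(msg, 10)))          # 704
```

With padding to a fixed ten words, every message is 704 bytes long regardless of its real word count.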

Feedback

Now I would appreciate some constructive feedback.

Are there any similar encryption systems currently in use?
What else could go wrong with the protocol?
What are some more downsides and mitigations?
What are some improvements that could be made?

Updates after feedback

Feedback on the scheme showed that it is vulnerable to chosen-plaintext attacks and to frequency analysis of past messages unless each word is only ever used once. That is not particularly usable, so I devised some changes.

A random nonce/IV is created for each message. This will be 512 bits and sent with the message. To encrypt, Bob computes HMAC(key, nonce || word) for each word in the message. || indicates concatenation. Encryption will be very fast and a lookup dictionary is no longer needed as it will change each message.

For decryption, Alice constructs a temporary dictionary using HMAC(key, nonce || word) for each word in the dictionary. Alice's program looks up each hashed word in her database's dictionary of words to retrieve the corresponding plaintext word. The program then reassembles the plaintext message.
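
A minimal sketch of the updated scheme, assuming SHA-512 as the hash function, a tiny word list in place of the full dictionary, and hypothetical function names. The message MAC from the original protocol is left out to keep the example short.

```python
import hashlib
import hmac
import secrets

BLOCK = 64  # 512-bit nonce and 512-bit HMAC output

def encrypt_words(key, message):
    """Send a fresh nonce followed by HMAC(key, nonce || word) for each word."""
    nonce = secrets.token_bytes(BLOCK)
    blocks = [hmac.new(key, nonce + w.encode(), hashlib.sha512).digest()
              for w in message.split()]
    return nonce + b"".join(blocks)

def decrypt_words(key, dictionary, blob):
    """Rebuild a temporary lookup table for this nonce, then look up each block."""
    nonce, body = blob[:BLOCK], blob[BLOCK:]
    table = {hmac.new(key, nonce + w.encode(), hashlib.sha512).digest(): w
             for w in dictionary}
    return " ".join(table[body[i:i + BLOCK]] for i in range(0, len(body), BLOCK))

key = secrets.token_bytes(64)
vocab = ["leave", "at", "midnight", "the"]   # toy stand-in for the dictionary
c1 = encrypt_words(key, "leave at midnight")
c2 = encrypt_words(key, "leave at midnight")
print(decrypt_words(key, vocab, c1))         # leave at midnight
print(c1 == c2)                              # False: a new nonce for each message
```

In this sketch a word repeated inside a single message still produces identical blocks, because the nonce changes per message and not per word.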

The only disadvantage is that the decryption of each message will be slower. This could be made faster by using more powerful hardware, e.g. faster, more modern CPUs or a multi-CPU/multi-core server system, and splitting up the work across the cores. The scheme still maintains security in that it is very difficult (2⁵¹²) for an attacker to find a pre-image and retrieve the original key, especially when they don't know the exact hash algorithm that is being used. This would be ideal for organisations with dedicated hardware for sending and receiving messages.

For modern PCs at the moment it may be possible to reduce the output size of the hash algorithms, e.g. to 256 bits, to decrease the decryption time for each message while still maintaining a good security margin. A quantum computer running Grover's algorithm would, however, reduce the pre-image search to 2¹²⁸, which is still a long time even for an attacker with supercomputers. I would not lower the hash size below that.

D.W. · Answer 1 · 2013-09-03T17:55:41.397

This is highly insecure, for the same reason that ECB mode and simple substitution ciphers are. Every time you use the word the in your message, it will be encrypted the same way. The same goes for other, lower-frequency (but still fairly common) words -- like as or with or will (or any of hundreds of other examples).

This is a humongous clue to cryptanalysts. For instance, if we know that one word is will, then it's likely that the next word is a verb. If we know that another word is because, then that gives us some information about the sentence structure. Word frequency analysis might reveal a lot of words, and then the pairing and location of known words might give us some clue about what is being said or about what nearby words might be -- and each clue that lets us deduce some additional information will give us additional clues that can be used to deduce still more, and so on.

And, of course, this scheme is completely insecure against known-plaintext attacks. If I know a few of the messages that were sent, I learn the decoding of all of the words in those messages, and that may help me recover parts of other messages. This vulnerability is considered unacceptable for a modern cryptographic scheme; similar sorts of vulnerabilities have been exploited in the past to break other cryptosystems in practice.
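
To illustrate the known-plaintext problem, here is a small sketch of the original word-by-word scheme (SHA-512 assumed, MAC omitted, names hypothetical). The attacker never touches the key: one known message yields a partial code book that decodes matching words in any other message.

```python
import hashlib
import hmac
import secrets

BLOCK = 64

def encrypt(key, message):
    """Original scheme: each word is replaced by HMAC(key, word)."""
    return b"".join(hmac.new(key, w.encode(), hashlib.sha512).digest()
                    for w in message.split())

def split_blocks(ct):
    return [ct[i:i + BLOCK] for i in range(0, len(ct), BLOCK)]

key = secrets.token_bytes(64)              # never revealed to the attacker

# The attacker learns one plaintext and its ciphertext...
known_plain = "we will leave at midnight"
known_ct = encrypt(key, known_plain)
codebook = dict(zip(split_blocks(known_ct), known_plain.split()))

# ...and uses the resulting code book on a different intercepted message.
target_ct = encrypt(key, "they will arrive at noon")
guess = " ".join(codebook.get(b, "?") for b in split_blocks(target_ct))
print(guess)                               # ? will ? at ?
```

Each further known message fills in more of the code book.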

A hint: You generally don't want to try to design your own scheme. Cryptographers have spent a lot of mental energy trying to design good schemes; it's unlikely that whatever you come up with will be better than the best that the entire field was able to do. Instead, stick to standard constructions -- in this case, use authenticated encryption, and save your mental energy for other pursuits.

Ilmari Karonen · Accepted Answer · 2013-09-03T20:21:09.543

Your scheme would make a nice puzzle for amateur codebreakers. That's about the best that can be said for it.

It does not meet the generally accepted standards for a modern encryption scheme; in particular, it is not semantically secure. In fact, the security of your scheme would be seriously compromised if an attacker obtained even a small amount of known plaintext, and, if used to encrypt sufficient amounts of text, it could even be vulnerable to ciphertext-only attacks using frequency and correlation analysis.

In particular, note that one of your claimed "security features" does not hold: if the plaintext contains any repeated words, the resulting ciphertext can be easily distinguished from a random string by the presence of repeated 512-bit blocks (which, in a truly random string, would be astronomically unlikely).
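
A quick way to see this, as a sketch assuming SHA-512 and leaving out the MAC: look for repeated 64-byte blocks. The helper name is made up for the example.

```python
import hashlib
import hmac
import os
import secrets
from collections import Counter

BLOCK = 64

def encrypt(key, message):
    """Original scheme: each word is replaced by HMAC(key, word)."""
    return b"".join(hmac.new(key, w.encode(), hashlib.sha512).digest()
                    for w in message.split())

def has_repeated_block(data):
    """True if any 64-byte block occurs more than once."""
    counts = Counter(data[i:i + BLOCK] for i in range(0, len(data), BLOCK))
    return max(counts.values(), default=0) > 1

key = secrets.token_bytes(64)
ct = encrypt(key, "the cat saw the dog")
print(has_repeated_block(ct))                    # True: 'the' appears twice
print(has_repeated_block(os.urandom(len(ct))))   # False, except with negligible probability
```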

That said, if you really want a secure word-by-word encryption scheme using HMAC, here's a variation of your scheme that would work:

Bob and Alice choose a 512-bit hash function. This choice can be public, so let's just pick SHA-512 and be done with it.
Bob and Alice somehow arrange to share a secret key K₁. The key can have any size, as long as it's long enough to resist brute force guessing attempts; 128 bits or more should be fine.

Bob and Alice also both modify K₁ in some agreed-upon way to obtain a second shared key K₂ ≠ K₁. The modification can be completely trivial, such as flipping the first bit in the key; we just need the two keys to be different; see below for the reason. (For technical reasons, however, appending a null byte or hashing K₁ to obtain K₂ should be avoided. Pretty much any other method is OK.)
To send a message to Alice, Bob splits the message into words (of no more than 64 bytes = 512 bits each).

Optionally, Bob also chooses a "nonce" word or phrase and appends it to the message. This nonce word could be random, or it could just be e.g. a message number. It's also a good idea for Bob to include a timestamp and a channel designator (e.g. "from Bob to Alice") in the message to guard against replay attacks; if unique, these can serve as the nonce.

First, Bob computes the HMAC of the entire message (as it would look when decrypted by Alice, including the nonce, if any) using the chosen hash function and the shared key K₁. This will form the first 64 bytes of the ciphertext.

To encrypt each word of the message, Bob computes the HMAC of the previous 64 bytes of the ciphertext using the key K₂, XORs the word to be encrypted (padded e.g. with nulls to 64 bytes) with the HMAC output, and appends the resulting 64-byte string to the ciphertext.
To decrypt the first word of the message, Alice takes the first 64 bytes of the ciphertext, computes the HMAC of them (using the key K₂) and XORs the result with the next 64 bytes of the ciphertext. Then she repeats the process for all subsequent pairs of 64-byte blocks to decrypt the rest of the message.

Finally, Alice computes the HMAC of the entire decrypted message using the key K₁ and compares it with the first 64 bytes of the ciphertext. If they match, she'll know that the message is from Bob (or someone else who knows the key) and hasn't been tampered with.
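
Here is one possible sketch of the steps above, not a vetted implementation. It makes several assumptions that the description leaves open: K₂ is derived by flipping the top bit of the first key byte, the HMAC is taken over the concatenation of the null-padded 64-byte words, words are byte strings that do not end in a null byte, and a message number is appended as the nonce word. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets

BLOCK = 64  # SHA-512 output size in bytes

def mac(key, data):
    return hmac.new(key, data, hashlib.sha512).digest()

def derive_k2(k1):
    """Assumption: K2 is K1 with the top bit of the first byte flipped."""
    return bytes([k1[0] ^ 0x80]) + k1[1:]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt(k1, words):
    if any(len(w) > BLOCK for w in words):
        raise ValueError("words are limited to 64 bytes")
    k2 = derive_k2(k1)
    padded = [w.ljust(BLOCK, b"\0") for w in words]
    ct = mac(k1, b"".join(padded))               # first 64 bytes: HMAC of the message
    for block in padded:
        ct += xor(block, mac(k2, ct[-BLOCK:]))   # keystream from the previous 64 bytes
    return ct

def decrypt(k1, ct):
    k2 = derive_k2(k1)
    blocks = [ct[i:i + BLOCK] for i in range(0, len(ct), BLOCK)]
    padded = [xor(cur, mac(k2, prev)) for prev, cur in zip(blocks, blocks[1:])]
    if not hmac.compare_digest(blocks[0], mac(k1, b"".join(padded))):
        raise ValueError("authentication failed")
    return [p.rstrip(b"\0") for p in padded]

k1 = secrets.token_bytes(32)
words = b"leave at midnight".split() + [b"msg-0001"]  # last word acts as the nonce
ct = encrypt(k1, words)
print(decrypt(k1, ct))
```

Because each keystream block depends on the preceding ciphertext block, and the first block is an HMAC of the whole message, two different messages produce unrelated ciphertexts even if they share words.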

This scheme has a number of advantages compared to yours:

It doesn't need a database.
It can handle arbitrary words (except for the 64 byte length limit, which shouldn't be an issue in practice), even if they're not found in any dictionary.
It can also handle punctuation, e.g. by treating it as a separate word, or by appending it to the preceding word.
If each message includes a unique nonce, it's IND-CCA2 secure, i.e. it meets the highest security requirements expected of a modern encryption scheme.
Even if nonces are not used (or are accidentally reused), IND-CCA2 security is only compromised to the extent that an attacker may learn if the same message is sent twice.

Ps. A careful reader may note that the requirement that Bob split the message into "words" in order to encrypt it is somewhat superfluous: it would be much easier for Bob to just split his message into 64-byte blocks.

This also saves Bob from having to worry about padding the blocks up to 64 bytes, except possibly for the last block. For the last block, instead of padding the plaintext, Bob can (if he doesn't mind divulging the exact length of the plaintext) just truncate the HMAC output to the length of the last plaintext block.
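
A sketch of this block-based variant, again assuming HMAC-SHA512. The two keys are passed in separately here and the function names are hypothetical. The final block is handled by truncating the HMAC output, so the ciphertext is exactly 64 bytes longer than the plaintext.

```python
import hashlib
import hmac
import secrets

BLOCK = 64

def mac(key, data):
    return hmac.new(key, data, hashlib.sha512).digest()

def xor(a, b):
    # zip stops at the shorter input, which truncates the keystream
    # to the length of a short final block
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt_bytes(k1, k2, msg):
    tag = mac(k1, msg)
    out, prev = [tag], tag
    for i in range(0, len(msg), BLOCK):
        prev = xor(msg[i:i + BLOCK], mac(k2, prev))
        out.append(prev)
    return b"".join(out)

def decrypt_bytes(k1, k2, ct):
    tag, body = ct[:BLOCK], ct[BLOCK:]
    prev, chunks = tag, []
    for i in range(0, len(body), BLOCK):
        cur = body[i:i + BLOCK]
        chunks.append(xor(cur, mac(k2, prev)))
        prev = cur
    msg = b"".join(chunks)
    if not hmac.compare_digest(tag, mac(k1, msg)):
        raise ValueError("authentication failed")
    return msg

k1, k2 = secrets.token_bytes(32), secrets.token_bytes(32)
msg = b"Leave at midnight, bring two lanterns. " * 4   # 156 bytes, not a multiple of 64
ct = encrypt_bytes(k1, k2, msg)
print(len(ct) - len(msg))                               # 64
print(decrypt_bytes(k1, k2, ct) == msg)                 # True
```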

With those modifications, the scheme I've described above is essentially a simplified version of SIV mode, except built on HMAC and CFB mode instead of CMAC and CTR mode, and using HMAC also as a replacement for the block cipher for encryption. All these modifications should be safe, although they rely heavily on the security properties of HMAC (in particular, on HMAC-SHA512 being a PRF). The nice thing about SIV mode is that it's "maximally misuse-resistant"; it's really hard to make its security fail catastrophically, no matter what you do.

Correction: The protocol as I originally suggested it would've been vulnerable to a serious chosen-plaintext attack if used without a nonce: if Eve could trick Alice or Bob to encrypt a message of her choice, she could submit any 64-byte block from an earlier ciphertext as the message in order to learn its HMAC, and thus learn the corresponding word in the earlier message. I've modified the protocol to use different HMAC keys for the authentication and encryption parts, which should plug this hole. The moral of the story being, be careful about key reuse!
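
The attack described in this correction can be reproduced with a short sketch. It assumes HMAC-SHA512, no nonce, and an HMAC taken over the concatenation of null-padded 64-byte words; the oracle and function names are hypothetical. The same code runs once with a single key for both roles and once with two independent keys.

```python
import hashlib
import hmac
import secrets

BLOCK = 64

def mac(key, data):
    return hmac.new(key, data, hashlib.sha512).digest()

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt(k_mac, k_enc, words):
    padded = [w.ljust(BLOCK, b"\0") for w in words]
    ct = mac(k_mac, b"".join(padded))
    for block in padded:
        ct += xor(block, mac(k_enc, ct[-BLOCK:]))
    return ct

def eve_recovers(oracle, victim_ct, i):
    """Try to recover word i (0-based) with one chosen-plaintext query."""
    prev = victim_ct[i * BLOCK:(i + 1) * BLOCK]        # block preceding word i
    cur = victim_ct[(i + 1) * BLOCK:(i + 2) * BLOCK]   # encrypted word i
    leaked = oracle([prev])[:BLOCK]                    # tag = HMAC(k_mac, prev)
    return xor(cur, leaked).rstrip(b"\0")

secret_words = [b"leave", b"at", b"midnight"]

k = secrets.token_bytes(32)                            # one key for both roles
victim = encrypt(k, k, secret_words)
print(eve_recovers(lambda w: encrypt(k, k, w), victim, 2))          # b'midnight'

ka, kb = secrets.token_bytes(32), secrets.token_bytes(32)           # separate keys
victim2 = encrypt(ka, kb, secret_words)
print(eve_recovers(lambda w: encrypt(ka, kb, w), victim2, 2) == b"midnight")  # False
```

With one key, the tag Eve gets back is exactly the keystream that was XORed onto the next block; with separate keys, the tag and the keystream come from unrelated HMAC instances.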

score 5 · Answer 3 · answered Sep 03 '13 at 12:10

"Frequency analysis of the output might help determine simple words in the ciphertext such as 'the' etc if that word is repeated and sent multiple times. This isn't necessarily a problem as it's only a simple word and doesn't convey much meaning to the message".

If the word "the" doesn't convey much meaning, then why have you used that particular word 59 times in your question? The answer is simple - because it is important and it does convey meaning. It's a basic linguistic structure. Saying it's not important is like a programmer saying that curly braces, semi-colons, and brackets are not important (depending on the syntax of the language) simply because they appear frequently.

Having to avoid the word "the" is highly suggestive of your protocol/system being weak.

score 4 · Answer 4 · answered Sep 03 '13 at 12:17

This is a type of code book security. Code books can be very strong or very weak depending on operational security.

If you never reuse a code book word even in a single message and the code words are genuinely random - this approach could work.

Of course, if you can't reuse code words and need perhaps 40 instances of THE and 30 instances of BE to avoid running out of them in a conversation, the database will be larger and less convenient than sharing a generous one-time pad when Alice and Bob first met.

Incidentally, the Windtalkers are somewhat different, as their language isn't Germanic and the required frequency analysis would be unorthodox. For "hash languages" I suggest choosing an uncommon language of non-Germanic, non-Romance origin that no one has compiled a frequency table for. Perhaps Tolkien's Elvish?

score 2 · Answer 5 · answered Sep 03 '13 at 10:09

This scheme is very insecure.

In my humble opinion, it is like a complicated "translate your message into an unknown language".

It looks like a hashed version of the Windtalkers (the Native American language used in WW2 to encrypt messages). Your version adds one more level: you have a few languages (hash functions) to choose from.