Search
Items tagged with: HMAC
Last year I wrote a grab-bag post titled, Don’t Forget To Brush Your Fur, because I’m terrible at SEO or making content easily discoverable.
In the same vein as that previous example, this is going to be in the style of Lightning Round talks at technology conferences.
Why are we doing this again?
I maintain a running list of things to write about, and cross ideas off whenever I cover a topic.
After a few months of doing this, I realize most of what remains is kinda interesting but not quite interesting enough to warrant a dedicated entry.
It always needs more isogenies.
(Art by Lynx vs Jackalope)
Contents
- Asymmetric Key Wear-Out
- HMAC Wear-Out?
- Asymmetric Commitments
- Against “Fluffies”
- A Meditation on Furries and Cringe
- Furries and Blue State Privilege
Asymmetric Key Wear-Out
Last year, I wrote about cryptographic wear-out for symmetric encryption. That post has attracted quite a bit of feedback from folks requesting comparisons against other block cipher modes, etc. One topic that I didn’t see requested much, but is equally interesting, is how this reasoning can be applied to asymmetric cryptography (if at all).
Let’s get one thing clear: Cryptography keys don’t “wear out” in the same sense as a physical key might. What we’re talking about is an ever-increasing risk of a collision occurring in random nonces.
ECDSA Key Wear-Out
ECDSA signatures involve a one-time secret, k. The scalar multiplication of k and the base point for the curve is encoded as half of the signature (r
), while its modular inverse is multiplied by the sum of the truncated message hash and the product of r
and the secret key to produce the other half of the signature (s
).
If your selection of k is biased, or k is ever reused for two different messages, you can leak the secret key.
Strictly speaking, for any given ECDSA curve, there is only one k
value that corresponds to a given r
for all users (n.b it’s not distinct per keypair).
This means that all users of e.g. ECDSA over NIST P-256 have to worry about a shared cryptographic wear-out: After 2^112 signatures, there is a 2^-32 chance of a single collision occurring.
Fortunately, the search space of possible k-values is enormous, and this will not impose a real-world operational risk in the near future. If you’re worried about multi-user attacks, P-384 gives you a wear-out threshold of 2^176 messages, which we’re probably never going to achieve.
RSA Key Wear-Out
In order to calculate the wear-out for an RSA message, you first have to begin with an attack model. Previously, we were looking at algorithms that would become brittle if a nonce was reused.
RSA doesn’t have nonces. You can’t attack RSA this way.
But let’s assume that such an attack did exist. What might the safety limit look like? There are two remaining possible considerations for RSA’s security against cryptographic wear-out: Key size and padding mode.
RSA private keys are two prime numbers (p, q). RSA public keys are the product of the two primes (n) and a public exponent (e) that must be coprime to (p-1)(q-1). (In practice, e is usually set to 3, 65537, or some other small prime.)
The security of RSA is subexponential to key size, based on the difficulty of integer factoring attacks and the requirement for p and q to be prime numbers.
This primeness restriction doesn’t apply to your message. The padding mode dictates your upper limit on message size; e.g., PKCS#1 v1.5 padding will take up at least 3 bytes:
- For encryption,
x = 0x00 || 0x02 || r || 0x00 || m
, wherer
is random padding bytes (minimum 8 bytes). - For signatures,
x = 0x00 || 0x01 || 0xFF..FF || 0x00 || m
. - In either case, the padding is always at least 11 bytes long.
So if you have 2048-bit RSA keys, you can encrypt or sign up to 245 bytes (1960 bits) with PKCS#1 v1.5 padding. This corresponds to a safety limit of 2^974 messages.
(Art by Lynx vs Jackalope)
HMAC Wear-Out?
To keep things simple, the security of HMAC can be reduced to the collision risk of the underlying hash function.
If you’re trying to estimate when to rotate symmetric keys used for HMAC, take the birthday bound of the underlying hash function as your starting point.
- For SHA-256, you have a 50% chance of a collision after 2^128 messages. For a 2^-32 chance, you can get 2^112 messages out of a single key.
- For SHA-384, this is 2^176 messages.
- For SHA-512, this is 2^240 messages.
In either case, however, these numbers might as well be infinity.
With apologies to Filippo Valsorda.
Asymmetric Commitments
Did you know that fast MACs such as GHASH and Poly1305 aren’t random-key robust? This property can matter in surprising ways.
Did you know that ECDSA and RSA don’t qualify for this property either? This is related to the topics of malleability and exclusive ownership. You can learn more about this in the CryptoGotchas page.
Essentially, if a signature scheme is malleable or fails to provide exclusive ownership, it’s possible to construct two arbitrary (m, pk) pairs that produce the same signature.
Any nonmalleable signature scheme with exclusive ownership (i.e. Ed25519 with low-order point rejection and canonical signature checks, as provided by the latest version of libsodium) provides sufficient commitment–mostly due to how it uses a collision-resistant cryptographic hash function. (It’s also worth noting: HashEdDSA doesn’t. Isn’t cryptography fun?)
Generally, if you need random-key robustness, you want to explicitly make it part of your design.
Against “Fluffies”
In my blog post about the neverending wheel of Furry Twitter discourse, I mentioned the controversy around SFW spaces for underage furries.
Everything I said in that post is still accurate (go read it if you haven’t), but I want to emphasize something that maybe some people overlooked.
https://twitter.com/SoatokDhole/status/1426638694786682884
Underage furries calling themselves “fluffies” is a bad idea, for two reasons.
Divide and Conquer
The first reason is tactical, and not specific to what they’re calling themselves: If you label yourselves separately from the larger furry community, you make it much easier to be targeted–especially by propaganda. There’s a severely disturbed alt-right fringe to the furry fandom (dubbed alt-furry, the Furry Raiders, and so many other names) that would love nothing more than to sink their claws into younger furs.
It’ll start innocently enough (“Yay, you have your own space!”), but it will quickly accelerate (“Congrats on kicking those degenerates to the curb!”) to horrible places (“All LGBTQIA+ people are degenerates”), gliding on the wings of edgy humor.
This descent into madness is also known as the PewDiePipeline and all parents of furries should be made aware of it, lest it happen to their child:
https://www.youtube.com/watch?v=pnmRYRRDbuw
It bears emphasizing: This existence of a PewDiePipeline within the “fluffy” space is not predicated on the intentions of the proponents. They can have all the best intentions in the world and it will still happen to their microcosm.
https://twitter.com/ARCADEGUTS/status/1425687280983937027
Preventing this from happening will require an almost inhuman degree of vigilance and dedication to correcting discourse from going sour. None of us are omniscient, so I wouldn’t take that bet.
Pre-Existing Terminology
The second reason the “fluffies” label is a bad idea is more specific to the word “fluffies” in particular: It already refers to a very disturbing meme on 4chan from not-very-many years ago: Fluffy Abuse Threads.
I’m intentionally not including any videos or images of this topic. There just aren’t enough content warnings for how gross this content is.
By calling yourselves “fluffies”, the most deranged 4chan-dwellers and/or Kiwi Farms lurkers on the Internet will begin associate you with the “fluffy abuse” memes, and may even act accordingly. In their twisted minds, they may even rationalize their conduct as if somehow you’re consenting to the abuse, by virtue of what you call yourselves.
Look, I get it: When you’re young, the over-sexualization of the media can be very uncomfortable, and it’s natural to want to avoid it. Additionally, it’s only human to want your own special club with a special name to hang out with your exclusive (n.b. same-age) community.
But please think carefully about what you’re doing, how you’re doing it, and which adults you decide to trust.
Also: maybe talk to older queers and/or furries about the history of the Furry Fandom, Pride, and kink before you make dangerous moves that make you more vulnerable to the worst humanity has to offer? Even if you don’t agree with us, we don’t want to see you get hurt.
There definitely is room in the furry fandom for people who are not comfortable with sexual content, or simply don’t want to be inundated with it all the damn time. It doesn’t need to be an exclusive thing or concept; instead, it should be normalized.
Ultimately, there’s probably a lot of work to be done to ensure kids and families have a safe and enjoyable furry con experience during daylight hours without repressing the queer and sexual identities of consenting adults at night. The best way to get from here to there is to talk, not to isolate.
Otherwise, we’ll keep seeing occurrences like this:
https://twitter.com/PrincelyKaden/status/1426192114694692866
The onus here is going to be largely on furry convention staff and chatroom moderators to actually listen to people reporting abusive behavior. They haven’t always been good about that, and it’s time for change.
https://twitter.com/MegaplexCon/status/1425966589241970693
A Meditation on Furries and Cringe
Every once in a while, I get a comment or email like this one:
https://twitter.com/SoatokDhole/status/1360835077899436033
The biggest magnet for poorly-reasoned hate comments is, surprisingly, my tear-down of the “sigma male” meme.
You’d think the exposure of TheDonald’s non-CloudFlare IP address would draw more ire than having correct opinions on masculinity, but here we are.
Art by Sophie
Let’s talk about masculinity for a moment, guys.
There is nothing manlier than being your authentic self. Even if that means liking some “girly” things. Even if that means being soft and vulnerable at times. Even if that means actually conforming to some stereotypes perpetuated by toxic masculinity when it coincides with your likes and interests. You do you.
But this isn’t just true of the male gender. Authenticity is the epitome of humanity. There’s nothing that stops women and enbies from being ruthlessly themselves.
You can’t be authentic when you’re participating in Cringe Culture, which blindly tears large swaths of people down to stoke the feelings of superiority in the people who evade its blast.
People are weird. I’m weird. I don’t expect everyone to like me, nor do I want them to. (Parasocial relationships suck!)
It’s okay to be a little obsessed about something other people look down on just because you happen to like it. Just make sure you’re not eschewing your adult responsibilities. (We all have bills to pay and promises to keep to the people that matter to us.)
If people don’t like you because you suddenly revealed your fondness for classic video games, rock-tumbling, or linear algebra? Fuck ’em. May the bridges you burn light the way to people who will appreciate you for who you truly are.
I’ve been told my blog is “weapons grade cringe” before, because I dared talk about encryption while having what, to most adults, comes across as little more than a cartoon brand or company mascot.
(Art by Lynx vs Jackalope)
Furries and Blue State Privilege
I sympathize with most queer people and/or furries for not wanting to subject themselves to the bigotry that runs rampant in Red States, but the ones who are jerks to other members of their community for living in those states, I can do without.
https://twitter.com/SarahcatFursuit/status/1413566747148435456
Being an asshole to someone because they live in, or are moving to, a state whose politics you dislike is equal parts stupid, selfish, and self-defeating:
- It’s stupid because there’s no reason for expressing prejudice or painting with broad brushes. For example: “Florida Furs are bad people” is an attack on the author of this blog.
- It’s selfish because not everyone who wants to leave these states has the resources or opportunity to do so, so all you’re doing is shining a spotlight on your own privilege. Way to show your entire ass to the community.
- It’s self-defeating because of the way the U.S. political system is architected:
If you wished for a genie to move every LGBTQIA+ person to the west coast of the United States, within a few years you’d essentially reduce support for LGBTQIA+ rights to approximately 6 out of 100 votes in the US Senate and 68 out of 435 in the House of Representatives.When you factor in who owns the land in the big tech cities (San Francisco, Seattle, etc.) and how much political and economic power they wield, it becomes very clear that your shaming of others for not boarding the bandwagon serves the interests of the worst of humanity: Landlords and venture capitalists.
Not a good move for people who claim to be progressive, and want to achieve progressive political outcomes nationwide.
The fact that some states have horrendous laws on the books, even worse bastards enforcing these laws, and somehow even more terrible politicians gatekeeping any meaningful progress from changing the system isn’t ever going to be improved from the outside.
I say all this, and I acknowledge Florida does suck in a lot of obvious ways: Our governor (Ron DeSantis) has a disposition that would actually be improved if he wore clown make-up to press appearances. We also have far too many furries that are anti-maskers, anti-vaxxers, or both.
https://twitter.com/SoatokDhole/status/1300911840000708608
But when furries go out of their way to shame someone, simply for living here? You’re not helping. Seriously stop and think about your priorities.
And maybe–just maybe–be surgically precise when you decide insults are warranted.
Now that I’ve flushed the blog post topic buffer, I’m fresh out of ideas. Let me know some topics that interest you in my Telegram group so I don’t get bored and eventually write Buzzfeed-quality crap like this:
In hindsight, ideas like this are 90% of the reason Cringe Culture refuses to die.
https://soatok.blog/2021/08/16/lightning-round/
#asymmetricCryptography #ECDSA #Florida #furries #FurryFandom #HMAC #Politics #RSA #Society #wearOut
There are a lot of random topics I’ve wanted to write about since I started Dhole Moments, and for one reason or another, haven’t actually written about. I know from past experience with other projects that if you don’t occasionally do some housekeeping, your backlog eventually collapses under its own gravity and you can never escape from it.So, to prevent that, I’d like to periodically take some time to clean up some of those loose ends that collect over time.
Random-Access AEAD
AEAD stands for Authenticated Encryption with Associated Data. Typically, AEAD constructions involve a stream cipher (which may also be a block cipher in counter mode) and a message authentication code (which may also be an almost-universal hash function).AEAD modes are designed for one-shot APIs: Encrypt (then authenticate) all at once; (verify then) decrypt all at once. AES-GCM, ChaPoly, etc.
AEADs are less great at providing random access to the underlying plaintext. For example: If you’re encrypting a 240 GB file with AES-GCM, but you only need a 512 KB chunk at some arbitrary point in the file, you’re forced to choose between either:
- Authenticating the rest of the AES-GCM ciphertext, then decrypting only the relevant chunk. (Performance sucks.)
- Sacrificing integrity and decrypting the desired chunk with AES-CTR.
Being forced to choose between speed and security will almost certainly result in a loss of security. The incentives of software developers (especially with fly-by-night startup engineers) all-but-guarantee this outcome.
Consequently, there have been several implementations of streaming-friendly AEAD. The most famous of which is Phil Rogaway’s STREAM construction.
Source: Rogaway’s paper
The downside to STREAM is that it requires an additional T bytes (e.g. 16 for an 128-bit authentication tag) for each chunk of the plaintext.
A similar solution, as implemented in the AWS Encryption SDK, is to carefully separate plaintexts into equal-sized frames and have special rules governing IV/nonce selection. This lets you facilitate random access while still making the security of the whole system easy to reason about.
Can we do better than STREAM and message framing?
The most straightforward idea is to use a Merkle tree on the ciphertext with a stream cipher for extracting a distinct key for each leaf node. This can be applied to existing AEAD ciphertexts, out of band, to create a sort of deep authentication tag that can be used to authenticate any random subset of the message (provided you have the correct nonce/key).
However, I haven’t found the time to develop this idea into something that can be toyed with by myself and other researchers.
More Introductory Articles
Let’s face it:Art by Riley
I’ve previously suggested an alternative strategy for programmers to learn cryptography. I’d like to do more posts covering introductory material for the topics I’m familiar with, so anyone who wants to actually employ my proposed strategy can carry themselves across the finish line.
Dissecting Dog-Whistles
Random fact: My fursona is a dhole–also known as a whistling dog.Soatok is a dhole, not a fox. Art by Khia.
Coincidentally, I’m deeply fascinated by language, and planned to start a series analyzing dog-whistle language (especially the kind commonly used against queer subcultures).
However, the very nature of dog-whistle language provides a veneer of plausible deniability for the whistler’s intent, which makes it very difficult to address them in a meaningful way that doesn’t undermine your own credibility.
So, for the time being, this is on the back-burner.
Reader Questions
I’ve received quite a few questions via email and social media since I started this blog in April. The most obvious thing to do with these questions would be to periodically collate a bunch of them into a Questions and Answer style post.However! I have an open source projected called FAQ Off that is way more efficient at the Q&A format than a long-form blog post. If you’d like to see it in action, start here.
Art by Kyume
General Punditry
I make a lot of dumb jokes, typically involving puns and other wordplay. Most of these live in private Telegram conversations with other furries, but a few have leaked out onto Twitter over the years.Is automated vulnerability scanning a nessusity?— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) November 23, 2017
Nurse: "I suspect this patient attempted to shove a foreign object into their urethra for pleasure"
Doctor: "I believe your theory is sound"
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) June 25, 2018
A lot of them involve queer lingo.
People say it's lonely at the top.No wonder there's so many bottoms in this fandom 😛
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) December 30, 2019
BitTorrent users are thirsty bottoms. Always complaining about wanting more seed.
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) August 4, 2019
My RAID controller has big disk synergy
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) August 1, 2018
Some of them involve furry in-jokes.
Q) Why are foxes so prevalent in the furry fandom?A) We're a sub-culture not a dom-culture.
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) May 24, 2018
Intrusion detection systems are old hat. What we need is a protrusion detection system.
Introducing OwO
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) February 1, 2018
Some are just silly.
Using mined bitcoins to buy a pumpkin spice latte makes you an ASIC bitch, right?— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) February 6, 2018
So in gay male furry culture if you give into a booty call from your ex-boyfriend… does that mean you were craving the XD?
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) August 25, 2018
If SQL is pronounced "sequel" then PHP must be pronounced "fap".
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) July 23, 2019
What do you call a submissive dragon with a mathematics background who's already lubed up for you?
A sliding scale.
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) December 26, 2018
Did you hear about the clairvoyant babyfur that broke RSA?
Turns out, all you needed was a padding oracle.
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) October 1, 2017
I should look for my next partner in a nuclear chemistry lab.
I hear they're good at dating.
— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) December 8, 2016
In my humble opinion, there haven’t been nearly enough puns on this blog (unless the embedded tweets above count).
Normally, this is where I’d proclaim, “I shall rectify this mistake” and proceed to make an ass out of myself, but I don’t like forced and obvious puns.
A lot of furries get this wrong: “Pawesome” is not clever, unless you’re talking to someone with a marsupial fursona. Then maybe.
The best puns come in two forms: They’re either so clever that you never saw it coming, or they’re just clever enough that the punchline lands at the same time you realized a bad pun was even possible.
Only Soatok brand puns are 100% whole groan— Mastodon: soatok@furry.engineer, Cohost: soatok (@SoatokDhole) January 26, 2018
Miscellaneous / Meta
The past few blog posts touched a little on political subjects (especially How and Why America Was Hit So Hard By COVID-19, but this short-term trend actually started with my Pride Month post).At some point in the future, I may write a post dedicated to politics, but for the time being, it’s not really a subject I care enough about in and of itself to emphasize all the time.
Let me be clear: Being gay in America is inherently political. Developing technology is inherently political (although you don’t always realize it). Being a gay technologist, saying something politically significant is an inevitability.
But I’m not interested in the traditional roles and narratives that infect politics and political discourse. Labels are stupid and I’m not interested in being a Useful Idiot for anyone’s propaganda.
The most difficult thing about writing blog posts for me is coming up with a meaningful title. I’ve lost many hours due to the writer’s block that ensues.
The second most difficult thing for me is writing closing statements that aren’t totally redundant.
https://www.youtube.com/embed/l44OV2jlN7A?start=665&feature=oembed
George Carlin – “Count the Superfluous Redundant Pleonastic Tautologies” – Skip to 11:05 if WordPress breaks something
Some bloggers like to sign off like they’re writing an email. “Happy hacking!” and whatnot. To me, this feels forced and inauthentic, like a bad pun.So instead, here’s a totally sick piece of art I got from @MrJimmyDaFloof.
Furry artists are, like the rest of the fandom, amazing.
https://soatok.blog/2020/07/07/dont-forget-to-brush-your-fur/
Earlier this year, Cendyne wrote a blog post covering the use of HKDF, building partially upon my own blog post about HKDF and the KDF security definition, but moreso inspired by a cryptographic issue they identified in another company’s product (dubbed AnonCo).
At the bottom they teased:
Database cryptography is hard. The above sketch is not complete and does not address several threats! This article is quite long, so I will not be sharing the fixes.Cendyne
If you read Cendyne’s post, you may have nodded along with that remark and not appreciate the degree to which our naga friend was putting it mildly. So I thought I’d share some of my knowledge about real-world database cryptography in an accessible and fun format in the hopes that it might serve as an introduction to the specialization.
Note: I’m also not going to fix Cendyne’s sketch of AnonCo’s software here–partly because I don’t want to get in the habit of assigning homework or required reading, but mostly because it’s kind of obvious once you’ve learned the basics.
I’m including art of my fursona in this post… as is tradition for furry blogs.
If you don’t like furries, please feel free to leave this blog and read about this topic elsewhere.
Thanks to CMYKat for the awesome stickers.
Contents
- Database Cryptography?
- Cryptography for Relational Databases
- Cryptography for NoSQL Databases
- Searchable Encryption
- Order-{Preserving, Revealing} Encryption
- Deterministic Encryption
- Homomorphic Encryption
- Searchable Symmetric Encryption (SSE)
- You Can Have Little a HMAC, As a Treat
- Intermission
- Case Study: MongoDB Client-Side Encryption
- Wrapping Up
Database Cryptography?
The premise of database cryptography is deceptively simple: You have a database, of some sort, and you want to store sensitive data in said database.
The consequences of this simple premise are anything but simple. Let me explain.
Art: ScruffKerfluff
The sensitive data you want to store may need to remain confidential, or you may need to provide some sort of integrity guarantees throughout your entire system, or sometimes both. Sometimes all of your data is sensitive, sometimes only some of it is. Sometimes the confidentiality requirements of your data extends to where within a dataset the record you want actually lives. Sometimes that’s true of some data, but not others, so your cryptography has to be flexible to support multiple types of workloads.
Other times, you just want your disks encrypted at rest so if they grow legs and walk out of the data center, the data cannot be comprehended by an attacker. And you can’t be bothered to work on this problem any deeper. This is usually what compliance requirements cover. Boxes get checked, executives feel safer about their operation, and the whole time nobody has really analyzed the risks they’re facing.
But we’re not settling for mere compliance on this blog. Furries have standards, after all.
So the first thing you need to do before diving into database cryptography is threat modelling. The first step in any good threat model is taking inventory; especially of assumptions, requirements, and desired outcomes. A few good starter questions:
- What database software is being used? Is it up to date?
- What data is being stored in which database software?
- How are databases oriented in the network of the overall system?
- Is your database properly firewalled from the public Internet?
- How does data flow throughout the network, and when do these data flows intersect with the database?
- Which applications talk to the database? What languages are they written in? Which APIs do they use?
- How will cryptography secrets be managed?
- Is there one key for everyone, one key per tenant, etc.?
- How are keys rotated?
- Do you use envelope encryption with an HSM, or vend the raw materials to your end devices?
The first two questions are paramount for deciding how to write software for database cryptography, before you even get to thinking about the cryptography itself.
(This is not a comprehensive set of questions to ask, either. A formal threat model is much deeper in the weeds.)
The kind of cryptography protocol you need for, say, storing encrypted CSV files an S3 bucket is vastly different from relational (SQL) databases, which in turn will be significantly different from schema-free (NoSQL) databases.
Furthermore, when you get to the point that you can start to think about the cryptography, you’ll often need to tackle confidentiality and integrity separately.
If that’s unclear, think of a scenario like, “I need to encrypt PII, but I also need to digitally sign the lab results so I know it wasn’t tampered with at rest.”
My point is, right off the bat, we’ve got a three-dimensional matrix of complexity to contend with:
- On one axis, we have the type of database.
- Flat-file
- Relational
- Schema-free
- On another, we have the basic confidentiality requirements of the data.
- Field encryption
- Row encryption
- Column encryption
- Unstructured record encryption
- Encrypting entire collections of records
- Finally, we have the integrity requirements of the data.
- Field authentication
- Row/column authentication
- Unstructured record authentication
- Collection authentication (based on e.g. Sparse Merkle Trees)
And then you have a fourth dimension that often falls out of operational requirements for databases: Searchability.
Why store data in a database if you have no way to index or search the data for fast retrieval?
Credit: Harubaki
If you’re starting to feel overwhelmed, you’re not alone. A lot of developers drastically underestimate the difficulty of the undertaking, until they run head-first into the complexity.
Some just phone it in with AES_Encrypt()
calls in their MySQL queries. (Too bad ECB mode doesn’t provide semantic security!)
Which brings us to the meat of this blog post: The actual cryptography part.
Cryptography is the art of transforming information security problems into key management problems.Former coworker
Note: In the interest of time, I’m skipping over flat files and focusing instead on actual database technologies.
Cryptography for Relational Databases
Encrypting data in an SQL database seems simple enough, even if you’ve managed to shake off the complexity I teased from the introduction.
You’ve got data, you’ve got a column on a table. Just encrypt the data and shove it in a cell on that column and call it a day, right?
But, alas, this is a trap. There are so many gotchas that I can’t weave a coherent, easy-to-follow narrative between them all.
So let’s start with a simple question: where and how are you performing your encryption?
The Perils of Built-in Encryption Functions
MySQL provides functions called AES_Encrypt and AES_Decrypt, which many developers have unfortunately decided to rely on in the past.
It’s unfortunate because these functions implement ECB mode. To illustrate why ECB mode is bad, I encrypted one of my art commissions with AES in ECB mode:
Art by Riley, encrypted with AES-ECB
The problems with ECB mode aren’t exactly “you can see the image through it,” because ECB-encrypting a compressed image won’t have redundancy (and thus can make you feel safer than you are).
ECB art is a good visual for the actual issue you should care about, however: A lack of semantic security.
A cryptosystem is considered semantically secure if observing the ciphertext doesn’t reveal information about the plaintext (except, perhaps, the length; which all cryptosystems leak to some extent). More information here.
ECB art isn’t to be confused with ECB poetry, which looks like this:
Oh little one, you’re growing up
You’ll soon be writing C
You’ll treat your ints as pointers
You’ll nest the ternary
You’ll cut and paste from github
And try cryptography
But even in your darkest hour
Do not use ECBCBC’s BEASTly when padding’s abused
And CTR’s fine til a nonce is reused
Some say it’s a CRIME to compress then encrypt
Or store keys in the browser (or use javascript)
Diffie Hellman will collapse if hackers choose your g
And RSA is full of traps when e is set to 3
Whiten! Blind! In constant time! Don’t write an RNG!
But failing all, and listen well: Do not use ECBThey’ll say “It’s like a one-time-pad!
The data’s short, it’s not so bad
the keys are long–they’re iron clad
I have a PhD!”
And then you’re front page Hacker News
Your passwords cracked–Adobe Blues.
Don’t leave your penguins showing through,
Do not use ECB— Ben Nagy, PoC||GTFO 0x04:13
Most people reading this probably know better than to use ECB mode already, and don’t need any of these reminders, but there is still a lot of code that inadvertently uses ECB mode to encrypt data in the database.
Also, SHOW processlist;
leaks your encryption keys. Oops.
Credit: CMYKatt
Application-layer Relational Database Cryptography
Whether burned by ECB or just cautious about not giving your secrets to the system that stores all the ciphertext protected by said secret, a common next step for developers is to simply encrypt in their server-side application code.
And, yes, that’s part of the answer. But how you encrypt is important.
Credit: Harubaki
“I’ll encrypt with CBC mode.”
If you don’t authenticate your ciphertext, you’ll be sorry. Maybe try again?
“Okay, fine, I’ll use an authenticated mode like GCM.”
Did you remember to make the table and column name part of your AAD? What about the primary key of the record?
“What on Earth are you talking about, Soatok?”
Welcome to the first footgun of database cryptography!
Confused Deputies
Encrypting your sensitive data is necessary, but not sufficient. You need to also bind your ciphertexts to the specific context in which they are stored.
To understand why, let’s take a step back: What specific threat does encrypting your database records protect against?
We’ve already established that “your disks walk out of the datacenter” is a “full disk encryption” problem, so if you’re using application-layer cryptography to encrypt data in a relational database, your threat model probably involves unauthorized access to the database server.
What, then, stops an attacker from copying ciphertexts around?
Credit: CMYKatt
Let’s say I have a legitimate user account with an ID 12345, and I want to read your street address, but it’s encrypted in the database. But because I’m a clever hacker, I have unfettered access to your relational database server.
All I would need to do is simply…
UPDATE table SET addr_encrypted = 'your-ciphertext' WHERE id = 12345
…and then access the application through my legitimate access. Bam, data leaked. As an attacker, I can probably even copy fields from other columns and it will just decrypt. Even if you’re using an authenticated mode.
We call this a confused deputy attack, because the deputy (the component of the system that has been delegated some authority or privilege) has become confused by the attacker, and thus undermined an intended security goal.
The fix is to use the AAD parameter from the authenticated mode to bind the data to a given context. (AAD = Additional Authenticated Data.)
- $addr = aes_gcm_encrypt($addr, $key);+ $addr = aes_gcm_encrypt($addr, $key, canonicalize([+ $tableName,+ $columnName,+ $primaryKey+ ]);
Now if I start cutting and pasting ciphertexts around, I get a decryption failure instead of silently decrypting plaintext.
This may sound like a specific vulnerability, but it’s more of a failure to understand an important general lesson with database cryptography:
Where your data lives is part of its identity, and MUST be authenticated.Soatok’s Rule of Database Cryptography
Canonicalization Attacks
In the previous section, I introduced a pseudocode called canonicalize()
. This isn’t a pasto from some reference code; it’s an important design detail that I will elaborate on now.
First, consider you didn’t do anything to canonicalize your data, and you just joined strings together and called it a day…
function dumbCanonicalize( string $tableName, string $columnName, string|int $primaryKey): string { return $tableName . '_' . $columnName . '#' . $primaryKey;}
Consider these two inputs to this function:
dumbCanonicalize('customers', 'last_order_uuid', 123);
dumbCanonicalize('customers_last_order', 'uuid', 123);
In this case, your AAD would be the same, and therefore, your deputy can still be confused (albeit in a narrower use case).
In Cendyne’s article, AnonCo did something more subtle: The canonicalization bug created a collision on the inputs to HKDF, which resulted in an unintentional key reuse.
Up until this point, their mistake isn’t relevant to us, because we haven’t even explored key management at all. But the same design flaw can re-emerge in multiple locations, with drastically different consequence.
Multi-Tenancy
Once you’ve implemented a mitigation against Confused Deputies, you may think your job is done. And it very well could be.
Often times, however, software developers are tasked with building support for Bring Your Own Key (BYOK).
This is often spawned from a specific compliance requirement (such as cryptographic shredding; i.e. if you erase the key, you can no longer recover the plaintext, so it may as well be deleted).
Other times, this is driven by a need to cut costs: Storing different users’ data in the same database server, but encrypting it such that they can only encrypt their own records.
Two things can happen when you introduce multi-tenancy into your database cryptography designs:
- Invisible Salamanders becomes a risk, due to multiple keys being possible for any given encrypted record.
- Failure to address the risk of Invisible Salamanders can undermine your protection against Confused Deputies, thereby returning you to a state before you properly used the AAD.
So now you have to revisit your designs and ensure you’re using a key-committing authenticated mode, rather than just a regular authenticated mode.
Isn’t cryptography fun?
“What Are Invisible Salamanders?”
This refers to a fun property of AEAD modes based on Polynomical MACs. Basically, if you:
- Encrypt one message under a specific key and nonce.
- Encrypt another message under a separate key and nonce.
…Then you can get the same exact ciphertext and authentication tag. Performing this attack requires you to control the keys for both encryption operations.
This was first demonstrated in an attack against encrypted messaging applications, where a picture of a salamander was hidden from the abuse reporting feature because another attached file had the same authentication tag and ciphertext, and you could trick the system if you disclosed the second key instead of the first. Thus, the salamander is invisible to attackers.
Art: CMYKat
We’re not quite done with relational databases yet, but we should talk about NoSQL databases for a bit. The final topic in scope applies equally to both, after all.
Cryptography for NoSQL Databases
Most of the topics from relational databases also apply to NoSQL databases, so I shall refrain from duplicating them here. This article is already sufficiently long to read, after all, and I dislike redundancy.
NoSQL is Built Different
The main thing that NoSQL databases offer in the service of making cryptographers lose sleep at night is the schema-free nature of NoSQL designs.
What this means is that, if you’re using a client-side encryption library for a NoSQL database, the previous concerns about confused deputy attacks are amplified by the malleability of the document structure.
Additionally, the previously discussed cryptographic attacks against the encryption mode may be less expensive for an attacker to pull off.
Consider the following record structure, which stores a bunch of data stored with AES in CBC mode:
{ "encrypted-data-key": "<blob>", "name": "<ciphertext>", "address": [ "<ciphertext>", "<ciphertext>" ], "social-security": "<ciphertext>", "zip-code": "<ciphertext>"}
If this record is decrypted with code that looks something like this:
$decrypted = [];// ... snip ...foreach ($record['address'] as $i => $addrLine) { try { $decrypted['address'][$i] = $this->decrypt($addrLine); } catch (Throwable $ex) { // You'd never deliberately do this, but it's for illustration $this->doSomethingAnOracleCanObserve($i); // This is more believable, of course: $this->logDecryptionError($ex, $addrLine); $decrypted['address'][$i] = ''; }}
Then you can keep appending rows to the "address"
field to reduce the number of writes needed to exploit a padding oracle attack against any of the <ciphertext>
fields.
Art: Harubaki
This isn’t to say that NoSQL is less secure than SQL, from the context of client-side encryption. However, the powerful feature sets that NoSQL users are accustomed to may also give attackers a more versatile toolkit to work with.
Record Authentication
A pedant may point out that record authentication applies to both SQL and NoSQL. However, I mostly only observe this feature in NoSQL databases and document storage systems in the wild, so I’m shoving it in here.
Encrypting fields is nice and all, but sometimes what you want to know is that your unencrypted data hasn’t been tampered with as it flows through your system.
The trivial way this is done is by using a digital signature algorithm over the whole record, and then appending the signature to the end. When you go to verify the record, all of the information you need is right there.
This works well enough for most use cases, and everyone can pack up and go home. Nothing more to see here.
Except…
When you’re working with NoSQL databases, you often want systems to be able to write to additional fields, and since you’re working with schema-free blobs of data rather than a normalized set of relatable tables, the most sensible thing to do is to is to append this data to the same record.
Except, oops! You can’t do that if you’re shoving a digital signature over the record. So now you need to specify which fields are to be included in the signature.
And you need to think about how to model that in a way that doesn’t prohibit schema upgrades nor allow attackers to perform downgrade attacks. (See below.)
I don’t have any specific real-world examples here that I can point to of this problem being solved well.
Art: CMYKat
Furthermore, as with preventing confused deputy and/or canonicalization attacks above, you must also include the fully qualified path of each field in the data that gets signed.
As I said with encryption before, but also true here:
Where your data lives is part of its identity, and MUST be authenticated.Soatok’s Rule of Database Cryptography
This requirement holds true whether you’re using symmetric-key authentication (i.e. HMAC) or asymmetric-key digital signatures (e.g. EdDSA).
Bonus: A Maximally Schema-Free, Upgradeable Authentication Design
Art: Harubaki
Okay, how do you solve this problem so that you can perform updates and upgrades to your schema but without enabling attackers to downgrade the security? Here’s one possible design.
Let’s say you have two metadata fields on each record:
- A compressed binary string representing which fields should be authenticated. This field is, itself, not authenticated. Let’s call this
meta-auth
. - A compressed binary string representing which of the authenticated fields should also be encrypted. This field is also authenticated. This is at most the same length as the first metadata field. Let’s call this
meta-enc
.
Furthermore, you will specify a canonical field ordering for both how data is fed into the signature algorithm as well as the field mappings in meta-auth
and meta-enc
.
{ "example": { "credit-card": { "number": /* encrypted */, "expiration": /* encrypted */, "ccv": /* encrypted */ }, "superfluous": { "rewards-member": null } }, "meta-auth": compress_bools([ true, /* example.credit-card.number */ true, /* example.credit-card.expiration */ true, /* example.credit-card.ccv */ false, /* example.superfluous.rewards-member */ true /* meta-enc */ ]), "meta-enc": compress_bools([ true, /* example.credit-card.number */ true, /* example.credit-card.expiration */ true, /* example.credit-card.ccv */ false /* example.superfluous.rewards-member */ ]), "signature": /* -- snip -- */}
When you go to append data to an existing record, you’ll need to update meta-auth
to include the mapping of fields based on this canonical ordering to ensure only the intended fields get validated.
When you update your code to add an additional field that is intended to be signed, you can roll that out for new records and the record will continue to be self-describing:
- New records will have the additional field flagged as authenticated in
meta-auth
(andmeta-enc
will grow) - Old records will not, but your code will still sign them successfully
- To prevent downgrade attacks, simply include a schema version ID as an additional plaintext field that gets authenticated. An attacker who tries to downgrade will need to be able to produce a valid signature too.
You might think meta-auth
gives an attacker some advantage, but this only includes which fields are included in the security boundary of the signature or MAC, which allows unauthenticated data to be appended for whatever operational purpose without having to update signatures or expose signing keys to a wider part of the network.
{ "example": { "credit-card": { "number": /* encrypted */, "expiration": /* encrypted */, "ccv": /* encrypted */ }, "superfluous": { "rewards-member": null } }, "meta-auth": compress_bools([ true, /* example.credit-card.number */ true, /* example.credit-card.expiration */ true, /* example.credit-card.ccv */ false, /* example.superfluous.rewards-member */ true, /* meta-enc */ true /* meta-version */ ]), "meta-enc": compress_bools([ true, /* example.credit-card.number */ true, /* example.credit-card.expiration */ true, /* example.credit-card.ccv */ false, /* example.superfluous.rewards-member */ true /* meta-version */ ]), "meta-version": 0x01000000, "signature": /* -- snip -- */}
If an attacker tries to use the meta-auth
field to mess with a record, the best they can hope for is an Invalid Signature exception (assuming the signature algorithm is secure to begin with).
Even if they keep all of the fields the same, but play around with the structure of the record (e.g. changing the XPath or equivalent), so long as the path is authenticated with each field, breaking this is computationally infeasible.
Searchable Encryption
If you’ve managed to make it through the previous sections, congratulations, you now know enough to build a secure but completely useless database.
Art: CMYKat
Okay, put away the pitchforks; I will explain.
Part of the reason why we store data in a database, rather than a flat file, is because we want to do more than just read and write. Sometimes computer scientists want to compute. Almost always, you want to be able to query your database for a subset of records based on your specific business logic needs.
And so, a database which doesn’t do anything more than store ciphertext and maybe signatures is pretty useless to most people. You’d have better luck selling Monkey JPEGs to furries than convincing most businesses to part with their precious database-driven report generators.
Art: Sophie
So whenever one of your users wants to actually use their data, rather than just store it, they’re forced to decide between two mutually exclusive options:
- Encrypting the data, to protect it from unauthorized disclosure, but render it useless
- Doing anything useful with the data, but leaving it unencrypted in the database
This is especially annoying for business types that are all in on the Zero Trust buzzword.
Fortunately, the cryptographers are at it again, and boy howdy do they have a lot of solutions for this problem.
Order-{Preserving, Revealing} Encryption
On the fun side of things, you have things like Order-Preserving and Order-Revealing Encryption, which Matthew Green wrote about at length.
[D]atabase encryption has been a controversial subject in our field. I wish I could say that there’s been an actual debate, but it’s more that different researchers have fallen into different camps, and nobody has really had the data to make their position in a compelling way. There have actually been some very personal arguments made about it.Attack of the week: searchable encryption and the ever-expanding leakage function
The problem with these designs is that they have a significant enough leakage that it no longer provides semantic security.
From Grubbs, et al. (GLMP, 2019.)
Colors inverted to fit my blog’s theme better.
To put it in other words: These designs are only marginally better than ECB mode, and probably deserve their own poems too.
Order revealing
Reveals much more than order
Softcore ECBOrder preserving
Semantic security?
Only in your dreamsHaiku for your consideration
Deterministic Encryption
Here’s a simpler, but also terrible, idea for searchable encryption: Simply give up on semantic security entirely.
If you recall the AES_{De,En}crypt()
functions built into MySQL I mentioned at the start of this article, those are the most common form of deterministic encryption I’ve seen in use.
SELECT * FROM foo WHERE bar = AES_Encrypt('query', 'key');
However, there are slightly less bad variants. If you use AES-GCM-SIV with a static nonce, your ciphertexts are fully deterministic, and you can encrypt a small number of distinct records safely before you’re no longer secure.
From Page 14 of the linked paper. Full view.
That’s certainly better than nothing, but you also can’t mitigate confused deputy attacks. But we can do better than this.
Homomorphic Encryption
In a safer plane of academia, you’ll find homomorphic encryption, which researchers recently demonstrated with serving Wikipedia pages in a reasonable amount of time.
Homomorphic encryption allows computations over the ciphertext, which will be reflected in the plaintext, without ever revealing the key to the entity performing the computation.
If this sounds vaguely similar to the conditions that enable chosen-ciphertext attacks, you probably have a good intuition for how it works: RSA is homomorphic to multiplication, AES-CTR is homomorphic to XOR. Fully homomorphic encryption uses lattices, which enables multiple operations but carries a relatively enormous performance cost.
Art: Harubaki
Homomorphic encryption sometimes intersects with machine learning, because the notion of training an encrypted model by feeding it encrypted data, then decrypting it after-the-fact is desirable for certain business verticals. Your data scientists never see your data, and you have some plausible deniability about the final ML model this work produces. This is like a Siren song for Venture Capitalist-backed medical technology companies. Tech journalists love writing about it.
However, a less-explored use case is the ability to encrypt your programs but still get the correct behavior and outputs. Although this sounds like a DRM technology, it’s actually something that individuals could one day use to prevent their ISPs or cloud providers from knowing what software is being executed on the customer’s leased hardware. The potential for a privacy win here is certainly worth pondering, even if you’re a tried and true Pirate Party member.
Just say “NO” to the copyright cartels.
Art: CMYKat
Searchable Symmetric Encryption (SSE)
Forget about working at the level of fields and rows or individual records. What if we, instead, worked over collections of documents, where each document is viewed as a set of keywords from a keyword space?
Art: CMYKat
That’s the basic premise of SSE: Encrypting collections of documents rather than individual records.
The actual implementation details differ greatly between designs. They also differ greatly in their leakage profiles and susceptibility to side-channel attacks.
Some schemes use a so-called trapdoor permutation, such as RSA, as one of their building blocks.
Some schemes only allow for searching a static set of records, while others can accommodate new data over time (with the trade-off between more leakage or worse performance).
If you’re curious, you can learn more about SSE here, and see some open source SEE implementations online here.
You’re probably wondering, “If SSE is this well-studied and there are open source implementations available, why isn’t it more widely used?”
Your guess is as good as mine, but I can think of a few reasons:
- The protocols can be a little complicated to implement, and aren’t shipped by default in cryptography libraries (i.e. OpenSSL’s libcrypto or libsodium).
- Every known security risk in SSE is the product of a trade-offs, rather than there being a single winner for all use cases that developers can feel comfortable picking.
- Insufficient marketing and developer advocacy.
SSE schemes are mostly of interest to academics, although Seny Kamara (Brown Univeristy professior and one of the luminaries of searchable encryption) did try to develop an app called Pixek which used SSE to encrypt photos.
Maybe there’s room for a cryptography competition on searchable encryption schemes in the future.
You Can Have Little a HMAC, As a Treat
Finally, I can’t talk about searchable encryption without discussing a technique that’s older than dirt by Internet standards, that has been independently reinvented by countless software developers tasked with encrypting database records.
The oldest version I’ve been able to track down dates to 2006 by Raul Garcia at Microsoft, but I’m not confident that it didn’t exist before.
The idea I’m alluding to goes like this:
- Encrypt your data, securely, using symmetric cryptography.
(Hopefully your encryption addresses the considerations outlined in the relevant sections above.) - Separately, calculate an HMAC over the unencrypted data with a separate key used exclusively for indexing.
When you need to query your data, you can just recalculate the HMAC of your challenge and fetch the records that match it. Easy, right?
Even if you rotate your keys for encryption, you keep your indexing keys static across your entire data set. This lets you have durable indexes for encrypted data, which gives you the ability to do literal lookups for the performance hit of a hash function.
Additionally, everyone has HMAC in their toolkit, so you don’t have to move around implementations of complex cryptographic building blocks. You can live off the land. What’s not to love?
Hooray!
However, if you stopped here, we regret to inform you that your data is no longer indistinguishable from random, which probably undermines the security proof for your encryption scheme.
How annoying!
Of course, you don’t have to stop with the addition of plain HMAC to your database encryption software.
Take a page from Troy Hunt: Truncate the output to provide k-anonymity rather than a direct literal look-up.
“K-What Now?”
Imagine you have a full HMAC-SHA256 of the plaintext next to every ciphertext record with a static key, for searchability.
Each HMAC output corresponds 1:1 with a unique plaintext.
Because you’re using HMAC with a secret key, an attacker can’t just build a rainbow table like they would when attempting password cracking, but it still leaks duplicate plaintexts.
For example, an HMAC-SHA256 output might look like this: 04a74e4c0158e34a566785d1a5e1167c4e3455c42aea173104e48ca810a8b1ae
Art: CMYKat\
If you were to slice off most of those bytes (e.g. leaving only the last 3, which in the previous example yields a8b1ae
), then with sufficient records, multiple plaintexts will now map to the same truncated HMAC tag.
Which means if you’re only revealing a truncated HMAC tag to the database server (both when storing records or retrieving them), you can now expect false positives due to collisions in your truncated HMAC tag.
These false positives give your data a discrete set of anonymity (called k-anonymity), which means an attacker with access to your database cannot:
- Distinguish between two encrypted records with the same short HMAC tag.
- Reverse engineer the short HMAC tag into a single possible plaintext value, even if they can supply candidate queries and study the tags sent to the database.
Art: CMYKat\
As with SSE above, this short HMAC technique exposes a trade-off to users.
- Too much k-anonymity (i.e. too many false positives), and you will have to decrypt-then-discard multiple mismatching records. This can make queries slow.
- Not enough k-anonymity (i.e. insufficient false positives), and you’re no better off than a full HMAC.
Even more troublesome, the right amount to truncate is expressed in bits (not bytes), and calculating this value depends on the number of unique plaintext values you anticipate in your dataset. (Fortunately, it grows logarithmically, so you’ll rarely if ever have to tune this.)
If you’d like to play with this idea, here’s a quick and dirty demo script.
Intermission
If you started reading this post with any doubts about Cendyne’s statement that “Database cryptography is hard”, by making it to this point, they’ve probably been long since put to rest.
Art: Harubaki
Conversely, anyone that specializes in this topic is probably waiting for me to say anything novel or interesting; their patience wearing thin as I continue to rehash a surface-level introduction of their field without really diving deep into anything.
Thus, if you’ve read this far, I’d like to demonstrate the application of what I’ve covered thus far into a real-world case study into an database cryptography product.
Case Study: MongoDB Client-Side Encryption
MongoDB is an open source schema-free NoSQL database. Last year, MongoDB made waves when they announced Queryable Encryption in their upcoming client-side encryption release.
Taken from the press release, but adapted for dark themes.
A statement at the bottom of their press release indicates that this isn’t clown-shoes:
Queryable Encryption was designed by MongoDB’s Advanced Cryptography Research Group, headed by Seny Kamara and Tarik Moataz, who are pioneers in the field of encrypted search. The Group conducts cutting-edge peer-reviewed research in cryptography and works with MongoDB engineering teams to transfer and deploy the latest innovations in cryptography and privacy to the MongoDB data platform.
If you recall, I mentioned Seny Kamara in the SSE section of this post. They certainly aren’t wrong about Kamara and Moataz being pioneers in this field.
So with that in mind, let’s explore the implementation in libmongocrypt and see how it stands up to scrutiny.
MongoCrypt: The Good
MongoDB’s encryption library takes key management seriously: They provide a KMS integration for cloud users by default (supporting both AWS and Azure).
MongoDB uses Encrypt-then-MAC with AES-CBC and HMAC-SHA256, which is congruent to what Signal does for message encryption.
How Is Queryable Encryption Implemented?
From the current source code, we can see that MongoCrypt generates several different types of tokens, using HMAC (calculation defined here).
According to their press release:
The feature supports equality searches, with additional query types such as range, prefix, suffix, and substring planned for future releases.
Which means that most of the juicy details probably aren’t public yet.
These HMAC-derived tokens are stored wholesale in the data structure, but most are encrypted before storage using AES-CTR.
There are more layers of encryption (using AEAD), server-side token processing, and more AES-CTR-encrypted edge tokens. All of this is finally serialized (implementation) as one blob for storage.
Since only the equality operation is currently supported (which is the same feature you’d get from HMAC), it’s difficult to speculate what the full feature set looks like.
However, since Kamara and Moataz are leading its development, it’s likely that this feature set will be excellent.
MongoCrypt: The Bad
Every call to do_encrypt()
includes at most the Key ID (but typically NULL
) as the AAD. This means that the concerns over Confused Deputies (and NoSQL specifically) are relevant to MongoDB.
However, even if they did support authenticating the fully qualified path to a field in the AAD for their encryption, their AEAD construction is vulnerable to the kind of canonicalization attack I wrote about previously.
First, observe this code which assembles the multi-part inputs into HMAC.
/* Construct the input to the HMAC */uint32_t num_intermediates = 0;_mongocrypt_buffer_t intermediates[3];// -- snip --if (!_mongocrypt_buffer_concat ( &to_hmac, intermediates, num_intermediates)) { CLIENT_ERR ("failed to allocate buffer"); goto done;}if (hmac == HMAC_SHA_512_256) { uint8_t storage[64]; _mongocrypt_buffer_t tag = {.data = storage, .len = sizeof (storage)}; if (!_crypto_hmac_sha_512 (crypto, Km, &to_hmac, &tag, status)) { goto done; } // Truncate sha512 to first 256 bits. memcpy (out->data, tag.data, MONGOCRYPT_HMAC_LEN);} else { BSON_ASSERT (hmac == HMAC_SHA_256); if (!_mongocrypt_hmac_sha_256 (crypto, Km, &to_hmac, out, status)) { goto done; }}
The implementation of _mongocrypt_buffer_concat()
can be found here.
If either the implementation of that function, or the code I snipped from my excerpt, had contained code that prefixed every segment of the AAD with the length of the segment (represented as a uint64_t
to make overflow infeasible), then their AEAD mode would not be vulnerable to canonicalization issues.
Using TupleHash would also have prevented this issue.
Silver lining for MongoDB developers: Because the AAD is either a key ID or NULL, this isn’t exploitable in practice.
The first cryptographic flaw sort of cancels the second out.
If the libmongocrypt developers ever want to mitigate Confused Deputy attacks, they’ll need to address this canonicalization issue too.
MongoCrypt: The Ugly
MongoCrypt supports deterministic encryption.
If you specify deterministic encryption for a field, your application passes a deterministic initialization vector to AEAD.
We already discussed why this is bad above.
Wrapping Up
This was not a comprehensive treatment of the field of database cryptography. There are many areas of this field that I did not cover, nor do I feel qualified to discuss.
However, I hope anyone who takes the time to read this finds themselves more familiar with the subject.
Additionally, I hope any developers who think “encrypting data in a database is [easy, trivial] (select appropriate)” will find this broad introduction a humbling experience.
Art: CMYKat
https://soatok.blog/2023/03/01/database-cryptography-fur-the-rest-of-us/
#appliedCryptography #blockCipherModes #cryptography #databaseCryptography #databases #encryptedSearch #HMAC #MongoCrypt #MongoDB #QueryableEncryption #realWorldCryptography #security #SecurityGuidance #SQL #SSE #symmetricCryptography #symmetricSearchableEncryption
NIST opened public comments on SP 800-108 Rev. 1 (the NIST recommendations for Key Derivation Functions) last month. The main thing that’s changed from the original document published in 2009 is the inclusion of the Keccak-based KMAC alongside the incumbent algorithms.One of the recommendations of SP 800-108 is called “KDF in Counter Mode”. A related document, SP 800-56C, suggests using a specific algorithm called HKDF instead of the generic Counter Mode construction from SP 800-108–even though they both accomplish the same goal.
Isn’t standards compliance fun?
Interestingly, HKDF isn’t just an inconsistently NIST-recommended KDF, it’s also a common building block in a software developer’s toolkit which sees a lot of use in different protocols.
Unfortunately, the way HKDF is widely used is actually incorrect given its formal security definition. I’ll explain what I mean in a moment.
Art: Scruff
What is HKDF?
To first understand what HKDF is, you first need to know about HMAC.HMAC is a standard message authentication code (MAC) algorithm built with cryptographic hash functions (that’s the H). HMAC is specified in RFC 2104 (yes, it’s that old).
HKDF is a key-derivation function that uses HMAC under-the-hood. HKDF is commonly used in encryption tools (Signal, age). HKDF is specified in RFC 5869.
HKDF is used to derive a uniformly-random secret key, typically for use with symmetric cryptography algorithms. In any situation where a key might need to be derived, you might see HKDF being used. (Although, there may be better algorithms.)
Art: LvJ
How Developers Understand and Use HKDF
If you’re a software developer working with cryptography, you’ve probably seen an API in the crypto module for your programming language that looks like this, or maybe this.hash_hkdf( string $algo, string $key, int $length = 0, string $info = "", string $salt = ""): string
Software developers that work with cryptography will typically think of the HKDF parameters like so:
$algo
— which hash function to use$key
— the input key, from which multiple keys can be derived$length
— how many bytes to derive$info
— some arbitrary string used to bind a derived key to an intended context$salt
— some additional randomness (optional)The most common use-case of HKDF is to implement key-splitting, where a single input key (the Initial Keying Material, or IKM) is used to derive two or more independent keys, so that you’re never using a single key for multiple algorithms.
See also:
[url=https://github.com/defuse/php-encryption]defuse/php-encryption[/url]
, a popular PHP encryption library that does exactly what I just described.At a super high level, the HKDF usage I’m describing looks like this:
class MyEncryptor {protected function splitKeys(CryptographyKey $key, string $salt): array { $encryptKey = new CryptographyKey(hash_hkdf( 'sha256', $key->getRawBytes(), 32, 'encryption', $salt )); $authKey = new CryptographyKey(hash_hkdf( 'sha256', $key->getRawBytes(), 32, 'message authentication', $salt )); return [$encryptKey, $authKey];}public function encryptString(string $plaintext, CryptographyKey $key): string{ $salt = random_bytes(32); [$encryptKey, $hmacKey] = $this->splitKeys($key, $salt); // ... encryption logic here ... return base64_encode($salt . $ciphertext . $mac);}public function decryptString(string $encrypted, CryptographyKey $key): string{ $decoded = base64_decode($encrypted); $salt = mb_substr($decoded, 0, 32, '8bit'); [$encryptKey, $hmacKey] = $this->splitKeys($key, $salt); // ... decryption logic here ... return $plaintext;}// ... other method here ...}
Unfortunately, anyone who ever does something like this just violated one of the core assumptions of the HKDF security definition and no longer gets to claim “KDF security” for their construction. Instead, your protocol merely gets to claim “PRF security”.
Art: Harubaki
KDF? PRF? OMGWTFBBQ?
Let’s take a step back and look at some basic concepts.(If you want a more formal treatment, read this Stack Exchange answer.)
PRF: Pseudo-Random Functions
A pseudorandom function (PRF) is an efficient function that emulates a random oracle.“What the hell’s a random oracle?” you ask? Well, Thomas Pornin has the best explanation for random oracles:
A random oracle is described by the following model:
- There is a black box. In the box lives a gnome, with a big book and some dice.
- We can input some data into the box (an arbitrary sequence of bits).
- Given some input that he did not see beforehand, the gnome uses his dice to generate a new output, uniformly and randomly, in some conventional space (the space of oracle outputs). The gnome also writes down the input and the newly generated output in his book.
- If given an already seen input, the gnome uses his book to recover the output he returned the last time, and returns it again.
So a random oracle is like a kind of hash function, such that we know nothing about the output we could get for a given input message m. This is a useful tool for security proofs because they allow to express the attack effort in terms of number of invocations to the oracle.
The problem with random oracles is that it turns out to be very difficult to build a really “random” oracle. First, there is no proof that a random oracle can really exist without using a gnome. Then, we can look at what we have as candidates: hash functions. A secure hash function is meant to be resilient to collisions, preimages and second preimages. These properties do not imply that the function is a random oracle.
Thomas Pornin
Alternatively, Wikipedia has a more formal definition available to the academic-inclined.In practical terms, we can generate a strong PRF out of secure cryptographic hash functions by using a keyed construction; i.e. HMAC.
Thus, as long as your HMAC key is a secret, the output of HMAC can be generally treated as a PRF for all practical purposes. Your main security consideration (besides key management) is the collision risk if you truncate its output.
Art: LvJ
KDF: Key Derivation Functions
A key derivation function (KDF) is exactly what it says on the label: a cryptographic algorithm that derives one or more cryptographic keys from a secret input (which may be another cryptography key, a group element from a Diffie-Hellman key exchange, or a human-memorable password).Note that passwords should be used with a Password-Based Key Derivation Function, such as scrypt or Argon2id, not HKDF.
Despite what you may read online, KDFs do not need to be built upon cryptographic hash functions, specifically; but in practice, they often are.
A notable counter-example to this hash function assumption: CMAC in Counter Mode (from NIST SP 800-108) uses AES-CMAC, which is a variable-length input variant of CBC-MAC. CBC-MAC uses a block cipher, not a hash function.
Regardless of the construction, KDFs use a PRF under the hood, and the output of a KDF is supposed to be a uniformly random bit string.
Art: LvJ
PRF vs KDF Security Definitions
The security definition for a KDF has more relaxed requirements than PRFs: PRFs require the secret key be uniformly random. KDFs do not have this requirement.If you use a KDF with a non-uniformly random IKM, you probably need the KDF security definition.
If your IKM is already uniformly random (i.e. the “key separation” use case), you can get by with just a PRF security definition.
After all, the entire point of KDFs is to allow a congruent security level as you’d get from uniformly random secret keys, without also requiring them.
However, if you’re building a protocol with a security requirement satisfied by a KDF, but you actually implemented a PRF (i.e., not a KDF), this is a security vulnerability in your cryptographic design.
Art: LvJ
The HKDF Algorithm
HKDF is an HMAC-based KDF. Its algorithm consists of two distinct steps:
HKDF-Extract
uses the Initial Keying Material (IKM) and Salt to produce a Pseudo-Random Key (PRK).HKDF-Expand
actually derives the keys using PRK, theinfo
parameter, and a counter (from0
to255
) for each hash function output needed to generate the desired output length.If you’d like to see an implementation of this algorithm,
defuse/php-encryption
provides one (since it didn’t land in PHP until 7.1.0). Alternatively, there’s a Python implementation on Wikipedia that uses HMAC-SHA256.This detail about the two steps will matter a lot in just a moment.
Art: Swizz
How HKDF Salts Are Misused
The HKDF paper, written by Hugo Krawczyk, contains the following definition (page 7).The paper goes on to discuss the requirements for authenticating the salt over the communication channel, lest the attacker have the ability to influence it.
A subtle detail of this definition is that the security definition says that A salt value , not Multiple salt values.
Which means: You’re not supposed to use HKDF with a constant IKM, info label, etc. but vary the salt for multiple invocations. The salt must either be a fixed random value, or NULL.
The HKDF RFC makes this distinction even less clear when it argues for random salts.
We stress, however, that the use of salt adds significantly to the strength of HKDF, ensuring independence between different uses of the hash function, supporting “source-independent” extraction, and strengthening the analytical results that back the HKDF design.Random salt differs fundamentally from the initial keying material in two ways: it is non-secret and can be re-used. As such, salt values are available to many applications. For example, a pseudorandom number generator (PRNG) that continuously produces outputs by applying HKDF to renewable pools of entropy (e.g., sampled system events) can fix a salt value and use it for multiple applications of HKDF without having to protect the secrecy of the salt. In a different application domain, a key agreement protocol deriving cryptographic keys from a Diffie-Hellman exchange can derive a salt value from public nonces exchanged and authenticated between communicating parties as part of the key agreement (this is the approach taken in [IKEv2]).
RFC 5869, section 3.1
Okay, sure. Random salts are better than a NULL salt. And while this section alludes to “[fixing] a salt value” to “use it for multiple applications of HKDF without having to protect the secrecy of the salt”, it never explicitly states this requirement. Thus, the poor implementor is left to figure this out on their own.Thus, because it’s not using HKDF in accordance with its security definition, many implementations (such as the PHP encryption library we’ve been studying) do not get to claim that their construction has KDF security.
Instead, they only get to claim “Strong PRF” security, which you can get from just using HMAC.
Art: LvJ
What Purpose Do HKDF Salts Actually Serve?
Recall that the HKDF algorithm uses salts in the HDKF-Extract step. Salts in this context were intended for deriving keys from a Diffie-Hellman output, or a human-memorable password.In the case of [Elliptic Curve] Diffie-Hellman outputs, the result of the key exchange algorithm is a random group element, but not necessarily uniformly random bit string. There’s some structure to the output of these functions. This is why you always, at minimum, apply a cryptographic hash function to the output of [EC]DH before using it as a symmetric key.
HKDF uses salts as a mechanism to improve the quality of randomness when working with group elements and passwords.
Extending the nonce for a symmetric-key AEAD mode is a good idea, but using HKDF’s salt parameter specifically to accomplish this is a misuse of its intended function, and produces a weaker argument for your protocol’s security than would otherwise be possible.
How Should You Introduce Randomness into HKDF?
Just shove it in theinfo
parameter.Art: LvJ
It may seem weird, and defy intuition, but the correct way to introduce randomness into HKDF as most developers interact with the algorithm is to skip the salt parameter entirely (either fixing it to a specific value for domain-separation or leaving it NULL), and instead concatenate data into the
info
parameter.class BetterEncryptor extends MyEncryptor {protected function splitKeys(CryptographyKey $key, string $salt): array { $encryptKey = new CryptographyKey(hash_hkdf( 'sha256', $key->getRawBytes(), 32, $salt . 'encryption', '' // intentionally empty )); $authKey = new CryptographyKey(hash_hkdf( 'sha256', $key->getRawBytes(), 32, $salt . 'message authentication', '' // intentionally empty )); return [$encryptKey, $authKey];}}
Of course, you still have to watch out for canonicalization attacks if you’re feeding multi-part messages into the info tag.
Another advantage: This also lets you optimize your HKDF calls by caching the PRK from the
HKDF-Extract
step and reuse it for multiple invocations ofHKDF-Expand
with a distinctinfo
. This allows you to reduce the number of hash function invocations from to (since each HMAC involves two hash function invocations).Notably, this HKDF salt usage was one of the things that was changed in V3/V4 of PASETO.
Does This Distinction Really Matter?
If it matters, your cryptographer will tell you it matters–which probably means they have a security proof that assumes the KDF security definition for a very good reason, and you’re not allowed to violate that assumption.Otherwise, probably not. Strong PRF security is still pretty damn good for most threat models.
Art: LvJ
Closing Thoughts
If your takeaway was, “Wow, I feel stupid,” don’t, because you’re in good company.I’ve encountered several designs in my professional life that shoved the randomness into the
info
parameter, and it perplexed me because there was a perfectly good salt parameter right there. It turned out, I was wrong to believe that, for all of the subtle and previously poorly documented reasons discussed above. But now we both know, and we’re all better off for it.So don’t feel dumb for not knowing. I didn’t either, until this was pointed out to me by a very patient colleague.
“Feeling like you were stupid” just means you learned.
(Art: LvJ)Also, someone should really get NIST to be consistent about whether you should use HKDF or “KDF in Counter Mode with HMAC” as a PRF, because SP 800-108’s new revision doesn’t concede this point at all (presumably a relic from the 2009 draft).
This concession was made separately in 2011 with SP 800-56C revision 1 (presumably in response to criticism from the 2010 HKDF paper), and the present inconsistency is somewhat vexing.
(On that note, does anyone actually use the NIST 800-108 KDFs instead of HKDF? If so, why? Please don’t say you need CMAC…)
Bonus Content
These questions were asked after this blog post initially went public, and I thought they were worth adding. If you ask a good question, it may end up being edited in at the end, too.Art: LvJ
Why Does HKDF use the Salt as the HMAC key in the Extract Step? (via r/crypto)
Broadly speaking, when applying a PRF to two “keys”, you get to decide which one you treat as the “key” in the underlying API.HMAC’s API is HMACalg(key, message), but how HKDF uses it might as well be HMACalg(key1, key2).
The difference here seems almost arbitrary, but there’s a catch.
HKDF was designed for Diffie-Hellman outputs (before ECDH was the norm), which are generally able to be much larger than the block size of the underlying hash function. 2048-bit DH results fit in 256 bytes, which is 4 times the SHA256 block size.
If you have to make a decision, using the longer input (DH output) as the message is more intuitive for analysis than using it as the key, due to pre-hashing. I’ve discussed the counter-intuitive nature of HMAC’s pre-hashing behavior at length in this post, if you’re interested.
So with ECDH, it literally doesn’t matter which one was used (unless you have a weird mismatch in hash functions and ECC groups; i.e. NIST P-521 with SHA-224).
But before the era of ECDH, it was important to use the salt as the HMAC key in the extract step, since they were necessarily smaller than a DH group element.
Thus, HKDF chose HMACalg(salt, IKM) instead of HMACalg(IKM, salt) for the calculation of PRK in the HKDF-Extract step.
Neil Madden also adds that the reverse would create a chicken-egg situation, but I personally suspect that the pre-hashing would be more harmful to the security analysis than merely supplying a non-uniformly random bit string as an HMAC key in this specific context.
My reason for believing this is, when a salt isn’t supplied, it defaults to a string of
0x00
bytes as long as the output size of the underlying hash function. If the uniform randomness of the salt mattered that much, this wouldn’t be a tolerable condition.https://soatok.blog/2021/11/17/understanding-hkdf/
#cryptographicHashFunction #cryptography #hashFunction #HMAC #KDF #keyDerivationFunction #securityDefinition #SecurityGuidance
NIST opened public comments on SP 800-108 Rev. 1 (the NIST recommendations for Key Derivation Functions) last month. The main thing that’s changed from the original document published in 2009 is the inclusion of the Keccak-based KMAC alongside the incumbent algorithms.
One of the recommendations of SP 800-108 is called “KDF in Counter Mode”. A related document, SP 800-56C, suggests using a specific algorithm called HKDF instead of the generic Counter Mode construction from SP 800-108–even though they both accomplish the same goal.
Isn’t standards compliance fun?
Interestingly, HKDF isn’t just an inconsistently NIST-recommended KDF, it’s also a common building block in a software developer’s toolkit which sees a lot of use in different protocols.
Unfortunately, the way HKDF is widely used is actually incorrect given its formal security definition. I’ll explain what I mean in a moment.
Art: Scruff
What is HKDF?
To first understand what HKDF is, you first need to know about HMAC.
HMAC is a standard message authentication code (MAC) algorithm built with cryptographic hash functions (that’s the H). HMAC is specified in RFC 2104 (yes, it’s that old).
HKDF is a key-derivation function that uses HMAC under-the-hood. HKDF is commonly used in encryption tools (Signal, age). HKDF is specified in RFC 5869.
HKDF is used to derive a uniformly-random secret key, typically for use with symmetric cryptography algorithms. In any situation where a key might need to be derived, you might see HKDF being used. (Although, there may be better algorithms.)
Art: LvJ
How Developers Understand and Use HKDF
If you’re a software developer working with cryptography, you’ve probably seen an API in the crypto module for your programming language that looks like this, or maybe this.
hash_hkdf( string $algo, string $key, int $length = 0, string $info = "", string $salt = ""): string
Software developers that work with cryptography will typically think of the HKDF parameters like so:
$algo
— which hash function to use$key
— the input key, from which multiple keys can be derived$length
— how many bytes to derive$info
— some arbitrary string used to bind a derived key to an intended context$salt
— some additional randomness (optional)
The most common use-case of HKDF is to implement key-splitting, where a single input key (the Initial Keying Material, or IKM) is used to derive two or more independent keys, so that you’re never using a single key for multiple algorithms.
See also: [url=https://github.com/defuse/php-encryption]defuse/php-encryption[/url]
, a popular PHP encryption library that does exactly what I just described.
At a super high level, the HKDF usage I’m describing looks like this:
class MyEncryptor {protected function splitKeys(CryptographyKey $key, string $salt): array { $encryptKey = new CryptographyKey(hash_hkdf( 'sha256', $key->getRawBytes(), 32, 'encryption', $salt )); $authKey = new CryptographyKey(hash_hkdf( 'sha256', $key->getRawBytes(), 32, 'message authentication', $salt )); return [$encryptKey, $authKey];}public function encryptString(string $plaintext, CryptographyKey $key): string{ $salt = random_bytes(32); [$encryptKey, $hmacKey] = $this->splitKeys($key, $salt); // ... encryption logic here ... return base64_encode($salt . $ciphertext . $mac);}public function decryptString(string $encrypted, CryptographyKey $key): string{ $decoded = base64_decode($encrypted); $salt = mb_substr($decoded, 0, 32, '8bit'); [$encryptKey, $hmacKey] = $this->splitKeys($key, $salt); // ... decryption logic here ... return $plaintext;}// ... other method here ...}
Unfortunately, anyone who ever does something like this just violated one of the core assumptions of the HKDF security definition and no longer gets to claim “KDF security” for their construction. Instead, your protocol merely gets to claim “PRF security”.
Art: Harubaki
KDF? PRF? OMGWTFBBQ?
Let’s take a step back and look at some basic concepts.
(If you want a more formal treatment, read this Stack Exchange answer.)
PRF: Pseudo-Random Functions
A pseudorandom function (PRF) is an efficient function that emulates a random oracle.
“What the hell’s a random oracle?” you ask? Well, Thomas Pornin has the best explanation for random oracles:
A random oracle is described by the following model:
- There is a black box. In the box lives a gnome, with a big book and some dice.
- We can input some data into the box (an arbitrary sequence of bits).
- Given some input that he did not see beforehand, the gnome uses his dice to generate a new output, uniformly and randomly, in some conventional space (the space of oracle outputs). The gnome also writes down the input and the newly generated output in his book.
- If given an already seen input, the gnome uses his book to recover the output he returned the last time, and returns it again.
So a random oracle is like a kind of hash function, such that we know nothing about the output we could get for a given input message m. This is a useful tool for security proofs because they allow to express the attack effort in terms of number of invocations to the oracle.
The problem with random oracles is that it turns out to be very difficult to build a really “random” oracle. First, there is no proof that a random oracle can really exist without using a gnome. Then, we can look at what we have as candidates: hash functions. A secure hash function is meant to be resilient to collisions, preimages and second preimages. These properties do not imply that the function is a random oracle.
Thomas Pornin
Alternatively, Wikipedia has a more formal definition available to the academic-inclined.
In practical terms, we can generate a strong PRF out of secure cryptographic hash functions by using a keyed construction; i.e. HMAC.
Thus, as long as your HMAC key is a secret, the output of HMAC can be generally treated as a PRF for all practical purposes. Your main security consideration (besides key management) is the collision risk if you truncate its output.
Art: LvJ
KDF: Key Derivation Functions
A key derivation function (KDF) is exactly what it says on the label: a cryptographic algorithm that derives one or more cryptographic keys from a secret input (which may be another cryptography key, a group element from a Diffie-Hellman key exchange, or a human-memorable password).
Note that passwords should be used with a Password-Based Key Derivation Function, such as scrypt or Argon2id, not HKDF.
Despite what you may read online, KDFs do not need to be built upon cryptographic hash functions, specifically; but in practice, they often are.
A notable counter-example to this hash function assumption: CMAC in Counter Mode (from NIST SP 800-108) uses AES-CMAC, which is a variable-length input variant of CBC-MAC. CBC-MAC uses a block cipher, not a hash function.
Regardless of the construction, KDFs use a PRF under the hood, and the output of a KDF is supposed to be a uniformly random bit string.
Art: LvJ
PRF vs KDF Security Definitions
The security definition for a KDF has more relaxed requirements than PRFs: PRFs require the secret key be uniformly random. KDFs do not have this requirement.
If you use a KDF with a non-uniformly random IKM, you probably need the KDF security definition.
If your IKM is already uniformly random (i.e. the “key separation” use case), you can get by with just a PRF security definition.
After all, the entire point of KDFs is to allow a congruent security level as you’d get from uniformly random secret keys, without also requiring them.
However, if you’re building a protocol with a security requirement satisfied by a KDF, but you actually implemented a PRF (i.e., not a KDF), this is a security vulnerability in your cryptographic design.
Art: LvJ
The HKDF Algorithm
HKDF is an HMAC-based KDF. Its algorithm consists of two distinct steps:
HKDF-Extract
uses the Initial Keying Material (IKM) and Salt to produce a Pseudo-Random Key (PRK).HKDF-Expand
actually derives the keys using PRK, theinfo
parameter, and a counter (from0
to255
) for each hash function output needed to generate the desired output length.
If you’d like to see an implementation of this algorithm, defuse/php-encryption
provides one (since it didn’t land in PHP until 7.1.0). Alternatively, there’s a Python implementation on Wikipedia that uses HMAC-SHA256.
This detail about the two steps will matter a lot in just a moment.
Art: Swizz
How HKDF Salts Are Misused
The HKDF paper, written by Hugo Krawczyk, contains the following definition (page 7).
The paper goes on to discuss the requirements for authenticating the salt over the communication channel, lest the attacker have the ability to influence it.
A subtle detail of this definition is that the security definition says that A salt value , not Multiple salt values.
Which means: You’re not supposed to use HKDF with a constant IKM, info label, etc. but vary the salt for multiple invocations. The salt must either be a fixed random value, or NULL.
The HKDF RFC makes this distinction even less clear when it argues for random salts.
We stress, however, that the use of salt adds significantly to the strength of HKDF, ensuring independence between different uses of the hash function, supporting “source-independent” extraction, and strengthening the analytical results that back the HKDF design.Random salt differs fundamentally from the initial keying material in two ways: it is non-secret and can be re-used. As such, salt values are available to many applications. For example, a pseudorandom number generator (PRNG) that continuously produces outputs by applying HKDF to renewable pools of entropy (e.g., sampled system events) can fix a salt value and use it for multiple applications of HKDF without having to protect the secrecy of the salt. In a different application domain, a key agreement protocol deriving cryptographic keys from a Diffie-Hellman exchange can derive a salt value from public nonces exchanged and authenticated between communicating parties as part of the key agreement (this is the approach taken in [IKEv2]).
RFC 5869, section 3.1
Okay, sure. Random salts are better than a NULL salt. And while this section alludes to “[fixing] a salt value” to “use it for multiple applications of HKDF without having to protect the secrecy of the salt”, it never explicitly states this requirement. Thus, the poor implementor is left to figure this out on their own.
Thus, because it’s not using HKDF in accordance with its security definition, many implementations (such as the PHP encryption library we’ve been studying) do not get to claim that their construction has KDF security.
Instead, they only get to claim “Strong PRF” security, which you can get from just using HMAC.
Art: LvJ
What Purpose Do HKDF Salts Actually Serve?
Recall that the HKDF algorithm uses salts in the HDKF-Extract step. Salts in this context were intended for deriving keys from a Diffie-Hellman output, or a human-memorable password.
In the case of [Elliptic Curve] Diffie-Hellman outputs, the result of the key exchange algorithm is a random group element, but not necessarily uniformly random bit string. There’s some structure to the output of these functions. This is why you always, at minimum, apply a cryptographic hash function to the output of [EC]DH before using it as a symmetric key.
HKDF uses salts as a mechanism to improve the quality of randomness when working with group elements and passwords.
Extending the nonce for a symmetric-key AEAD mode is a good idea, but using HKDF’s salt parameter specifically to accomplish this is a misuse of its intended function, and produces a weaker argument for your protocol’s security than would otherwise be possible.
How Should You Introduce Randomness into HKDF?
Just shove it in the info
parameter.
Art: LvJ
It may seem weird, and defy intuition, but the correct way to introduce randomness into HKDF as most developers interact with the algorithm is to skip the salt parameter entirely (either fixing it to a specific value for domain-separation or leaving it NULL), and instead concatenate data into the info
parameter.
class BetterEncryptor extends MyEncryptor {protected function splitKeys(CryptographyKey $key, string $salt): array { $encryptKey = new CryptographyKey(hash_hkdf( 'sha256', $key->getRawBytes(), 32, $salt . 'encryption', '' // intentionally empty )); $authKey = new CryptographyKey(hash_hkdf( 'sha256', $key->getRawBytes(), 32, $salt . 'message authentication', '' // intentionally empty )); return [$encryptKey, $authKey];}}
Of course, you still have to watch out for canonicalization attacks if you’re feeding multi-part messages into the info tag.
Another advantage: This also lets you optimize your HKDF calls by caching the PRK from the HKDF-Extract
step and reuse it for multiple invocations of HKDF-Expand
with a distinct info
. This allows you to reduce the number of hash function invocations from to (since each HMAC involves two hash function invocations).
Notably, this HKDF salt usage was one of the things that was changed in V3/V4 of PASETO.
Does This Distinction Really Matter?
If it matters, your cryptographer will tell you it matters–which probably means they have a security proof that assumes the KDF security definition for a very good reason, and you’re not allowed to violate that assumption.
Otherwise, probably not. Strong PRF security is still pretty damn good for most threat models.
Art: LvJ
Closing Thoughts
If your takeaway was, “Wow, I feel stupid,” don’t, because you’re in good company.
I’ve encountered several designs in my professional life that shoved the randomness into the info
parameter, and it perplexed me because there was a perfectly good salt parameter right there. It turned out, I was wrong to believe that, for all of the subtle and previously poorly documented reasons discussed above. But now we both know, and we’re all better off for it.
So don’t feel dumb for not knowing. I didn’t either, until this was pointed out to me by a very patient colleague.
“Feeling like you were stupid” just means you learned.
(Art: LvJ)
Also, someone should really get NIST to be consistent about whether you should use HKDF or “KDF in Counter Mode with HMAC” as a PRF, because SP 800-108’s new revision doesn’t concede this point at all (presumably a relic from the 2009 draft).
This concession was made separately in 2011 with SP 800-56C revision 1 (presumably in response to criticism from the 2010 HKDF paper), and the present inconsistency is somewhat vexing.
(On that note, does anyone actually use the NIST 800-108 KDFs instead of HKDF? If so, why? Please don’t say you need CMAC…)
Bonus Content
These questions were asked after this blog post initially went public, and I thought they were worth adding. If you ask a good question, it may end up being edited in at the end, too.
Art: LvJ
Why Does HKDF use the Salt as the HMAC key in the Extract Step? (via r/crypto)
Broadly speaking, when applying a PRF to two “keys”, you get to decide which one you treat as the “key” in the underlying API.
HMAC’s API is HMACalg(key, message), but how HKDF uses it might as well be HMACalg(key1, key2).
The difference here seems almost arbitrary, but there’s a catch.
HKDF was designed for Diffie-Hellman outputs (before ECDH was the norm), which are generally able to be much larger than the block size of the underlying hash function. 2048-bit DH results fit in 256 bytes, which is 4 times the SHA256 block size.
If you have to make a decision, using the longer input (DH output) as the message is more intuitive for analysis than using it as the key, due to pre-hashing. I’ve discussed the counter-intuitive nature of HMAC’s pre-hashing behavior at length in this post, if you’re interested.
So with ECDH, it literally doesn’t matter which one was used (unless you have a weird mismatch in hash functions and ECC groups; i.e. NIST P-521 with SHA-224).
But before the era of ECDH, it was important to use the salt as the HMAC key in the extract step, since they were necessarily smaller than a DH group element.
Thus, HKDF chose HMACalg(salt, IKM) instead of HMACalg(IKM, salt) for the calculation of PRK in the HKDF-Extract step.
Neil Madden also adds that the reverse would create a chicken-egg situation, but I personally suspect that the pre-hashing would be more harmful to the security analysis than merely supplying a non-uniformly random bit string as an HMAC key in this specific context.
My reason for believing this is, when a salt isn’t supplied, it defaults to a string of 0x00
bytes as long as the output size of the underlying hash function. If the uniform randomness of the salt mattered that much, this wouldn’t be a tolerable condition.
https://soatok.blog/2021/11/17/understanding-hkdf/
#cryptographicHashFunction #cryptography #hashFunction #HMAC #KDF #keyDerivationFunction #securityDefinition #SecurityGuidance
As we look upon the sunset of a remarkably tiresome year, I thought it would be appropriate to talk about cryptographic wear-out.What is cryptographic wear-out?
It’s the threshold when you’ve used the same key to encrypt so much data that you should probably switch to a new key before you encrypt any more. Otherwise, you might let someone capable of observing all your encrypted data perform interesting attacks that compromise the security of the data you’ve encrypted.My definitions always aim to be more understandable than pedantically correct.
(Art by Swizz)The exact value of the threshold varies depending on how exactly you’re encrypting data (n.b. AEAD modes, block ciphers + cipher modes, etc. each have different wear-out thresholds due to their composition).
Let’s take a look at the wear-out limits of the more popular symmetric encryption methods, and calculate those limits ourselves.
Specific Ciphers and Modes
(Art by Khia. Poorly edited by the author.)Cryptographic Limits for AES-GCM
I’ve written about AES-GCM before (and why I think it sucks).AES-GCM is a construction that combines AES-CTR with an authenticator called GMAC, whose consumption of nonces looks something like this:
- Calculating H (used in GHASH for all messages encrypted under the same key, regardless of nonce):
Encrypt(00000000 00000000 00000000 00000000)
- Calculating J0 (the pre-counter block):
- If the nonce is 96 bits long:
NNNNNNNN NNNNNNNN NNNNNNNN 00000001
where theN
spaces represent the nonce hexits.
- Otherwise:
s = 128 * ceil(len(nonce)/nonce) - len(nonce)
J0 = GHASH(H, nonce || zero(s+64) || int2bytes(len(nonce))
- Each block of data encrypted uses J0 + block counter (starting at 1) as a CTR nonce.
- J0 is additionally used as the nonce to calculate the final GMAC tag.
AES-GCM is one of the algorithms where it’s easy to separately calculate the safety limits per message (i.e. for a given nonce and key), as well as for all messages under a key.
AES-GCM Single Message Length Limits
In the simplest case (nonce is 96 bits), you end up with the following nonces consumed:
- For each key:
00000000 00000000 00000000 00000000
- For each (nonce, key) pair:
NNNNNNNN NNNNNNNN NNNNNNNN 000000001
for J0NNNNNNNN NNNNNNNN NNNNNNNN 000000002
for encrypting the first 16 bytes of plaintextNNNNNNNN NNNNNNNN NNNNNNNN 000000003
for the next 16 bytes of plaintext…- …
NNNNNNNN NNNNNNNN NNNNNNNN FFFFFFFFF
for the final 16 bytes of plaintext.
From here, it’s pretty easy to see that you can encrypt the blocks from
00000002
toFFFFFFFF
without overflowing and creating a nonce reuse. This means that each (key, nonce) can be used to encrypt a single message up to blocks of the underlying ciphertext.Since the block size of AES is 16 bytes, this means the maximum length of a single AES-GCM (key, nonce) pair is bytes (or 68,719,476,480 bytes). This is approximately 68 GB or 64 GiB.
Things get a bit tricker to analyze when the nonce is not 96 bits, since it’s hashed.
The disadvantage of this hashing behavior is that it’s possible for two different nonces to produce overlapping ranges of AES-CTR output, which makes the security analysis very difficult.
However, this hashed output is also hidden from network observers since they do not know the value of H. Without some method of reliably detecting when you have an overlapping range of hidden block counters, you can’t exploit this.
(If you want to live dangerously and motivate cryptanalysis research, mix 96-bit and non-96-bit nonces with the same key in a system that does something valuable.)
Multi-Message AES-GCM Key Wear-Out
Now that we’ve established the maximum length for a single message, how many messages you can safely encrypt under a given AES-GCM key depends entirely on how your nonce is selected.If you have a reliable counter, which is guaranteed to never repeat, and start it at 0 you can theoretically encrypt messages safely. Hooray!
Hooray!
(Art by Swizz)You probably don’t have a reliable counter, especially in real-world settings (distributed systems, multi-threaded applications, virtual machines that might be snapshotted and restored, etc.).
Confound you, technical limitations!
(Art by Swizz)Additionally (thanks to 2adic for the expedient correction), you cannot safely encrypt more than blocks with AES because the keystream blocks–as the output of a block cipher–cannot repeat.
Most systems that cannot guarantee unique incrementing nonces simply generate nonces with a cryptographically secure random number generator. This is a good idea, but no matter how high quality your random number generator is, random functions will produce collisions with a discrete probability.
If you have possible values, you should expect a single collision(with 50% probability) after (or )samples. This is called the birthday bound.
However, 50% of a nonce reuse isn’t exactly a comfortable safety threshold for most systems (especially since nonce reuse will cause AES-GCM to become vulnerable to active attackers). 1 in 4 billion is a much more comfortable safety margin against nonce reuse via collisions than 1 in 2. Fortunately, you can calculate the discrete probability of a birthday collision pretty easily.
If you want to rekey after your collision probability exceeds (for a random nonce between 0 and ), you simply need to re-key after messages.
AES-GCM Safety Limits
- Maximum message length: bytes
- Maximum number of messages (random nonce):
- Maximum number of messages (sequential nonce): (but you probably don’t have this luxury in the real world)
- Maximum data safely encrypted under a single key with a random nonce: about bytes
Not bad, but we can do better.
(Art by Khia.)Cryptographic Limits for ChaCha20-Poly1305
The IETF version of ChaCha20-Poly1305 uses 96-bit nonces and 32-bit internal counters. A similar analysis follows from AES-GCM’s, with a few notable exceptions.For starters, the one-time Poly1305 key is derived from the first 32 bytes of the ChaCha20 keystream output (block 0) for a given (nonce, key) pair. There is no equivalent to AES-GCM’s H parameter which is static for each key. (The ChaCha20 encryption begins using block 1.)
Additionally, each block for ChaCha20 is 512 bits, unlike AES’s 128 bits. So the message limit here is a little more forgiving.
Since the block size is 512 bits (or 64 bytes), and only one block is consumed for Poly1305 key derivation, we can calculate a message length limit of , or 274,877,906,880 bytes–nearly 256 GiB for each (nonce, key) pair.
The same rules for handling 96-bit nonces applies as with AES-GCM, so we can carry that value forward.
ChaCha20-Poly1305 Safety Limits
- Maximum message length: bytes
- Maximum number of messages (random nonce):
- Maximum number of messages (sequential nonce): (but you probably don’t have this luxury in the real world)
- Maximum data safely encrypted under a single key with a random nonce: about bytes
A significant improvement, but still practically limited.
(Art by Khia.)Cryptographic Limits for XChaCha20-Poly1305
XChaCha20-Poly1305 is a variant of XSalsa20-Poly1305 (as used in libsodium) and the IETF’s ChaCha20-Poly1305 construction. It features 192-bit nonces and 32-bit internal counters.XChaCha20-Poly1305 is instantiated by using HChaCha20 of the key over the first 128 bits of the nonce to produce a subkey, which is used with the remaining nonce bits using the aforementioned ChaCha20-Poly1305.
This doesn’t change the maximum message length,but it does change the number of messages you can safely encrypt (since you’re actually using up to distinct keys).Thus, even if you manage to repeat the final ChaCha20-Poly1305 nonce, as long as the total nonce differs, each encryptions will be performed with a distinct key (thanks to the HChaCha20 key derivation; see the XSalsa20 paper and IETF RFC draft for details).
UPDATE (2021-04-15): It turns out, my read of the libsodium implementation was erroneous due to endian-ness. The maximum message length for XChaCha20-Poly1305 is blocks, and for AEAD_XChaCha20_Poly1305 is blocks. Each block is 64 bytes, so that changes the maximum message length to about . This doesn’t change the extended-nonce details, just the underlying ChaCha usage.
XChaCha20-Poly1305 Safety Limits
- Maximum message length: bytes (earlier version of this document said
)- Maximum number of messages (random nonce):
- Maximum number of messages (sequential nonce): (but you probably don’t have this luxury in the real world)
- Maximum data safely encrypted under a single key with a random nonce: about bytes
I can
seeencrypt forever, man.
(Art by Khia.)Cryptographic Limits for AES-CBC
It’s tempting to compare non-AEAD constructions and block cipher modes such as CBC (Cipher Block Chaining), but they’re totally different monsters.
- AEAD ciphers have a clean delineation between message length limit and the message quantity limit
- CBC and other cipher modes do not have this separation
Every time you encrypt a block with AES-CBC, you are depleting from a universal bucket that affects the birthday bound security of encrypting more messages under that key. (And unlike AES-GCM with long nonces, AES-CBC’s IV is public.)
This is in addition to the operational requirements of AES-CBC (plaintext padding, initialization vectors that never repeat and must be unpredictable, separate message authentication since CBC doesn’t provide integrity and is vulnerable to chosen-ciphertext atacks, etc.).
My canned response to most queries about AES-CBC.
(Art by Khia.)For this reason, most cryptographers don’t even bother calculating the safety limit for AES-CBC in the same breath as discussing AES-GCM. And they’re right to do so!
If you find yourself using AES-CBC (or AES-CTR, for that matter), you’d best be performing a separate HMAC-SHA256 over the ciphertext (and verifying this HMAC with a secure comparison function before decrypting). Additionally, you should consider using an extended nonce construction to split one-time encryption and authentication keys.
(Art by Riley.)
However, for the sake of completeness, let’s figure out what our practical limits are.
CBC operates on entire blocks of plaintext, whether you need the entire block or not.
On encryption, the output of the previous block is mixed (using XOR) with the current block, then encrypted with the block cipher. For the first block, the IV is used in the place of a “previous” block. (Hence, its requirements to be non-repeating and unpredictable.)
This means you can informally model (IV xor PlaintextBlock) and (PBn xor PBn+1) as a pseudo-random function, before it’s encrypted with the block cipher.
If those words don’t mean anything to you, here’s the kicker: You can use the above discussion about birthday bounds to calculate the upper safety bounds for the total number of blocks encrypted under a single AES key (assuming IVs are generated from a secure random source).
If you’re okay with a 50% probability of a collision, you should re-key after blocks have been encrypted.
https://www.youtube.com/watch?v=v0IsYNDMV7A
If your safety margin is closer to the 1 in 4 billion (as with AES-GCM), you want to rekey after blocks.
However, blocks encrypted doesn’t map neatly to bytes encrypted.
If your plaintext is always an even multiple of 128 bits (or 16 bytes), this allows for up to bytes of plaintext. If you’re using PKCS#7 padding, keep in mind that this will include an entire padding block per message, so your safety margin will deplete a bit faster (depending on how many individual messages you encrypt, and therefore how many padding blocks you need).
On the other extreme (1-byte plaintexts), you’ll only be able to eek encrypted bytes before you should re-key.
Therefore, to stay within the safety margin of AES-CBC, you SHOULD re-key after blocks (including padding) have been encrypted.
Keep in mind: single-byte blocks is still approximately 281 TiB of data (including padding). On the upper end, 15-byte blocks (with 1-byte padding to stay within a block) clocks in at about or about 4.22 PiB of data.
That’s Blocks. What About Bytes?
The actual plaintext byte limit sans padding is a bit fuzzy and context-dependent.The local extrema occurs if your plaintext is always 16 bytes (and thus requires an extra 16 bytes of padding). Any less, and the padding fits within one block. Any more, and the data😛adding ratio starts to dominate.
Therefore, the worst case scenario with padding is that you take the above safety limit for block counts, and cut it in half. Cutting a number in half means reducing the exponent by 1.
But this still doesn’t eliminate the variance. blocks could be anywhere from to bytes of real plaintext. When in situations like this, we have to assume the worst (n.b. take the most conservative value).
Therefore…
AES-CBC Safety Limits
- Maximum data safely encrypted under a single key with a random nonce: bytes (approximately 141 TiB)
Yet another reason to dislike non-AEAD ciphers.
(Art by Khia.)Take-Away
Compared to AES-CBC, AES-GCM gives you approximately a million times as much usage out of the same key, for the same threat profile.ChaCha20-Poly1305 and XChaCha20-Poly1305 provides even greater allowances of encrypting data under the same key. The latter is even safe to use to encrypt arbitrarily large volumes of data under a single key without having to worry about ever practically hitting the birthday bound.
I’m aware that this blog post could have simply been a comparison table and a few footnotes (or even an IETF RFC draft), but I thought it would be more fun to explain how these values are derived from the cipher constructions.
(Art by Khia.)
https://soatok.blog/2020/12/24/cryptographic-wear-out-for-symmetric-encryption/
#AES #AESCBC #AESGCM #birthdayAttack #birthdayBound #cryptography #safetyMargin #SecurityGuidance #symmetricCryptography #symmetricEncryption #wearOut
Imagine you’re a software developer, and you need to authenticate users based on a username and password.
If you’re well-read on the industry standard best practices, you’ll probably elect to use something like bcrypt, scrypt, Argon2id, or PBKDF2. (If you thought to use something else, you’re almost certainly doing it wrong.)
Let’s say, due to technical constraints, that you’re forced to select PBKDF2 out of the above options, but you’re able to use a suitable hash function (i.e. SHA-256 or SHA-512, instead of SHA-1).
Every password is hashed with a unique 256-bit salt, generated from a secure random generator, and has an iteration count in excess of 80,000 (for SHA-512) or 100,000 (for SHA-256). So far, so good.
(Art by Khia.)
Let’s also say your users’ passwords are long, randomly-generated strings. This can mean either you’re building a community of password manager aficionados, or your users are actually software bots with API keys.
Every user’s password is, therefore, always a 256 character string of random hexadecimal characters.
Two sprints before your big launch, your boss calls you and says, “Due to arcane legal compliance reasons, we need to be able to detect duplicate passwords. It needs to be fast.”
You might might think, “That sounds like a hard problem to solve.”
Your boss insists, “The passwords are all too long and too high-entropy to reliably attacked, even when we’re using a fast hash, so we should be able to get by with a simple hash function for deduplication purposes.”
So you change your code to look like this:
<?phpuse ParagonIE\EasyDB\EasyDB;use MyApp\Security\PasswordHash;class Authentication { private EasyDB $db; public function __construct(EasyDB $db) { $this->db = $db; $this->pwhash = new PasswordHash('sha256'); } public function signUp(string $username, string $password) { // New: $hashed = hash('sha256', $password); if ($db->exists( "SELECT count(id) FROM users WHERE pw_dedupe = ?", $hashed )) { throw new Exception( "You cannot use the same username or password as another user." ); } // Original: if ($db->exists( "SELECT count(id) FROM users WHERE username = ?", $username )) { throw new Exception( "You cannot use the same username or password as another user." ); } $this->db->insert('users', [ 'username' => $username, 'pwhash' => $this->pwhash->hash($password), // Also new: 'pw_dedupe' => $hashed ]); } public function logIn(string $username, string $password): int { // This is unchanged: $user = $this->db->row("SELECT * FROM users WHERE username = ?", $username); if (empty($user)) { throw new Exception("Security Error: No user"); } if ($this->pwhash->verify($password, $user['pwhash'])) { throw new Exception("Security Error: Incorrect password"); } return (int) $user['id']; }}
Surely, this won’t completely undermine the security of the password hashing algorithm?
Is this a face that would mislead you? (Art by Khia.)
Three years later, a bored hacker discovers a read-only SQL injection vulnerability, and is immediately able to impersonate any user in your application.
What could have possibly gone wrong?
Cryptography Bugs Are Subtle
https://twitter.com/SwiftOnSecurity/status/832058185049579524
In order to understand what went wrong in our hypothetical scenario, you have to kinda know what’s under the hood in the monstrosity that our hypothetical developer slapped together.
PBKDF2
PBKDF2 stands for Password-Based Key Derivation Function #2. If you’re wondering where PBKDF1 went, I discussed its vulnerability in another blog post about the fun we can have with hash functions.
PBKDF2 is supposed to be instantiated with some kind of pseudo-random function, but it’s almost always HMAC with some hash function (SHA-1, SHA-256, and SHA-512 are common).
So let’s take a look at HMAC.
HMAC
HMAC is a Message Authentication Code algorithm based on Hash functions. Its definition looks like this:
Source: Wikipedia
For the purposes of this discussion, we only care about the definition of K’.
If your key (K) is larger than the block size of the hash function, it gets pre-hashed with that hash function. If it’s less than or equal to the block size of the hash function, it doesn’t.
So, what is PBKDF2 doing with K?
How PBKDF2 Uses HMAC
The RFCs that define PBKDF2 aren’t super helpful here, but one needs only look at a reference implementation to see what’s happening.
// first iteration$last = $xorsum = hash_hmac($algorithm, $last, $password, true);// perform the other $count - 1 iterationsfor ($j = 1; $j < $count; $j++) {$xorsum ^= ($last = hash_hmac($algorithm, $last, $password, true));}$output .= $xorsum;
If you aren’t fluent in PHP, it’s using the previous round’s output as the message (m in the HMAC definition) and the password as the key (K).
Which means if your password is larger than the block size of the hash function used by HMAC (which is used by PBKDF2 under the hood), it will be pre-hashed.
The Dumbest Exploit Code
Okay, with all of that in mind, here’s how you can bypass the authentication of the above login script.
$httpClient->loginAs( $user['username'], hex2bin($user['pw_dedupe']));
Since K’ = H(K) when Length(K) > BlockSize(H), you can just take the pre-calculated hash of the user’s password and use that as the password (although not hex-encoded), and it will validate. (Demo.)
This is probably how you feel right now, if you didn’t already know about this.
(Art by Khia.)
Above, when HMAC defined K’, it included a variant in its definition. Adding variants to hash function is a dumb way to introduce the risk of collisions. (See also: Trivial Collisions in IOTA’s Kerl hash function.)
Also, some PBKDF2 implementations that mishandle the pre-hashing of K, which can lead to denial-of-service attacks. Fun stuff, right?
Some Dumb Yet Effective Mitigations
Here are a few ways the hypothetical developer could have narrowly avoided this security risk:
- Use a different hash function for PBKDF2 and the deduplication check.
- e.g. PBKDF2-SHA256 and SHA512 would be impervious
- Always pre-hashing the password before passing it to PBKDF2.
- Our exploit attempt will effectively be double-hashed, and will not succeed.
- Use HMAC instead of a bare hash for the deduplication. With a hard-coded, constant key.
- Prefix the password with a distinct domain separation constant in each usage.
- Truncate the hash function, treating it as a Bloom filter.
- Some or all of the above.
- Push back on implementing this stupid feature to begin with.
Subtle problems often (but not always) call for subtle solutions.
Password Security Advice, Revisited
One dumb mitigation technique I didn’t include above is, “Using passwords shorter than the block size of the hash function.” The information security industry almost unanimously agrees that longer passwords are better than short ones (assuming both have the same distribution of possible characters at each position in the string, and are chosen randomly).
But here, we have a clear-cut example where having a longer password actually decreased security by allowing a trivial authentication bypass. Passwords with a length less than or equal to the block size of the hash function would not be vulnerable to the attack.
For reference, if you ever wanted to implement this control (assuming ASCII characters, where 1 character = 1 byte):
- 64-character passwords would be the upper limit for MD5, SHA-1, SHA-224, and SHA-256.
- 128-character passwords would be the upper limit for SHA-384 and SHA-512.
- (Since the above example used 256-character passwords, it always overflows.)
But really, one of the other mitigation techniques would be better. Limiting the input length is just a band-aid.
Another Non-Obvious Footgun
Converse to the block size of the hash function, the output size of the hash function also affects the security of PBKDF2, because each output is calculated sequentially. If you request more bytes out of a PBKDF2 function than the hash function outputs, it will have to work at least twice as hard.
This means that, while a legitimate user has to calculate the entire byte stream from PBKDF2, an attacker can just target the first N bytes (20 for SHA-1, 32 for SHA-256, 64 for SHA-512) of the raw hash output and attack that instead of doing all the work.
Is Cryptography Hard?
I’ve fielded a lot of questions in the past few weeks, and consequently overheard commentary from folks insisting that cryptography is a hard discipline and therefore only the smartest should study it. I disagree with that, but not for the reason that most will expect.
Many of the people reading this page will tell me, “This is such a dumb and obvious design flaw. Nobody would ship that code in production,” because they’re clever or informed enough to avoid disaster.
Yet, we still regularly hear about JWT alg=none vulnerabilities. And Johnny still still can’t encrypt. So cryptography surely can’t be easy.
Art by Riley.
Real-world cryptographic work is cognitively demanding. It becomes increasingly mathematically intensive the deeper you study, and nobody knows how to name things that non-cryptographers can understand. (Does anyone intuitively understand what a public key is outside of our field?)
In my opinion, cryptography is certainly challenging and widely misunderstood. There are a lot of subtle ways for secure building blocks to be stitched together insecurely. And today, it’s certainly a hard subject for most people to learn.
However, I don’t think its hardness is unavoidable.
Descending the Mohs Scale
There is a lot of gatekeeping in the information security community. (Maybe we should expect that to happen when people spend too many hours staring at firewall rules?)
“Don’t roll your own crypto” is something you hear a lot more often from software engineers and network security professionals than bona fide cryptographers.
Needless gatekeeping does the cryptographic community a severe disservice. I’m not the first to suggest that, either.
From what I’ve seen, far too many people feel discouraged from learning about cryptography, thinking they aren’t smart enough for it. It would seem A Mathematician’s Lament has an echo.
If it can be said that cryptography is hard today, then that has more to do with incomplete documentation and substandard education resources available for free online than some inherent mystical quality of our discipline that only the christened few can ever hope to master.
Cryptographers and security engineers that work in cryptography need more developers to play in our sandbox. We need more usability experts to discuss improvements to our designs and protocols to make them harder to misuse. We also need science communication experts and a legal strategy to stop idiotic politicians from trying to foolishly ban encryption.
So if you’re reading my blog, wondering, “Will I ever be smart enough to understand this crap?” the answer is probably “Yes!” It might not be easy, but with any luck, it’ll be easier than it was for the forerunners in our field.
If a random furry can grok these topics, what’s stopping you? This dhole believes in you.
(Art by Khia.)
I have one request, though: If you’re thinking of learning cryptography, please blog about your journey (even if you’re not a furry). You may end up filling in a pothole that would otherwise have tripped someone else up.
https://soatok.blog/2020/11/27/the-subtle-hazards-of-real-world-cryptography/
#computerSecurity #cryptographicHashFunction #cryptography #dumbExploit #HMAC #infosec #passwordHashing #PBKDF2 #SecurityGuidance
There are several different methods for securely hashing a password server-side for storage and future authentication. The most common one (a.k.a. the one that FIPS allows you to use, if compliance matters for you) is called PBKDF2. It stands for Password-Based Key Derivation Function #2.Why #2? It’s got nothing to do with pencils. There was, in fact, a PBKDF1! But PBKDF1 was fatally insecure in a way I find very interesting. This StackOverflow answer is a great explainer on the difference between the two.
Very Hand-Wavy Description of a Hash Function
Let’s defined a hash function as any one-way transformation of some arbitrary-length string () to a fixed-length, deterministic, pseudo-random output.Note: When in doubt, I err on the side of being easily understood by non-experts over pedantic precision. (Art by Swizz)
For example, this is a dumb hash function (uses SipHash-2-4 with a constant key):
function dumb_hash(string $arbitrary, bool $raw_binary = false): string{ $h = sodium_crypto_shorthash($arbitrary, 'SoatokDreamseekr'); if ($raw_binary) { return $h; } return sodium_bin2hex($h);}
You can see the output of this function with some sample inputs here.
Properties of Hash Functions
A hash function is considered secure if it has the following properties:
- Pre-image resistance. Given , it should be difficult to find .
- Second pre-image resistance. Given , it should be difficult to find such that
- Collision resistance. It should be difficult to find any arbitrary pair of messages () such that
That last property, collision resistance, is guaranteed up to the Birthday Bound of the hash function. For a hash function with a 256-bit output, you will expect to need on average trial messages to find a collision.
If you’re confused about the difference between collision resistance and second pre-image resistance:
Collision resistance is about finding any two messages that produce the same hash, but you don’t care what the hash is as long as two distinct but known messages produce it.
On the other paw, second pre-image resistance is about finding a second message that produces a given hash.
Exploring PBKDF1’s Insecurity
If you recall, hash functions map an arbitrary-length string to a fixed-length string. If your input size is larger than your output size, collisions are inevitable (albeit computationally infeasible for hash functions such as SHA-256).But what if your input size is equal to your output size, because you’re taking the output of a hash function and feeding it directly back into the same hash function?
Then, as explained here, you get an depletion of the possible outputs with each successive iteration.
But what does that look like?
Without running the experiments on a given hash function, there are two possibilities that come to mind:
- Convergence. This is when will, for two arbitrary messages and a sufficient number of iterations, converge on a single hash output.
- Cycles. This is when for some integer .
The most interesting result would be a quine, which is a cycle where (that is to say, ).
The least interesting result would be for random inputs to converge into large cycles e.g. cycles of size for a 256-bit hash function.
I calculated this lazily as the birthday bound of the birthday bound (so basically the 4th root, which for is ).Update: According to this 1960 paper, the average time to cycle is , and cycle length should be , which means for a 256-bit hash you should expect a cycle after about or about 128 bits, and the average cycle length will be about . Thanks Riastradh for the corrections.
Conjecture: I would expect secure cryptographic hash functions in use today (e.g. SHA-256) to lean towards the least interesting output.
An Experiment Design
Since I don’t have an immense amount of cheap cloud computing at my disposal to run this experiments on a real hash function, I’m going to cheat a little and use my constant-key SipHash code from earlier in this post. In future work, cryptographers may find studying real hash functions (e.g. SHA-256) worthwhile.Given that SipHash is a keyed pseudo-random function with an output size of 64 bits, my dumb hash function can be treated as a 64-bit hash function.
This means that you should expect your first collision (with 50% probability) after only trial hashes. This is cheap enough to try on a typical laptop.
Here’s a simple experiment for convergence:
- Generate two random strings .
- Set .
- Iterate until .
You can get the source code to run this trivial experiment here.
Clone the git repository, run
composer install
, and thenphp bin/experiment.php
. Note that this may need to run for a very long time before you get a result.If you get a result, you’ve found a convergence.
If the loop doesn’t terminate even after 2^64 iterations, you’ve definitely found a cycle. (Actually detecting a cycle and analyzing its length would require a lot of memory, and my simple PHP script isn’t suitable for this effort.)
“What do you mean I don’t have a petabyte of RAM at my disposal?”
What Does This Actually Tell Us?
The obvious lesson: Don’t design key derivation functions like PBKDF1.But beyond that, unless you can find a hash function that reliably converges or produces short cycles (, for an n-bit hash function), not much. (This is just for fun, after all!)
Definitely fun to think about though! (Art by circuitslime)
If, however, a hash function is discovered to produce interesting results, this may indicate that the chosen hash function’s internal design is exploitable in some subtle ways that, upon further study, may lead to better cryptanalysis techniques. Especially if a hash quine is discovered.
(Header art by Khia)
https://soatok.blog/2020/05/05/putting-the-fun-in-hash-function/
#crypto #cryptography #hashFunction #SipHash
Canonicalization Attacks occur when a protocol that feeds data into a hash function used in a Message Authentication Code (MAC) or Digital Signature calculation fails to ensure some property that’s expected of the overall protocol.
The textbook example of a canonicalization attack is the length-extension attack against hash functions such as MD5–which famously broke the security of Flickr’s API signatures.
But there’s a more interesting attack to think about, which affects the design of security token/envelope formats (PASETO, DSSE, etc.) and comes up often when folks try to extend basic notions of authenticated encryption (AE) to include additional authenticated (but unencrypted) data (thus yielding an AEAD mode).
Let’s start with a basic AE definition, then extend it to AEAD poorly, then break our extension. Afterwards, we can think about strategies for doing it better.
Turning CTR+HMAC into AEAD
Signal uses AES-CBC then HMAC-SHA2 to encrypt messages between mobile devices.
We often refer to this shape as “CBC+HMAC” (although this is a few typos away from being confused with a very different idea called CBC-MAC).
When CBC+HMAC is used with the AES block cipher with 256-bit keys and HMAC-SHA2, it becomes AES-256-CBC+HMAC-SHA256. What a mouthful!
Yuck! Who let a cryptography nerd name anything?
(Art by Lynx vs Jackalope)
In modern designs, AES-CTR is often preferable to AES-CBC, since you don’t have to deal with padding (or padding oracles).
Most systems these days prefer GCM over CBC+HMAC or CTR+HMAC. I don’t like AES-GCM (especially if your use-case is “support platforms without hardware acceleration”), but it’s hardly the worst choice for most applications. AES-GCM is a dedicated AEAD mode, while CBC+HMAC and CTR+HMAC merely provide AE.
Why Does Additional Data Matter?
Art: Harubaki
The main purpose of Additional Data (the AD in AEAD) is to bind an encrypted payload (ciphertext + authentication tag) to a given context. This is extremely helpful in mitigating Confused Deputy attacks.
Critically, this additional data is not encrypted. (At least, on this level; it’s probably wise to communicate over HTTPS!)
Naive CTR+HMAC to AEAD Extension
In a normal CTR+HMAC definition, your algorithm looks something like this:
- Generate a random nonce equal to the block size of your block cipher. (16 bytes for AES.)
- Encrypt your message with AES-CTR, using the given key and IV.
- Calculate the HMAC-SHA2 output of the IV followed by the ciphertext from step 2.
- Return IV, Ciphertext, MAC.
Decryption basically runs steps 3 and 2 in reverse: Recalculate the MAC (in constant-time!), decrypt ciphertext, return plaintext.
The most obvious way to extend this design to support additional authenticated data is to append it to the ciphertext.
This yields the following updated protocol:
- Generate a random nonce equal to the block size of your block cipher. (16 bytes for AES.)
- Encrypt your message with AES-CTR, using the given key and nonce.
- Calculate the HMAC-SHA2 output of the IV followed by the ciphertext from step 2, then the additional authenticated data.
- Return IV, Ciphertext, MAC.
Suffice to say, this is not a secure construction.
The Canonicalization Attack
Let’s say you built this extended protocol to encrypt a payload that looks like a URI string, but wanted to bind the token to a given browser session, so you use the HTTP User-Agent header as the AAD.
When you generate a token, you might produce the following:
const crypto = require('crypto');function splitKey(key) { let hmac; hmac = crypto.createHmac('sha256', key); hmac.update('encrypt-key'); let Ek = hmac.digest(); hmac = crypto.createHmac('sha256', key); hmac.update('hmac-key'); let Ak = hmac.digest(); return [Ek, Ak];}function encryptWithContext(plaintext, aad, key) { let [encKey, authKey] = splitKey(key); console.log(encKey, authKey); let nonce = crypto.randomBytes(16); const aes = crypto.createCipheriv('aes-256-ctr', encKey, nonce); const ciphertext = aes.update(plaintext); aes.final(); // Pay attention to this part: const hmac = crypto.createHmac('sha256', authKey); hmac.update(nonce); hmac.update(ciphertext); hmac.update(aad); return [nonce, ciphertext, hmac.digest()];}let plaintext = [ 'expires=1630223780', 'access_group=1234', 'subject=auth-v2.example.com', 'restrictions=on'].join('&');// expires=1630223780&access_group=1234&subject=auth-v2.example.com&restrictions=on// const key = crypto.randomBytes(32);let [nonce, ciphertext, tag] = encryptWithContext(plaintext, userAgent, key);
So here’s the clever attack: If you can shift bytes from the payload into the prefix of your User-Agent string, they’ll produce the same HMAC tag.
Attackers can truncate as much of the payload as they want by prepending it to the User-Agent included in their HTTP request.
You can even turn this:
expires=1630223780&access_group=1234&subject=auth-v2.example.com&restrictions=on
…into this:
expires=1630223780&access_group=1234&subject=auth-v2.example.com
…without invalidating the existing authentication tag–just by ensuring that the last 16 bytes of ciphertext are prepended to your User-Agent and removed from the payload.
More broadly, any time you have a multi-part message being fed into a hash function, if you aren’t careful with how you feed it into the hash function, you can induce trivial collisions.
See also: Iota’s Kerl hash function.
This is obviously true, because hash functions are deterministic: The same input will always produce the same output. If you can control one or more parts of a multi-part message, you can collide the input–thereby creating a collision in the output.
This can affect any protocol that depends on hash functions, but most obviously, HMAC and Digital Signature algorithms are in scope.
But what does “being careful” look like? Let’s look at a safe example.
Pre-Authentication Encoding (PAE)
Earlier I had mentioned PASETO and DSSE. They both have this notion of a “PAE” algorithm which aims to prevent canonicalization attacks.
PASETO’s definiton for PAE is as follows:
function LE64(n) { var str = ''; for (var i = 0; i < 8; ++i) { if (i === 7) { // Clear the MSB for interoperability n &= 127; } str += String.fromCharCode(n & 255); n = n >>> 8; } return str;}function PAE(pieces) { if (!Array.isArray(pieces)) { throw TypeError('Expected an array.'); } var count = pieces.length; var output = LE64(count); for (var i = 0; i < count; i++) { output += LE64(pieces.length); /*** Soatok Note: This JS pseudocode incorrectly assumes strings, rather than buffers. It's only meant to illustrate the idea. In real implementations, don't join Buffers with +. ***/ output += pieces[i]; } return output;}
What this does (with all lengths as 64-bit unsigned integers, serialized as 8 bytes):
- Prepend the number of parts being hashed.
- For each part, first prefix its length, then its value.
This is an obvious mitigation for canonicalization attacks:
- If you feed in a different number of pieces, the count (the first 8 bytes) will differ.
- If you try to move data from one piece to another, you’ll produce different lengths for both pieces, which will not yield an input collision to the hash function.
However, it’s important that both mechanism are in play to guarantee security:
- Without the length prefixes, we’re no different than the CTR+HMAC extension we defined above.
- Without the count prefix, it’s possible to drop pieces and then include a dummy “length” in the payload of others to create an input collision.
What’s an Input Collision?
First, you need to know what a collision attack is.
Consider a hash function, H(). If you can identify any two input messages (m1, m2) such that H(m1) = H(m2), you’ve found a collision in the output of the hash function.
An input collision is dumber than that.
If m1 is constructed from multiple segments with different meanings, you don’t need an m2. Just find multiple ways (the collisions) to result in the same m1 value (the input).
That’s what we did earlier when we shifted bytes from the ciphertext to the user agent.
DSSE Leaves Me Dizzy
It should come as no surprise that I find DSSE’s definition of PAE to be somewhat bizarre.
PAE(type, body) = "DSSEv1" + SP + LEN(type) + SP + type + SP + LEN(body) + SP + body+ = concatenationSP = ASCII space [0x20]"DSSEv1" = ASCII [0x44, 0x53, 0x53, 0x45, 0x76, 0x31]LEN(s) = ASCII decimal encoding of the byte length of s, with no leading zeros
The only thing that saves them from canonicalization attacks is that the number of pieces is constant.
If the number of pieces was variable (e.g. if the KEYID
was optionally included in the signature, but they forgot to always include a hard-coded 0 length if it was absent), you could defeat their flavor of PAE by constructing two different messages that produce the same hash in the digital signature algorithm.
This is because the number of pieces isn’t included in the DSSE definition. (If they ever support a variable number of components, and fail to include the count in the signature, they’ll be vulnerable.)
Amusingly, the rationale page for DSSE using PAE states:
- Why use PAE?
- Because we need an unambiguous way of serializing two fields, payloadType and payload. PAE is already documented and good enough. No need to reinvent the wheel.
…Yet, they didn’t actually [i]use the “already documented and good enough” definition of PAE from PASETO.
Let’s not use DSSE’s definition.
(Art by Lynx vs Jackalope)
Fixing AES-CTR+HMAC with PAE
This is pretty straightforward patch:
function encryptWithContext(plaintext, aad, key) { let [encKey, authKey] = splitKey(key); console.log(encKey, authKey); let nonce = crypto.randomBytes(16); const aes = crypto.createCipheriv('aes-256-ctr', encKey, nonce); const ciphertext = aes.update(plaintext); aes.final(); // Pay attention to this part: const hmac = crypto.createHmac('sha256', authKey);- hmac.update(nonce);- hmac.update(ciphertext);- hmac.update(aad);+ hmac.update(PAE([nonce, ciphertext, aad])); return [nonce, ciphertext, hmac.digest()]; }
The only conceivable way to attack this design is to aim for an integer overflow, which will require sending at least 2^63 bytes–at which point, you’re more likely to DDoS the target than succeed.
Wrapping Up
Canonicalization Attacks broadly aren’t well-understood or widely appreciated risks with cryptography protocol design outside of specialist circles (although many application security folks are at least aware of specific instances, i.e. Length Extension).
Part of the reason for this lack of knowledge transfer is that all of the AEAD modes defend against it by design, and most artisanal authenticated encryption constructions don’t bother with additional authenticated data, and most home-made cryptography protocols don’t even authenticate their ciphertexts correctly, and …
You get the point, I hope. There’s unaddressed concerns all the way down. Expecting people who aren’t specialized experts in this specific field to get all of them right is frankly unreasonable. In practice, outside of cryptography, canonicalization either doesn’t matter or there’s something else wrong that’s far more urgent.
https://soatok.blog/2021/07/30/canonicalization-attacks-against-macs-and-signatures/
#collisionAttacks #cryptographicHashFunction #cryptography #digitalSignatureAlgorithm #DSSE #HMAC #lengthExtensionAttacks #MACs #PASETO #SecurityGuidance
This is the first entry in a (potentially infinite) series of dead end roads in the field of cryptanalysis.Cryptography engineering is one of many specialties within the wider field of security engineering. Security engineering is a discipline that chiefly concerns itself with studying how systems fail in order to build better systems–ones that are resilient to malicious acts or even natural disasters. It sounds much simpler than it is.
If you want to develop and securely implement a cryptography feature in the application you’re developing, it isn’t enough to learn how to implement textbook descriptions of cryptography primitives during your C.S. undergrad studies (or equivalent). An active interest in studying how cryptosystems fail is the prerequisite for being a cryptography engineer.
Thus, cryptography engineering and cryptanalysis research go hand-in-hand.
Pictured: How I feel when someone tells me about a novel cryptanalysis technique relevant to the algorithm or protocol I’m implementing. (Art by Khia.)
If you are interested in exploring the field of cryptanalysis–be it to contribute on the attack side of cryptography or to learn better defense mechanisms–you will undoubtedly encounter roads that seem enticing and not well-tread, and it might not be immediately obvious why the road is a dead end. Furthermore, beyond a few comparison tables on Wikipedia or obscure Stack Exchange questions, the cryptology literature is often sparse on details about why these avenues lead nowhere.
So let’s explore where some of these dead-end roads lead, and why they stop where they do.
(Art by Kyume.)
Length Extension Attacks
It’s difficult to provide a better summary of length extension attacks than what Skull Security wrote in 2012. However, that only addresses “What are they?”, “How do you use them?”, and “Which algorithms and constructions are vulnerable?”, but leaves out a more interesting question: “Why were they even possible to begin with?”An Extensive Tale
Tale, not tail! (Art by Swizz.)To really understand length extension attacks, you have to understand how cryptographic hash functions used to be designed. This might sound intimidating, but we don’t need to delve too deep into the internals.
A cryptographic hash function is a keyless pseudorandom transformation from a variable length input to a fixed-length output. Hash functions are typically used as building blocks for larger constructions (both reasonable ones like HMAC-SHA-256, and unreasonable ones like my hash-crypt project).
However, hash functions like SHA-256 are designed to operate on sequential blocks of input. This is because sometimes you need to stream data into a hash function rather than load it all into memory at once. (This is why you can sha256sum a file larger than your available RAM without crashing your computer or causing performance headaches.)
A streaming hash function API might look like this:
class MyCoolHash(BaseHashClass): @staticmethod def init(): """ Initialize the hash state. """ def update(data): """ Update the hash state with additional data. """ def digest(): """ Finalize the hash function. """ def compress(): """ (Private method.) """
To use it, you’d callhash = MyCoolHash.init()
and then chain togetherhash.update()
calls with data as you load it from disk or the network, until you’ve run out of data. Then you’d calldigest()
and obtain the hash of the entire message.There are two things to take away right now:
- You can call
update()
multiple times, and that’s valid.- Your data might not be an even multiple of the internal block size of the hash function. (More often than not, it won’t be!)
So what happens when you call
digest()
and the amount of data you’ve passed toupdate()
is not an even multiple of the hash size?For most hash functions, the answer is simple: Append some ISO/IEC 7816-4 padding until you get a full block, run that through a final iteration of the internal compression function–the same one that gets called on
update()
–and then output the current internal state.Let’s take a slightly deeper look at what a typical runtime would look like for the MyCoolHash class I sketched above:
hash = MyCoolHash.init()
- Initialize some variables to some constants (initialization vectors).
hash.update(blockOfData)
- Start with any buffered data (currently none), count up to 32 bytes. If you’ve reached this amount, invoke
compress()
on that data and clear the buffer. Otherwise, just append blockOfData to the currently buffered data.- For every 32 byte of data not yet touched by
compress()
, invokecompress()
on this block (updating the internal state).- If you have any leftover bytes, append to the internal buffer for the next invocation to process.
hash.update(moreData)
- Same as before, except there might be some buffered data from step 2.
output = hash.digest()
- If you have any data left in the buffer, append a 0x80 byte followed by a bunch of 0x00 bytes of padding until you reach the block size. If you don’t, you have an entire block of padding (0x80 followed by 0x00s).
- Call
compress()
one last time.- Serialize the internal hash state as a byte array or hexadecimal-encoded string (depending on usage). Return that to the caller.
This is fairly general description that will hold for most older hash functions. Some details might be slightly wrong (subtly different padding scheme, whether or not to include a block of empty padding on
digest()
invocations, etc.).The details aren’t super important. Just the rhythm of the design.
init()
update()
- load buffer,
compress()
compress()
compress()
- …
- buffer remainder
update()
- load buffer,
compress()
compress()
compress()
- …
- buffer remainder
- …
digest()
- load buffer, pad,
compress()
- serialize internal state
- return
And thus, without having to know any of the details about what
compress()
even looks like, the reason why length extension attacks were ever possible should leap out at you!Art by Khia.
If it doesn’t, look closely at the difference between
update()
anddigest()
.There are only two differences:
update()
doesn’t pad before callingcompress()
digest()
returns the internal state thatcompress()
always mutatesThe reason length-extension attacks are possible is that, for some hash functions, the output of
digest()
is its full internal state.This means that you can run take an existing hash function and pretend it’s the internal state after an
update()
call instead of adigest()
call by appending the padding and then, after callingcompress()
, appending additional data of your choice.The (F)Utility of Length Extension
Length-Extension Attacks are mostly used for attacking naive message authentication systems where someone attempts to authenticate a message (M) with a secret key (k), but they construct it like so:
auth_code = vulnerable_hash(k.append(M))
If this sounds like a very narrow use-case, that’s because it is. However, it still broke Flickr’s API once, and it’s a popular challenge for CTF competitions around the world.Consequently, length-extension attacks are sometimes thought to be vulnerabilities of the construction rather than a vulnerability of the hash function. For a Message Authentication Code construction, these are classified under canonicalization attacks.
After all, even though SHA-256 is vulnerable to length-extension, but you can’t actually exploit it unless someone is using it in a vulnerable fashion.
That being said, it’s often common to say that hash functions like SHA-256 and SHA-512 are prone to length-extension.
Ways to Avoid Length-Extension Attacks
Use HMAC. HMAC was designed to prevent these kinds of attacks.Alternatively, if you don’t have any cryptographic secrets, you can always do what bitcoin did: Hash your hash again.
return sha256(sha256(message))
Note: Don’t actually do that, it’s dangerous for other reasons. You also don’t want to take this to an extreme. If you iterate your hash too many times, you’ll reinvent PBKDF1 and its insecurity. Two is plenty.Or you can do something really trivial (which ultimately became another standard option in the SHA-2 family of hash functions):
Always start with a 512-bit hash and then truncate your output so the attacker never recovers the entire internal state of your hash in order to extend it.
That’s why you’ll sometimes see SHA-512/224 and SHA-512/256 in a list of recommendations. This isn’t saying “use one or the other”, that’s the (rather confusing) notation for a standardized SHA-512 truncation.
Note: This is actually what SHA-384 has done all along, and that’s one of the reasons why you see SHA-384 used more than SHA-512.
If you want to be extra fancy, you can also just use a different hash function that isn’t vulnerable to length extension, such as SHA-3 or BLAKE2.
Questions and Answers
Art by Khia.Why isn’t BLAKE2 vulnerable to length extension attacks?
Quite simply: It sets a flag in the internal hash state before compressing the final buffer.If you try to deserialize this state then invoke
update()
, you’ll get a different result than BLAKE2’scompress()
produced duringdigest()
.For a secure hash function, a single bit of difference in the internal state should result in a wildly different output. (This is called the avalanche effect.)
Why isn’t SHA-3 vulnerable to length extension attacks?
SHA-3 is a sponge construction whose internal state is much larger than the hash function output. This prevents an attacker from recovering the hash function’s internal state from a message digest (similar to the truncated hash function discussed above).Why don’t length-extension attacks break digital signature algorithms?
Digital signature algorithms–such as RSASSA, ECDSA, and EdDSA–take a cryptographic hash of a message and then perform some asymmetric cryptographic transformation of the hash with the secret key to produce a signature that can be verified with a public key. (The exact details are particular to the signature algorithm in question.)Length-extension attacks only allow you to take a valid H(k || m) and produce a valid H(k || m || padding || extra) hash that will validate, even if you don’t know k. They don’t magically create collisions out of thin air.
Even if you use a weak hash function like SHA-1, knowing M and H(M) is not sufficient to calculate a valid signature. (You need to be able to know these values in order to verify the signature anyway.)
The security of digital signature algorithms depends entirely on the secrecy of the signing key and the security of the asymmetric cryptographic transformation used to generate a signature. (And its resilience to side-channel attacks.)
However, a more interesting class of attack is possible for systems that expect digital signatures to have similar properties as cryptographic hash functions. This would qualify as a protocol vulnerability, not a length-extension vulnerability.
TL;DR
Art by Khia.Length-extension attacks exploit a neat property of a few cryptographic hash functions–most of which you shouldn’t be using in 2020 anyway (SHA-2 is still fine)–but can only be exploited by a narrow set of circumstances.
If you find yourself trying to use length-extension to break anything else, you’ve probably run into a cryptographic dead end and need to backtrack onto more interesting avenues of exploitation–of which there are assuredly many (unless your cryptography is boring).
Next: Timing Side-Channels
https://soatok.blog/2020/10/06/dead-ends-in-cryptanalysis-1-length-extension-attacks/
#cryptanalysis #crypto #cryptographicHashFunction #cryptography #lengthExtensionAttacks