The Linux Random Number Generator

Introduction
There is a lot of misinformation out there about /dev/random and /dev/urandom with regard to the Linux kernel. Unfortunately, not even the random(4) manpage seems to get it right. So, rather than argue with everyone on the Internet, I'll post the inner workings of the Linux kernel with respect to randomness here.

TL;DR
If you're not interested in reading the full post, then here is the summary: use /dev/urandom. It is fed from the same CSPRNG as /dev/random, and it won't block on you.

The Entropy Pools
There are three arrays of unsigned integers in the Linux kernel, called entropy pools, each of which maintains its own state. They are the input pool, the blocking pool, and the non-blocking pool. The input pool maintains a maximum internal state of 4096 bits. When looking at the value of "/proc/sys/kernel/random/entropy_avail", you are looking at the current entropy estimate of the input pool. The blocking and non-blocking pools each keep an internal state of just 1024 bits.
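You can check the input pool size and the kernel's current entropy estimate through procfs (the entropy_avail value below is only an example; it changes constantly):

$ cat /proc/sys/kernel/random/poolsize
4096
$ cat /proc/sys/kernel/random/entropy_avail
3071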

Below is an image showing the architecture of the Linux kernel random number generators. Each of the three pools is shown in green. You'll notice that the input pool is fed with data from external events, such as interrupt request timings, disk and network I/O, as well as human input on the keyboard and from the mouse. That data is mixed into the pools, and SHA1 is used when extracting output, as part of the cryptographically secure pseudorandom number generator (CSPRNG).

[Image: architecture of the Linux kernel random number generator, showing the input, blocking, and non-blocking pools]

The blocking and non-blocking pools both feed from the same CSPRNG to produce their random output. There are three interfaces to this data: the /dev/random and /dev/urandom character devices for user space, and the get_random_bytes() function for code running inside the kernel. The /dev/random character device will block when the blocking pool's entropy estimate is exhausted, whereas both /dev/urandom and get_random_bytes() will not block, even when entropy has been exhausted.
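The difference in blocking behavior is easy to see from the shell. A read from /dev/urandom returns as much data as you ask for, while the same read from /dev/random stalls almost immediately and has to be interrupted (the byte counts and throughput below are only illustrative):

$ dd if=/dev/urandom of=/dev/null bs=1M count=10
10+0 records in
10+0 records out
10485760 bytes (10 MB) copied, 0.783806 s, 13.4 MB/s
$ dd if=/dev/random of=/dev/null bs=1M count=10
^C
0+1 records in
0+1 records out
53 bytes (53 B) copied, 9.71042 s, 0.0 kB/s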

What's important to understand here, is that both the blocking pool, from which /dev/random draws its data, and the non-blocking pool, from which /dev/urandom draws its data, get their data from the same CSPRNG.

Theoretic versus Computational Security
There are really just two types of security that you should be aware of: theoretic security and computational (or practical) security.

Theoretic Security: This is security which can be proven through a logical mathematical proof. Two examples come to mind (there are others): the One Time Pad and Shamir's Secret Sharing. Both can be implemented in software, even though there may be some difficulty in the implementation.

Computational Security: This is security where algorithms are built on problems believed to be hard for classical computers, such as the discrete logarithm problem and the quadratic residuosity problem. It is known that it takes considerable energy, time, and effort to work through these problems when certain criteria are met, such as factoring a number hundreds or thousands of digits long back into its primes. Because it's computationally difficult to solve these problems, we consider the algorithms built from them computationally secure.

Probably every cryptographic algorithm you are aware of, such as AES, RSA, SHA, and so forth, is computationally secure, not theoretically secure. This means that an adversary with unlimited time, resources, and processing power could break computationally secure algorithms. This is also why key and bit sizes keep increasing: as processing power and computing "tricks" improve, the size of the problem needs to increase as well.

Random Numbers
There are basically two ways in which you can generate random numbers: in hardware or in software (if you are getting random numbers from a Geiger counter, even though radioactive decay is found in nature, you are still using a hardware device to feed it into the computer). In both cases, however, the sequences of numbers could be "true random" or pseudorandom.

First, I'm not going to jump into the long argument that is "true randomness". That quickly turns into a philosophical debate. Some will say that "true random" only exists in quantum nature, while others will explain that even quantum mechanics obeys mathematical equations and predictable behavior that we just haven't discovered yet. For me, I'm content calling the output of some quantum events "true random". And while all mathematical algorithms in software can only produce pseudorandom numbers, some algorithms have such a large cycle, and are computationally secure enough, that their output is indistinguishable from true random data.

Further, even if you have a "true random" number from nature, how do you say that a number by itself is random? It's usually more practical to call a sequence of numbers unpredictable, rather than "random". For cryptography, that's really the only thing that matters. Consider the following sequences of text: one was encrypted with AES, one is "true random" data fed from a hardware random number generator, and the other was pseudorandomly generated:

fa819c3e86d9397184628a5d1c06f65b
76a153e028f1c1084bb56679829e5e95
399916454dbb20cece79224b9be0ca5a

Looking at those three sequences, one has a very rigid underlying structure: it is legitimate data encrypted with AES, while the other two strings are 16 bytes read from either /dev/urandom or /dev/hwrng. Examining the sequences of bits, is it computationally feasible to determine the underlying structure that was used to build the ciphertext, and get back to the original plaintext without the key? From a theoretic perspective, the encrypted bits should appear indistinguishable from a truly unpredictable sequence of bits, such as those grabbed from the hardware random number generator.

I'll give you the solution:

$ echo "Random numberS" | aespipe | xxd -ps
Password: 12345678901234567890
fa819c3e86d9397184628a5d1c06f65b
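The other two strings were nothing more than 16-byte reads from the random devices. Something like the following will produce them (your output will of course differ, and /dev/hwrng only exists, and is usually only readable by root, when a hardware RNG driver is loaded):

$ dd if=/dev/urandom bs=16 count=1 2>/dev/null | xxd -ps
76a153e028f1c1084bb56679829e5e95
# dd if=/dev/hwrng bs=16 count=1 2>/dev/null | xxd -ps
399916454dbb20cece79224b9be0ca5a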

However, suppose you do have "true random" numbers; maybe you got them from radioactive decay, diode breakdown, or atmospheric noise. Regardless, you have a sequence of bits that is as "true random" as you can get. So, now what are you going to do with them? If you're going to use them with some information theoretic security algorithm, you might be doing the world some good. However, if you're going to subject those numbers to a computationally secure, man-made software algorithm running on a deterministic computer, what was gained by having "true random" numbers? The weakness will then lie in the software algorithm, the internal state of the computer, and human factors; not in the numbers themselves. Subjecting "true random" numbers to a theoretic algorithm is one thing; subjecting them to a man-made software algorithm that is at best computationally secure is another. In all honesty, you are not much better off with the Universe giving you the numbers than you are with a CSPRNG giving them to you, if all you need to do is generate a long-term GPG or SSL key.

Don't get me wrong. I'm not suggesting that having "true random" numbers is a bad thing; it's a good thing. Injecting "true random" numbers into the kernel's input entropy pool reseeds the CSPRNG, putting it in a "true random" state from which it can build your pseudorandom numbers. Having "true random" numbers is a good thing, and increasing the input pool entropy is a great thing. In fact, due to the nature of reseeding the CSPRNG, the Linux random number generator "self heals". In other words, if the internal state of the system was known at some fixed point in time, reseeding the CSPRNG with entropy fed by "true random" numbers changes the internal state of the generator. So even though all future output could have been predicted from the compromised state, that future output has now changed, because the internal state of the system has changed.

A self healing CSPRNG is a great thing to have. And guess what? If you're constantly feeding the input pool with "true random" noise, then the output of /dev/urandom will be completely indistinguishable from "true random" data sequences, due to this reseeding.
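Feeding the pool can be as simple as writing to the device, though note that, per random(4), a plain write only mixes the data in; it does not increase the entropy estimate. Crediting entropy requires the RNDADDENTROPY ioctl, which is what rngd(8) does on your behalf. A quick sketch (the entropy_avail values are only examples):

$ cat /proc/sys/kernel/random/entropy_avail
3071
$ echo "output from some trusted true random source" > /dev/urandom
$ cat /proc/sys/kernel/random/entropy_avail
3070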

/dev/random versus /dev/urandom
And now we get to the meat of the issue: which output interface should you use for "the most demanding cryptographic computational long-term security"? First things first: both /dev/random and /dev/urandom are fed data from the same CSPRNG. Let me repeat that, with a bit more emphasis: both /dev/random and /dev/urandom are fed data from the same CSPRNG. In fact, most who understand this will explain that you should use /dev/urandom, and NOT /dev/random. The problem is that /dev/random has the horrible bug of blocking when the entropy pool estimate is exhausted. So, if your application relies on /dev/random for its random numbers, there are pauses and delays until there is enough entropy in the input pool to reseed the CSPRNG and hand out more data.

The point? Stop using /dev/random and start using /dev/urandom. To back up my claim, here are some experts who agree:

Here, cryptographer Dr. Daniel Bernstein puts me, personally, in my place on a mailing list (I've learned the error of that line of logic since then):

Cryptographers are certainly not responsible for this superstitious
nonsense. Think about this for a moment: whoever wrote the /dev/random
manual page seems to simultaneously believe that

(1) we can't figure out how to deterministically expand one 256-bit
/dev/random output into an endless stream of unpredictable keys
(this is what we need from urandom), but

(2) we _can_ figure out how to use a single key to safely encrypt
many messages (this is what we need from SSL, PGP, etc.).

For a cryptographer this doesn't even pass the laugh test.

I'm not saying that /dev/urandom has a perfect API. It's disappointingly
common for vendors to deploy devices where the randomness pool has never
been initialized; BSD /dev/urandom catches this configuration bug by
blocking, but Linux /dev/urandom (unlike Linux /dev/random) spews
predictable data, causing (e.g.) the widespread RSA security failures
documented on http://factorable.net. But fixing this configuration bug
has nothing to do with the /dev/random superstitions.

Dr. Bernstein continues in a paper about the problems with collecting too much entropy, and why that could even be a bad thing. However, he repeats the same advice as above: use /dev/urandom. Dr. Bernstein also created a cryptographic library called NaCl, which relies on /dev/urandom for its randomness.

Cryptographer Dr. Thomas Pornin also explains the nature of /dev/urandom:

/dev/urandom yields data which is indistinguishable from true randomness, given existing technology. Getting "better" randomness than what /dev/urandom provides is meaningless, unless you are using one of the few "information theoretic" cryptographic algorithm, which is not your case (you would know it).

In an additional post, he mentions:

There are many applications which read /dev/random as a kind of ritual, as if it was "better" than /dev/urandom, probably on a karmic level. This is plain wrong, especially when the alea is to be used with classical cryptographic algorithms (e.g. to generate a SSH server public key). Do not do that. Instead, use /dev/urandom and you will live longer and happier. Even for one-time pad.

Even Bram Cohen agrees.

In terms of software packages, the Go language uses /dev/urandom as the source for its CSPRNG, where available. OpenSSL uses /dev/urandom when available. OpenSSH will use OpenSSL's random number source if compiled with OpenSSL support, which uses /dev/urandom. OpenVPN will also use OpenSSL's random number source, unless compiled against PolarSSL. Guess what? /dev/urandom.
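You can verify this sort of thing yourself. For example, tracing OpenSSL's command line tool shows it opening /dev/urandom (the file descriptor is just an example, and newer systems will show openat() rather than open()):

$ strace -e trace=open openssl rand -hex 16 2>&1 | grep random
open("/dev/urandom", O_RDONLY|O_NOCTTY|O_NONBLOCK) = 4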

The only software package I can think of that deliberately uses /dev/random instead of /dev/urandom is GnuPG, when generating keys. However, this is only to guarantee that there has been sufficient entropy in the system to reseed the CSPRNG. When the input entropy pool estimate is exhausted, /dev/urandom will happily give away a non-blocking stream of random data, where /dev/random will block. Adding non-malicious entropy to your system is not a bad thing. As mentioned earlier, reseeding the CSPRNG with entropy from random sources gives you a "self-healing" RNG. Thus, GnuPG ensures that there is enough entropy to reseed the CSPRNG, so that the state of your system has "healed" and is in a completely different state than before you started. However, if you just linked /dev/random to /dev/urandom and generated your GPG key, it would be no less secure, provided that the system you generated the key on is not compromised in any fashion.
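If you want to experiment with that claim, a bind mount is a quick way to point /dev/random at /dev/urandom until the next reboot. This is only a sketch for experimentation, not a recommendation for production systems:

# mount --bind /dev/urandom /dev/random
$ gpg --gen-key
# umount /dev/random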

The point? Use /dev/urandom.

Cryptographically Secure Pseudorandom Number Generators
I know what you're thinking: a pseudorandom number generator could not possibly be as "good" as a true random number generator. In many cases, this is true. There are pseudorandom number generators which are not cryptographically (which means "computationally") secure. The Mersenne Twister comes to mind. It's fast, and it will provide pseudorandom numbers for as long as you wish. It's the default pseudorandom number generator for a great many programming languages, such as Python, and rightfully so; it's a good general-purpose choice. But it's not computationally strong. After observing 624 consecutive outputs, an attacker has gathered enough information to reconstruct the internal state and predict the remaining sequence.
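Python illustrates the distinction nicely: the random module is the Mersenne Twister, fine for simulations but not for keys, while os.urandom() reads from the kernel's /dev/urandom. A quick sketch (the output values are only examples, and on Python 3 the hex string prints with a b'' prefix):

$ python -c 'import random; print([random.getrandbits(32) for _ in range(3)])'
[1656004138, 3834409337, 2295854696]
$ python -c 'import os, binascii; print(binascii.hexlify(os.urandom(16)))'
3f6c2a0d9be84b1e7a55c09812f4d6e0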

So, the question remains: can you have a computationally strong pseudorandom number generator? The answer is a big resounding YES. The Wikipedia article on CSPRNGs lists a number of practical designs built from cryptographic primitives, such as block ciphers running in counter mode, stream ciphers, and cryptographic hashes.

There are also a couple of theoretic designs, such as Blum Blum Shub and Blum-Micali, which are computationally strong but, due to the nature of the algorithms, too slow to be practical.

The CSPRNG in the Linux kernel uses SHA1 over the entropy pool when handing out random numbers, and the result is also mixed back into the pool so that the generator never reveals its full internal state. Even though theoretical attacks have been made against SHA1, the design is flexible enough that a different cryptographically secure hashing algorithm could be put in its place, such as SHA2, SHA3, WHIRLPOOL, etc.

Non-blocking /dev/random
If you absolutely must have a /dev/random device that doesn't block, either because some software application relies on it, such as GnuPG, or because you are working with some information theoretic algorithm, then there are a few options available to you.

The first is the HAVEGE software daemon (haveged). It harvests many more timing events from the processor than the standard kernel does, and can keep your input entropy pool full. In my testing, this can produce around 1 MBps of pseudorandom data out of /dev/random. The only drawback is that it hits the CPU fairly frequently, and can cause the load on the system to jump. However, I have found in practice that if you set the low watermark to 512, which is typically considered plenty of entropy even for the most stringent cryptographic keys, you'll never notice it running.
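As a rough sketch of what that looks like in practice (the -w flag sets the watermark on the haveged builds I've used, and the throughput figures are only illustrative):

# haveged -w 512
$ dd if=/dev/random of=/dev/null bs=1M count=8 iflag=fullblock
8+0 records in
8+0 records out
8388608 bytes (8.4 MB) copied, 8.50045 s, 987 kB/s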

Other options include purchasing USB devices that use avalanche noise in electronic circuits. Using such a device, along with rngd(8), can keep /dev/random from blocking, provided that the device remains operational. One drawback with these devices is the possibility that the circuit breaks in some fashion, producing biased and highly predictable data. Some implementations, such as the Entropy Key (which is no longer in production), address this by doing internal bias testing and whitening, at the expense of throughput. These USB devices can be purchased for around $50, and can give you up to about 40 KBps of random data per device.

Lastly, some chip makers, such as Intel and Broadcom, are putting a HWRNG (hardware random number generator) on the chip itself. All it takes to use these generators is loading a kernel module. Typically, a /dev/hwrng device will then show up, and you can again use rngd(8) to feed its output into the input entropy pool for /dev/random. Unfortunately, when the specifications are not available and no audit has been performed, it is unknown whether the NSA has compromised the integrity of these hardware generators. Merely passing randomness test suites, such as DIEHARD and the FIPS tests, is not enough to discover whether or not the random data is predictable by an attacker. However, throughput can be upwards of 500 MBps for these generators. If you need that sort of throughput, these are the generators for you.
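As a sketch, on hardware with a supported on-chip generator (the module name below is the Raspberry Pi's Broadcom RNG; Intel and others ship their own drivers, and the entropy_avail value is just an example):

# modprobe bcm2835-rng
# rngd -r /dev/hwrng
$ cat /proc/sys/kernel/random/entropy_avail
3964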

Conclusion
Use /dev/urandom.