Using Tweets as a Random seed

Question

I would like to start by saying I know nothing about cryptography. I was reading up on how to choose a random seed, and this link is something that I found. What I basically understood is that the seed has to be sufficiently random that guessing it would be hard.

So the question is would the hash of a Tweet, at any given time, be a good candidate for a random seed? This is mainly because the content of a Tweet can be practically anything as it's being generated by a huge percentage of the world population.

That said, I understand it is possible to game it by mass tweeting a specific string continuously from multiple accounts flooding the tweet stream with predictable seeds. So if this can be mitigated by blacklisting the bad usernames, is using tweets for seeds a viable option?

Paul Uszak · Answer 1 · 2019-01-06T01:30:20.663

The other answers provide very good lists of reasons not to use Twitter as an entropy source. What follows is the flip side of your question:-

Why would you want to?

Tweets are typically read on tablets, PCs and phones. All of those have access to hardware entropy sources that can produce oodles of truly random bits for seeding anything. The zeitgeist is that you aim for 128 or 256 bits of entropy and then seed a cryptographically secure pseudo random number generator. That will meet all of your common random number needs.

You have seeding sources such as:-

The RdRand instruction built into most modern CPUs.
/dev/*random as part of the various flavours of *nix.
Microsoft's Cryptography API.
The cameras built into phones and tablets.
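
As a minimal sketch of that approach (assuming Python, whose standard-library `secrets` module draws from the operating system's CSPRNG), pulling a 256-bit seed or ready-made random values locally looks like this. The variable names are only illustrative.

```python
import secrets

# 256 bits (32 bytes) from the operating system's CSPRNG
seed = secrets.token_bytes(32)

# For common needs you rarely have to handle the seed yourself
token = secrets.token_hex(16)   # 128 random bits, hex encoded
roll = secrets.randbelow(100)   # uniform integer in the range 0..99
```

No network access is involved, which is the point being made above.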

There's not a lot of merit in pursuing Tweety entropy, other than for academic purposes.

score 20 · Accepted Answer · edited Oct 07 '21 at 07:34

What you are suggesting is not a good idea for a general purpose random number generator. It could be meaningful for very specific use cases if you need a random number generator whose output can be verified independently by a third party.

Even in those cases there are other sources of entropy which are potentially more suitable. The oldest mention of this approach known to me is RFC 2777. The suggested sources of entropy listed in RFC 2777 are:

lottery winning numbers
closing price of a stock on a particular day
daily balance in the US Treasury on a specified day
the volume of trading on the New York Stock exchange on a specified day
Sporting events

Every one of those looks less likely to be subject to manipulation than posts on Twitter.

Reasons it's not a good general purpose approach

You'll have a cyclic dependency. Before you can retrieve posts from Twitter you'll need random numbers for a number of different purposes including:

If you use IPv4 you'll need randomness for the IPID header field.
If you use IPv6 you'll very likely need randomness for address configuration.
You need randomness to assign request IDs.
You need randomness for TCP sequence numbers.
You need randomness for SSL session setup.

Moreover the entropy of a Twitter post is hard to estimate. Some individual posts may have sufficient entropy on their own, but many will not. It's probably a safe estimate that posts have at least one bit of entropy on average, so if you were to hash together a thousand posts, you'd probably get sufficient entropy.
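
A rough sketch of what "hashing together a thousand posts" could look like, assuming SHA-256 and a length prefix per post. `seed_from_posts` is a hypothetical helper and the posts are placeholders, not real Twitter data.

```python
import hashlib

def seed_from_posts(posts):
    """Hash a list of post texts into one 256-bit value (hypothetical helper)."""
    h = hashlib.sha256()
    for text in posts:
        data = text.encode("utf-8")
        # Length prefix so that ["ab", "c"] and ["a", "bc"] hash differently
        h.update(len(data).to_bytes(4, "big"))
        h.update(data)
    return h.digest()

# Placeholder input standing in for a thousand collected posts
posts = [f"placeholder post {i}" for i in range(1000)]
seed = seed_from_posts(posts)
```

Hashing only concentrates whatever unpredictability the inputs have; it cannot add any, so the result is no harder to guess than the set of posts that went in.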

The resulting output is subject to manipulation by Twitter users. If your algorithm is known a user can compute what seed you'd calculate with different contents of their latest post and choose contents producing randomness that somehow suits that user.
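
To illustrate the kind of manipulation meant here, the sketch below assumes a scheme where the seed is simply the SHA-256 hash of the post text. Both the scheme and the helper names are assumptions for the example. The user keeps varying a suffix until the derived seed has a property they want, here a chosen first byte.

```python
import hashlib
import itertools

def seed_from_post(text):
    # Assumed scheme: the seed is the SHA-256 hash of the post text
    return hashlib.sha256(text.encode("utf-8")).digest()

def grind(base_text, wanted_byte):
    """Append a counter to the post until the seed's first byte matches."""
    for n in itertools.count():
        candidate = f"{base_text} #{n}"
        if seed_from_post(candidate)[0] == wanted_byte:
            return candidate

# Takes about 256 attempts on average for a one-byte target
post = grind("good morning", 0x00)
```

Each extra bit the user wants to control doubles the expected work, but a few dozen bits of influence is cheap for anyone who knows the algorithm.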

The resulting output is also subject to manipulation by Twitter itself. Surely there are Twitter employees with access to information that makes the manipulation available to any Twitter user even easier to pull off.

All of the input to the random number generator will be publicly known. That is bad for a general purpose random number generator, but can be useful in a few very specific use cases.

score 15 · Answer 3 · answered Jan 05 '19 at 12:05

How are you going to decide which tweet to use? Randomly? This quickly leads to a chicken / egg problem.

What if the chosen tweet is one word? That would not add a lot of entropy.

What if twitter is unavailable? Are you just stopping your service that relies on the entropy or are you going to continue regardless?

How are you going to keep the chosen tweet secret? You can use TLS, but TLS requires a random number generator to operate.

How are you going to blacklist in advance? You don't know the attackers in advance, right?

What if Twitter changes its API? Would you keep running if the tweet collection agent crashes or returns bad results?

What if your government decides to block Twitter? There are plenty of governments doing that.

What if you choose a heavily retweeted tweet? How much entropy would that contain?

Having something that provides entropy is just the first step. In general you want something that is local and hard to influence and easy to understand / validate. Twitter doesn't seem to be a good option for any of those requirements.

score 3 · Answer 4 · answered Jan 06 '19 at 06:51

Other answers have already pointed out the chicken/egg catch-22 problem of securely communicating over the Internet before you have a random number, and other showstoppers and possible problems. But you're screwed even against a fully-remote attacker that can't sniff your packets.

The OP commented:
The idea was to select the first tweet we see at the time we want to seed the random number generator to avoid the need for selecting the tweet at random. [...]

Tweets are public, and thus your pool of seeds is available to the attacker.

On average, Tweet throughput is around 6000 tweets per second (source). An attacker that can guess your tweet-query time within one second has a search space of about 6000 tweets. You could say that's equivalent to 12.5 bits of entropy, vastly smaller than the hash length. Or an attacker can widen the window to 1 minute for an equivalent entropy of 18.4 bits, still trivial to brute force in seconds, probably only limited by the time to download all those tweets.
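
The arithmetic behind those figures is just the base-2 logarithm of the number of candidate tweets in the attacker's time window. A small sketch, taking the 6000 tweets per second quoted above as given (the function name is made up for the example):

```python
import math

TWEETS_PER_SECOND = 6000  # average rate quoted above

def search_space_bits(window_seconds):
    """Equivalent entropy, in bits, of picking one tweet from the window."""
    return math.log2(TWEETS_PER_SECOND * window_seconds)

one_second = search_space_bits(1)    # log2(6000)
one_minute = search_space_bits(60)   # log2(360000)
```

These come out at roughly 12.55 and 18.46 bits, in line with the numbers above and far short of the 128 or 256 bits normally aimed for.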

If an attacker controls or knows when a seed was generated, you're screwed. The tighter a time bound they can put on it, the smaller their search space. Even worse, the attacker can simply keep widening their time window with earlier and earlier tweets if they don't find a hit in the first 1-second window they check.

Many use-cases for secure seeding of PRNGs expose the sequence to the attacker so they can test guesses of the seed. The attacker tries each guess with the same PRNG your software uses and checks whether the resulting sequence matches what they've already seen. If it does, then with high probability they can predict the next number they'll see.

There can be false-positive matches that lead to the same initial sequence, for multiple reasons:

They can only see (or work backwards to) rng() & 0xff (low 8 bits) or rng() % 100 (or some better way of generating a 0..99 range), not the full 32 or 64-bit random number value of each PRNG step.
The PRNG has a large hidden internal state, and multiple initial states lead to the same sequence of random numbers. (This is already necessary so that knowing one rng result doesn't uniquely determine the next.)

But by observing enough random data from the same seed, an attacker can test a seed to a very high probability.

With only 6000 possible candidates, the chance of a different candidate giving the same initial sequence you observed is negligible.

And if you test them all over a likely window (and are right about that time window), you can detect when you've uniquely identified the one tweet that produces the sequence you're seeing, so you can potentially "lock on" quite quickly even if you don't get many bits of data per observation of the sequence.
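
A sketch of that search, under loudly stated assumptions: the victim seeds Python's `random.Random` (a Mersenne Twister, used here only as a stand-in for "the same PRNG your software uses") with the SHA-256 hash of one tweet, and the attacker only sees values reduced to the range 0..99. The tweets are placeholders and the helper names are hypothetical.

```python
import hashlib
import random

def rng_from_tweet(text):
    # Assumed scheme: seed the PRNG with the SHA-256 hash of the tweet text
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest, "big"))

def matching_candidates(candidates, observed):
    """Return the candidate tweets whose PRNG output reproduces `observed`."""
    hits = []
    for text in candidates:
        rng = rng_from_tweet(text)
        if all(rng.randrange(100) == value for value in observed):
            hits.append(text)
    return hits

# Stand-in for the ~6000 public tweets in a one-second window
candidates = [f"placeholder tweet {i}" for i in range(6000)]

# The victim seeds from one of them; the attacker sees eight small outputs
victim = rng_from_tweet(candidates[4321])
observed = [victim.randrange(100) for _ in range(8)]

hits = matching_candidates(candidates, observed)
```

With eight observations of a value in 0..99, a wrong candidate matches by chance with probability about 1 in 10^16, so a single surviving hit identifies the seed. Fewer or narrower observations just mean repeating the filter as more outputs arrive.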

If the random number was used as an encryption key, an attacker that can detect "sane looking" plaintext can still attack this way, even if the "sane-looking" check is very weak / inclusive.

Check which (of the ~6000) tweets as seeds lead to sane-looking plaintext from the first key.
Of those few candidate tweets, check which produce sane-looking plaintext from the second key generated from the same sequence. If there were multiple different possibly-sane plaintexts from the first key, this probably rules out most of them. Repeat as necessary.

This might not be the most plausible example, but this kind of idea is applicable for other kinds of things where you don't directly see the random sequence, only a cryptographically-secure use of it. But if you have any mechanism for testing a guess by going through all the steps the target of the attack would take, you can still attack.

Or if you can trigger a re-seed at some known time, and use the service with your own known data to get (probably) some of the first random values generated with that seed, you might be able to work out the seed that it will continue to use for other users' requests.

Only 6000 tweets is a small enough search space that you can start to expand your search space in other dimensions, like allowing for the possibility that other users' requests might have slipped in between yours while you're using it as an oracle to encrypt known plaintext that lets you check. (Or some equivalent thing that lets you really check your PRNG sequence guesses.)
