Exploring models calculating the likelihood of the 100th digit of pi being a 7

While writing a post on the relevance and prospects of using prediction markets in the study of logical uncertainty, I produced a graph showing how the question of whether the 100th digit of pi is a 7 might be evaluated if the digits of pi had to be expanded one after the other.

This is an accompanying post, something of an appendix to that future post, so it might not make much sense on its own.

Below is a possible graph of the stock price over time, assuming that to get to the 100th digit you need to expand each digit in turn. I’ve included a few different models:

You can see that for all models the price starts at 50¢ and falls to 10¢; from that point, the models diverge somewhat in their price predictions.

Explaining the models

Two underlying models produce all the lines in these graphs. The first is the uniform model: as there are 10 possible values a digit of pi could take, the probability of the 100th digit being 7 is 10%.

The other model is data-driven. This assumes that the probability of getting a 7 is equivalent to the fraction of digits seen so far that have been a 7.
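As a minimal sketch, the two underlying models can be written as two small Python functions. The function names are hypothetical, and the 10% fallback for an empty history is taken from the statement later in this post that every curve starts at 10% before any data is seen.

```python
def p_seven_uniform():
    """Uniform model: each of the 10 digit values is equally likely."""
    return 1 / 10


def p_seven_data(digits_seen):
    """Data-driven model: fraction of the digits seen so far that were a 7.

    Assumption: with no digits seen yet, fall back to 10%.
    """
    if not digits_seen:
        return 1 / 10
    return digits_seen.count(7) / len(digits_seen)
```

For example, `p_seven_data([1])` is 0, because the only digit seen so far is not a 7.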

The different curves are produced by using different prior probabilities for the two models.

The model which is completely confident the probability is 10% isn’t shown, as that would be a horizontal line at 10¢. Its opposite, the model which relies completely on the digits seen so far, is shown in blue. All the models shown start at 10% before they are shown any data; however, as soon as the first digit (a one) is shown to this model, its probability of seeing a 7 is updated to 0. The model effectively believes that, as it has not seen a 7 so far, it will never see a 7. I would argue that this is perhaps a little naive, so I also added a model (the red line) which gives each underlying model a prior probability of 50%. That is to say, before having seen any data, either underlying model is equally likely.

The reason that this red line also dips almost to 0 to begin with is that the probability of seeing a 7 is calculated as a weighted sum of the probability of seeing a 7 given each model and the data so far.

In these models which combine the underlying two there is something more clever going on.

There are two underlying models; let us create a set containing those two:
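As a sketch, writing $M_u$ for the uniform model and $M_d$ for the data-driven model (assumed labels, not necessarily the original notation), the set is:

```latex
\mathcal{M} = \{ M_u, M_d \}
```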

What we want to find is the probability of any given digit being 7 (assuming that there is no pattern to the digits). However, if we use our data model, we need the probability of a digit being 7 given what we have seen so far. We can then calculate as follows:
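Written out as a sketch, with $D$ standing for the digits seen so far and $M_u$, $M_d$ for the uniform and data-driven models (all assumed symbols), the weighted sum would be:

```latex
P(7 \mid D) = \sum_{m \in \{M_u, M_d\}} P(7 \mid m, D)\, P(m \mid D)
```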

To calculate the probability of any digit being 7 given each model and the data we have seen so far, the formulae are as follows:
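Following the descriptions of the two models given earlier, and assuming the symbols $N$ for the number of digits seen so far and $n_7$ for how many of those were a 7, these would be:

```latex
P(7 \mid M_u, D) = \frac{1}{10}, \qquad P(7 \mid M_d, D) = \frac{n_7}{N}
```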

These values produced by the models need to be multiplied by the probability of the models given the data.

We need to work out the probability of the models given the observations we have made so far (the data). To calculate this we use Bayes’ rule to convert from the probability of the observations given the models, which we know how to calculate. The conversion needs two further quantities. The first is a prior probability for each model; this is what we vary from curve to curve in the graph. The second is the probability of the data independent of the model, which we calculate as the sum over the models of the probability of the data given the model times the prior probability of the model.

Using Bayes’ rule as follows:
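In the assumed notation where $D$ is the data, $m$ is one of the two models $M_u$ (uniform) and $M_d$ (data-driven), and $P(m)$ is that model's prior, Bayes’ rule with the denominator expanded as a weighted sum reads:

```latex
P(m \mid D) = \frac{P(D \mid m)\, P(m)}{\sum_{m' \in \{M_u, M_d\}} P(D \mid m')\, P(m')}
```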

Each of the two models then calculates the probability of the data under itself as follows:
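One reading consistent with the rest of this post, assuming each model treats the digits as independent draws, is given below. The symbols $N$ (total digits observed) and $n_i$ (occurrences of digit $i$) are assumed names, and the product form for the data model, which scores each observed digit with its observed frequency, is an assumption. Terms with $n_i = 0$ contribute a factor of 1.

```latex
P(D \mid M_u) = \left(\frac{1}{10}\right)^{N}, \qquad
P(D \mid M_d) = \prod_{i=0}^{9} \left(\frac{n_i}{N}\right)^{n_i}
```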

The two quantities these formulae depend on are the total number of digits observed so far and the number of occurrences of the ith digit so far.
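Putting the pieces together, a minimal Python sketch of the whole calculation might look like the following. The original was built in a spreadsheet, so the function name `predict_seven`, the parameter `prior_uniform`, and the product-of-frequencies likelihood used for the data model are all assumptions made for illustration.

```python
from collections import Counter

# First 20 decimal digits of pi (after the "3."), used only as example input.
PI_DIGITS = [1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4, 6]


def predict_seven(digits, prior_uniform):
    """Mixture probability that a digit is a 7, given the digits seen so far.

    prior_uniform is the prior probability of the uniform model;
    the data-driven model gets the remaining 1 - prior_uniform.
    """
    n_total = len(digits)
    counts = Counter(digits)

    # Each model's own prediction for a 7.
    p7_uniform = 0.1
    # Assumption: with no data the data model also says 10%.
    p7_data = counts[7] / n_total if n_total else 0.1

    # Likelihood of the observed digits under each model (assumed forms).
    like_uniform = 0.1 ** n_total
    like_data = 1.0
    for count in counts.values():
        like_data *= (count / n_total) ** count

    # Bayes' rule: posterior probability of each model given the data.
    evidence = prior_uniform * like_uniform + (1 - prior_uniform) * like_data
    post_uniform = prior_uniform * like_uniform / evidence
    post_data = 1 - post_uniform

    # Weighted sum of the two predictions.
    return post_uniform * p7_uniform + post_data * p7_data
```

With a 50:50 prior, `predict_seven(PI_DIGITS[:1], 0.5)` comes out just under 1%: after a single digit the data model explains the data ten times better than the uniform model, so it takes most of the weight and its prediction of 0 dominates.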

Below is the same graph as before but with the vertical axis scaled so the maximum value is 10¢.

The 50:50 split (red line) uses a prior probability of 50% for each of the models. As a result, for small sample sizes, particularly at the start, the data are more likely to have been generated by a model with a more limited set of outputs than by one where each output is equally likely, so the data model receives most of the weight. All the curves jump in their prediction of the probability of a 7 after the first 7 occurs, and you can see each subsequent occurrence of a 7 from the subsequent jumps.

Before running the independent models I felt the red line would be the best representation; however, looking at the models in hindsight, I am clearly far more sure of my prior belief that the distribution should be uniform than I’d thought.

I then thought that introducing some level of decreasing prior probability could give a more accurate model. The idea behind the adaptive prior models (all but the red and blue curves) is that, to begin with, we don’t have enough data to judge the probabilities from the data, so we start with a prior probability close to 100% that the uniform model is correct. We then reduce this in subsequent runs, in proportion to the number of digits of pi we have observed.

In the linear backoff function, the prior probability of the uniform model is calculated as a line from 99.5% to just over 50% when we have observed 99 digits.
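The two endpoints quoted above are consistent with a prior of 1 - n/200 after n digits (99.5% at n = 1 and 50.5% at n = 99). Below is a sketch under that assumption; the function name is hypothetical and the spreadsheet's exact formula may differ.

```python
def uniform_prior_linear(n_observed):
    """Prior probability of the uniform model after n_observed digits.

    Assumed form: a straight line, 1 - n/200, which gives 99.5% after
    1 digit and 50.5% after 99 digits.
    """
    return 1.0 - n_observed / 200.0
```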

The data model’s prior is calculated the same way for all the curves: it is the probability which is left over, that is, one minus the prior probability of the uniform model.

There are two curves included in the graph showing an exponential decay. These decay at different speeds, one an order of magnitude slower than the other.

In these exponential models, the prior probability of the uniform model starts at 1 and then decays to just over 50%. The rate of decay is adjusted with alpha. The yellow curve in the graph represents an alpha of 100 and the orange curve an alpha of 1000.
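The exact formula for the exponential curves is not stated here. One form that matches the description (starts at 1, decays towards 50%, and decays more slowly for larger alpha) is sketched below; the function name and the expression itself are assumptions, not the spreadsheet's formula.

```python
import math


def uniform_prior_exponential(n_observed, alpha):
    """Prior probability of the uniform model after n_observed digits.

    Assumed form: 0.5 + 0.5 * exp(-n / alpha). It equals 1 at n = 0 and
    approaches 0.5 from above as n grows; a larger alpha decays more slowly.
    """
    return 0.5 + 0.5 * math.exp(-n_observed / alpha)
```

Under this assumed form, an alpha of 1000 keeps the uniform prior higher at every step than an alpha of 100, so the data model has less influence on that curve.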

As you can see, the more sure we are of our uniform model at the start, the less effect the data-driven model has.

The graphs were produced in Google Sheets; the spreadsheet can be viewed here.