Archive for the ‘Probability Theory’ Category

From classical deduction to Bayesian probability

August 15, 2012

Terence Tao on classical deduction and Bayesian probability:

In classical logic, one can represent one’s information about a system as a set of possible states that the system could be in, based on the information at hand. With each new measurement of the system, some possibilities could be eliminated, leading to an updated posterior set of information that is an improvement over the prior set of information. A good example of this type of updating occurs when solving a Sudoku puzzle; each new cell value that one learns about constrains the possible values of the remaining cells. Other examples can be found in the classic detective stories of Arthur Conan Doyle featuring Sherlock Holmes. Proof by contradiction can also be viewed as an instance of this type of deduction.

A modern refinement of classical deduction is that of Bayesian probability. Here, one’s information about a system is not merely represented as a set of possible states, but by a probability distribution on the space of all states, indicating one’s current beliefs on the likelihood of each particular state actually being the true state. Each new measurement of the system then updates a prior probability distribution to a posterior probability distribution, using Bayes’ formula

P (A \mid B) = \displaystyle \frac{P (B \mid A) P(A)}{ P(B) }.

Bayesian probability is widely used in statistics, in machine learning, and in the sciences.

To relate Bayesian probability to classical deduction, recall that every probability distribution has a support, which (in the case when the space of states is discrete) is the set of all states that occur with non-zero probability. When performing a Bayesian update on a discrete space, any state which is inconsistent with the new piece of information will have its posterior probability set to zero, and thus be removed from the support. Thus we see that whilst the probability distribution evolves by Bayesian updating, the support evolves by classical deductive logic. Thus one can view classical logic as the qualitative projection of Bayesian probability, or equivalently, one can view Bayesian probability as a quantitative refinement of classical logic.

Alternatively, one can view Bayesian probability as a special case of classical logic by taking a frequentist interpretation. In this interpretation, one views the actual universe (or at least the actual system) as just one of a large number of possible universes (or systems). In each of these universes, the system is in one of the possible states; the probability assigned to each state is then the proportion of the possible universes in which that state is attained. Each new measurement eliminates some fraction of the universes in a given state, depending on how likely or unlikely that state was to actually produce that measurement; the surviving universes then have a new posterior probability distribution, which is related to the prior distribution by Bayes’ formula.

It is instructive to interpret Sherlock Holmes‘ famous quote, “When you have eliminated all which is impossible, then whatever remains, however improbable, must be the truth,” from a Bayesian viewpoint. The statement is technically correct; however, when performing this type of elimination to an (a priori) improbable conclusion, the denominator in Bayes’ formula is extremely small, and so the deduction is unstable if it later turns out that some of the possibilities thought to have been completely eliminated, were in fact only incompletely eliminated. (See also the mantra “extraordinary claims require extraordinary evidence”, which can be viewed as the Bayesian counterpoint to Holmes’ classical remark.)

Another interesting place where one can contrast classical deduction with Bayesian deduction is with regard to taking converses. In classical logic, if one knows that A implies B, one cannot then deduce that B implies A. However, in Bayesian probability, if one knows that the presence of A elevates the likelihood that B is true, then an observation of B will conversely elevate the prior probability that A is true, thanks to Bayes’ formula: if P(B \mid A) > P(B), then P(A \mid B) > P(A). This may help explain why taking converses is an intuitive operation to those who have not yet been thoroughly exposed to classical logic. (It is also instructive to understand why this disparity between the two types of deduction is not in conflict with the previously mentioned links between the two. This disparity is roughly analogous to the disparity between worst-case analysis and average-case analysis.)

Bayesian probability can be generalised further; for instance, quantum mechanics (with the Copenhagen interpretation) can be viewed as a noncommutative generalisation of Bayesian probability, though the connection to classical logic is then lost when one is dealing with observables that do not commute. But this is another story…

__________

Please do note that Terence Tao’s original post contains neither links nor boldface highlighting. I took the liberty of adding those for convenience and emphasis. To improve legibility I also wrote the mathematical expressions in \LaTeX.

__________

Source:

Terence Tao, “A modern refinement of classical deduction is that of Bayesian probability”, Google Buzz, April 4, 2010.

Rainy days

July 7, 2011

A couple of weeks ago, Presh Talwalkar posted this puzzle:

In Mathland, the weather is described either as sunny or rainy. On a sunny day, there is an equal chance it will rain on the following day or be sunny. On a rainy day, however, there is a 70% chance it will rain on the following day versus a 30% chance it will be sunny. How many days is it expected to rain in a 365-day year? Assume the first day is sunny.

__________

My solution:

We will consider a period of N days. Since in our oversimplified world each day is either sunny or rainy, we introduce N random variables W_1, W_2, \dots, W_N taking values in \{S, R\}, where W_k = R means that the k-th day is rainy. We assume that the first day is sunny, i.e., W_1 = S. Hence, \mathbb{P} (W_1 = S) = 1 and \mathbb{P} (W_1 = R) = 0.

Let us now introduce N random variables X_1, X_2, \dots, X_N taking values in \{0, 1\}, where each X_k is defined as follows

X_k =\begin{cases} 1, & W_k = R\\ 0, & W_k = S\end{cases}

Note that the expected value of X_k is \mathbb{E}(X_k) = \mathbb{P} (W_k = R). We introduce a new random variable Y, defined by

Y = X_1 + X_2 + \dots + X_N

where Y is the number of rainy days in a period of N days. We want to find the expected number of rainy days, which is

\mathbb{E} (Y) = \mathbb{E} \left[ \displaystyle\sum_{k=1}^N X_k \right] = \displaystyle \sum_{k=1}^N \mathbb{E} (X_k) = \displaystyle \sum_{k=1}^N \mathbb{P} (W_k = R).

Hence, to compute the desired expected value, \mathbb{E} (Y), we must first compute the probabilities \mathbb{P} (W_k = R) for 1 \leq k \leq N.

Let the probability mass function (p.m.f.) on the k-th day be

\pi_k := \left[\begin{array}{cc} \mathbb{P} (W_k = S) & \mathbb{P} (W_k = R)\end{array}\right]

We know that the p.m.f. on the first day is \pi_1 = \left[\begin{array}{cc} 1 & 0\end{array}\right], as we assume the first day is sunny. Propagating the p.m.f. forwards in time, we obtain \pi_{k+1} = \pi_k P, where

P := \left[\begin{array}{cc} 0.5 & 0.5\\ 0.3 & 0.7\\\end{array}\right]

is the transition matrix. Hence, we have \pi_k = \pi_1 P^{k-1} and, thus,

\mathbb{P} (W_k = R) = \left[\begin{array}{cc} 1 & 0\end{array}\right] \left[\begin{array}{cc} 0.5 & 0.5\\ 0.3 & 0.7\\\end{array}\right]^{k-1} \left[\begin{array}{c} 0 \\ 1\\\end{array}\right]

Matrix P has the eigendecomposition P = Q D Q^{-1} and, hence, P^{k-1} = Q D^{k-1} Q^{-1}. We can finally compute \mathbb{P} (W_k = R) by matrix-vector multiplication, a task so excruciatingly dull that I am most happy to delegate it to my Pythonic minion:

from sympy import *

# create symbolic variable
k = Symbol('k')

# define transition matrix
P = Matrix([[0.5, 0.5], [0.3, 0.7]])

# eigendecomposition of the transition matrix
Q = Matrix([[1/sqrt(2), 5/sqrt(34)], [1/sqrt(2), -3/sqrt(34)]])
D = Matrix([[1, 0], [0, 0.2]])

# check if eigendecomposition is correct
print Q * D * Q.inv() - P

# create vectors
e1 = Matrix(2, 1, [1, 0])
e2 = Matrix(2, 1, [0, 1])

# compute and print probability that the k-th day is rainy
print e1.T * Q * Matrix([[1, 0], [0, 0.2**(k-1)]]) * Q.inv() * e2

This Python / SymPy script gives us

\mathbb{P} (W_k = R) = \frac{5}{8} \left(1 - 0.2^{k-1}\right)

and, finally, we obtain the desired expected value

\mathbb{E} (Y) = \displaystyle \sum_{k=1}^N \mathbb{P} (W_k = R) = \displaystyle \sum_{k=1}^N \frac{5}{8} \left(1 - 0.2^{k-1}\right) = \displaystyle\frac{5}{8}\sum_{k=0}^{N-1} \left(1 - 0.2^k\right)

which is equal to

\mathbb{E} (Y) = \displaystyle\frac{5 N}{8} - \frac{25}{32} (1 - 0.2^N) \approx \displaystyle\frac{5 N}{8} - \frac{25}{32}.

For N = 365, we get \mathbb{E} (Y) \approx 228 rainy days per year. Mathland is certainly not in Southern California ;-)

Dealing cards

April 25, 2011

Imagine that you have a deck of three cards: 2, 3, and 4. You shuffle the deck and deal out the three cards. What is the probability that:

  • the second card is even given that the first card is even?
  • the first two cards are even given that the third card is even?
  • the second card is even given that the first card is odd?
  • the second card is odd given that the first card is odd?

(this is problem 1.5.5 in Yates & Goodman [1])

[ source ]

__________

My solution:

Our experiment consists of shuffling the deck, dealing out the cards, and observing the sequence in which the three cards are dealt. Hence, each outcome or observation is a 3-tuple

(\text{1st card}, \text{2nd card}, \text{3rd card})

and the sample space, the set of all observations, is thus

S = \{(2,3,4), (2,4,3), (3,2,4), (3,4,2), (4,2,3), (4,3,2)\}.

Note that |S| = 3! = 6. We assume that the six observations are equally likely, i.e., \mathbb{P}(s) = 1/6 for all s \in S. Let E_i denote the event that the i-th card dealt is even numbered, and let \mathcal{O}_i denote the event that the i-th card dealt is odd numbered. Recall that an event is a subset of the sample space S and, thus, the event space (the set of all events) is 2^S. Hence, we have

E_1 = \{(2,3,4), (2,4,3), (4,2,3), (4,3,2)\}

E_2 = \{(2,4,3), (3,2,4), (3,4,2), (4,2,3)\}

and E_1 \cap E_2 = \{(2,4,3), (4,2,3)\}. Hence

\mathbb{P}(E_1) = \mathbb{P}\left(\displaystyle\bigcup_{s \in E_1} \{s\}\right) = \displaystyle\sum_{s \in E_1} \mathbb{P}(s) = 4/6

and \mathbb{P}(E_1 \cap E_2) = 2/6. Thus, the probability that the 2nd card is even given that the 1st card is even is the following

\mathbb{P}(E_2 \mid E_1) = \displaystyle\frac{\mathbb{P}(E_1 \cap E_2)}{\mathbb{P}(E_1)} = \displaystyle\frac{2/6}{4/6} = \frac{1}{2}.

Interpretation: if the 1st card is even numbered, after the 1st card is dealt (and observed) there will be only two cards left in the deck, one even numbered and another one odd numbered, and these two are equally likely. Hence, we conclude that the probability that the 2nd card is even given that the 1st card is even is 1/2.

Since there are only two even numbered cards in the deck, we have E_1 \cap E_2 \cap E_3 = \emptyset. Therefore, the probability that the first two cards are even given that the 3rd card is even is

\mathbb{P}(E_1 \cap E_2 \mid E_3) = \displaystyle\frac{\mathbb{P}(E_1 \cap E_2 \cap E_3)}{\mathbb{P}(E_3)} = \displaystyle\frac{\mathbb{P}(\emptyset)}{\mathbb{P}(E_3)} = 0.

Interpretation: if the 3rd card is even numbered, then the two cards previously dealt must be odd and even numbered (not necessarily in that order, of course), as we only have two even numbered cards in the deck. Thus, it’s impossible to get three even cards and \mathbb{P}(E_1 \cap E_2 \mid E_3) = 0.

Note that \mathcal{O}_1 =\{(3,2,4), (3,4,2)\} \subset E_2. Hence, we have that E_2 \cap \mathcal{O}_1 = \mathcal{O}_1. Therefore, the probability that the 2nd card is even given that the 1st card is odd is

\mathbb{P}(E_2 \mid \mathcal{O}_1) = \displaystyle\frac{\mathbb{P}(E_2 \cap \mathcal{O}_1)}{\mathbb{P}(\mathcal{O}_1)} = \displaystyle\frac{\mathbb{P}(\mathcal{O}_1)}{\mathbb{P}(\mathcal{O}_1)} = 1.

Interpretation: if the 1st card is odd numbered, then only even numbered cards remain in the deck. Thus, we know for sure that the 2nd card will be even if the 1st is odd.

Since we only have one odd numbered card in the deck, we can’t have an observation / outcome with two odd numbered cards, i.e., \mathcal{O}_1 \cap \mathcal{O}_2 = \emptyset. Hence, the probability that the 2nd card is odd given that the 1st card is odd is

\mathbb{P}(\mathcal{O}_2 \mid \mathcal{O}_1) = \displaystyle\frac{\mathbb{P}(\mathcal{O}_1 \cap \mathcal{O}_2)}{\mathbb{P}(\mathcal{O}_1)} = \displaystyle\frac{\mathbb{P}(\emptyset)}{\mathbb{P}(\mathcal{O}_1)} = 0.

Interpretation: if the 1st card is odd numbered, then only even numbered cards will remain in the deck. Thus, it’s impossible for the 2nd card to be odd when the 1st one is odd.

__________

References

[1] Roy D. Yates, David J. Goodman, Probability and stochastic processes: a friendly introduction for electrical and computer engineers, 2nd edition, Wiley, 2004.

How many dovetail shuffles suffice?

December 17, 2009

How many shuffles are necessary to randomize a deck of cards? Mathematically, card-shuffling can be viewed as a random walk on a finite group and, thus, it can be modeled by a Markov chain. We measure the degree of “randomness” of a deck of n cards after k shuffles using the Kullback-Leibler distance between the probability distribution of the Markov chain after k shuffles and the uniform probability distribution. Lastly, we discuss an intriguing phenomenon that resembles a phase transition.

__________

Introduction

Imagine that we have a deck of n cards. We assign number 1 to the card at the top of the deck, number 2 to the next card, and so forth, until we assign number n to the card at the bottom of the deck. This ordered arrangement of cards can be represented by an n-tuple (1,2,\dots,n-1,n). The simplest possible shuffle would be to take the card at the top of the deck and insert it at some random position in the deck, say, at the bottom. The new ordered arrangement of cards would then be represented by the n-tuple (2,3,\dots,n,1). This shuffling technique is called the top-in-at-random shuffle [1]. In this paper, we will consider the dovetail shuffle instead [2].

In order to model the shuffling process, we need to introduce some mathematical terminology first. We define [n] := \{1, 2, \dots, n\}. Let \mathcal{S}_n denote the symmetric group on set [n], i.e., \mathcal{S}_n is the group that consists of all bijections \pi : [n] \to [n] with function composition as the group operation. The elements \pi \in \mathcal{S}_n are called permutations. Given \pi, \sigma \in \mathcal{S}_n, then \pi \circ \sigma is the bijection obtained by applying first \sigma and then \pi. Every ordered arrangement, or configuration, of a deck of n cards can be associated with an element \pi \in \mathcal{S}_n. A sequence of k shuffles can be thought of as a composition of k bijections \pi^{(0)}, \pi^{(1)}, \dots, \pi^{(k-1)} \in \mathcal{S}_n, as follows

\pi^{(k-1)} \circ \dots \circ \pi^{(1)} \circ \pi^{(0)} = \sigma^{(k)} \in \mathcal{S}_n.

Hence, the card to which we assigned number i \in [n], and that was at position i initially, will be at position \sigma^{(k)}(i) after k shuffles. One can think of the shuffling process as a random walk on \mathcal{S}_n. Thus, we can define a Markov chain on \mathcal{S}_n. The shuffling method that one uses defines the transition probabilities of the Markov chain. It would be an interesting challenge to start with a desired transition probability matrix and try to design shuffling methods that yield a Markov chain as close as possible to the desired one.

Let us look again at the top-in-at-random shuffling example given above. If we take the top card and insert it at the bottom of the deck, then we know exactly which permutation \pi \in \mathcal{S}_n we did apply. However, the goal of card-shuffling is to “randomize” a deck, and if we know which permutation was applied with each shuffle, then we know exactly the new ordering of the deck. This is desirable only if our purpose is cheating. Therefore, in order to randomize a deck of cards we must apply a sufficient number of permutations chosen at random from \mathcal{S}_n, so that there is enough uncertainty on what the new ordering of the deck of cards is. In this post we will study how many shuffles suffice to attain a desired level of uncertainty.

__________

The Dovetail Shuffle

The dovetail shuffle, or riffle shuffle, is a common card-shuffling method that consists of cutting a deck of cards in two packets, and then interleaving the two packets together. A variation is the a-shuffle, which consists of cutting the deck in a packets and then interleaving them all together. In this post, we will consider dovetail 2-shuffles only.

A rising sequence is defined by Bayer & Diaconis in [2] as a “maximal subset of an arrangement of cards, consisting  of successive face values displayed in order”. Initially, the ordered arrangement of the deck is \pi(i) = i for all i \in [n] and, hence, there is exactly one rising sequence. Each dovetail shuffle tends to double the number of rising sequences, until a saturation effect takes place due to the finiteness of the size of the deck. This is an “experimental” result that anyone can verify by playing with an actual deck of cards. Do note that:

  • the minimum number of rising sequences in a deck of n cards is 1. A deck with a single rising sequence is one in which the ordering of the cards is \pi(i) = i for all i \in [n].
  • the maximum number of rising sequences in a deck of n cards is n. A deck with n rising sequences is one in which the ordering of the cards is \pi(i) = n + 1 - i for all i \in [n].
  • after k dovetail shuffles, there are at most \min\{2^k,n\} rising sequences.

We introduce function r : \mathcal{S}_n \to [n], where r(\pi) is defined to be the number of rising sequences in \pi. From Bayer & Diaconis’ paper [2], we know that the probability mass function on \mathcal{S}_n after k dovetail shuffles, p^k : \mathcal{S}_n \to [0,1], is defined by

p^k (\pi) = \displaystyle\binom{2^k + n - r(\pi)}{n} 2^{-n k}.

Since we know the probability mass function on \mathcal{S}_n after k shuffles, we can study shuffling from an information-theoretic viewpoint. In particular, we are interested in how the entropy of the probability mass function p^k evolves as the deck is shuffled.

__________

Entropy Generation

The Shannon entropy [3] of the probability mass function p^k is

H(p^k) = - \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 p^k (\pi) = - \displaystyle\sum_{i=1}^n \sum_{\pi \in \mathcal{S}_n : r(\pi) = i} p^k (\pi) \log_2 p^k (\pi)

After k shuffles there are at most \min\{2^k,n\} rising sequences in the deck. Therefore, we only need to sum \min\{2^k,n\} terms

H(p^k) = - \displaystyle\sum_{i=1}^{\min\{2^k,n\}} \sum_{\pi \in \mathcal{S}_n : r(\pi) = i} p^k (\pi) \log_2 p^k (\pi).

As mentioned in [2], the cardinality of the set \{\pi \in \mathcal{S}_n : r(\pi) = i\} is the Eulerian number \left \langle {n\atop i} \right \rangle. Let p_i^k denote the probability p^k (\sigma) for any \sigma \in \{\pi \in \mathcal{S}_n : r(\pi) = i\}, as follows

p_i^k = \displaystyle\binom{2^k + n - i}{n} 2^{-n k}

for i \in \{1,2,\dots, \min\{2^k,n\}\}.  The entropy of the probability mass function p^k is thus given by

H(p^k) = - \displaystyle\sum_{i=1}^{\min\{2^k,n\}} \left \langle {n\atop i} \right \rangle p_i^k \log_2 p_i^k.

Hence, we can compute H(p^k) by summing \min\{2^k,n\} terms (instead of summing |\mathcal{S}_n| = n! terms), which makes the computation tractable for modestly large values of n, assuming that the Eulerian numbers can be computed efficiently, of course. A non-recursive formula for the Eulerian number \left \langle {n\atop i} \right \rangle is

\displaystyle \left \langle {n\atop i} \right \rangle = \displaystyle\sum_{j = 0}^i \binom{n+1}{j} (-1)^j (i-j)^n.

Using the expression for p_i^k above to compute the entropy H(p^k) is potentially dangerous: the binomial coefficients may take values which are too big, and this might lead to numerical errors. Therefore, it would be wise to break up the binomial coefficients into products

p_i^k =2^{- n k} \displaystyle\prod_{m=1}^n \left(1 + \frac{2^k - i}{m}\right)

and to write logarithms of products as sums of logarithms

\log_2 p_i^k = -n k + \displaystyle\sum_{m=1}^n \log_2 \left(1 + \frac{2^k - i}{m}\right).

However, do note that 2^{-n k} might take very small values as we might want to consider decks with, say, n = 312 cards. A safer way would be to “combine” 2^{-n k} with the Eulerian numbers

2^{-n k} \left \langle {n\atop i} \right \rangle = \displaystyle\sum_{j = 0}^i \binom{n+1}{j} (-1)^j \left(\frac{i-j}{2^k}\right)^n.

Let the uniform probability mass function on \mathcal{S}_n, u : \mathcal{S}_n \to [0,1], be defined by u(\pi) = \frac{1}{n!} for all \pi \in \mathcal{S}_n. The entropy of the uniform probability mass function u is

H(u) = - \displaystyle\sum_{\pi \in \mathcal{S}_n} u (\pi) \log_2 u(\pi) = \displaystyle\sum_{\pi \in \mathcal{S}_n} \frac{1}{n!} \log_2 (n!) = \log_2 (n!)

and, since the uniform distribution maximizes entropy, we conclude that 0 \leq H(p^k) \leq \log_2 (n!). We denote H_{max} = \log_2 (n!). In [1] Aldous & Diaconis use an algebraic argument to show that p^k \rightarrow u as k \rightarrow \infty, i.e., the probability mass funtion converges to the uniform one as the number of shuffles tends to infinity. Therefore, we can conclude that H(p^k) \rightarrow H(u) = H_{max} as k \rightarrow \infty and, hence, for every \varepsilon > 0 there exists a k^* \in \mathbb{N}_0 such that k \geq k^* implies that (1 - \varepsilon) H_{max} \leq H(p^k) \leq H_{max}. If one is given a value of \varepsilon > 0, then one can find how many shuffles suffice to attain an entropy higher than (1 - \varepsilon) H_{max}.

The question “how many shuffles suffice to randomize a deck of n cards?” is ill-posed. One must specify a measure of “randomness” (in this post, we use the Shannon entropy to measure randomness) and a value of \varepsilon to measure how much randomness is “enough” for us to say that a deck is mixed.

As mentioned before, each dovetail shuffle tends to double the number of rising sequences, until saturation occurs. Hence, after k dovetail shuffles, where k \in \mathbb{N}_0 is “small”, there should be r(\pi) \approx 2^k rising sequences. We can make the approximation

p^k (\pi) = \displaystyle\binom{2^k + n - r(\pi)}{n} 2^{-n k} \approx \binom{n}{n} 2^{-n k} = 2^{-n k}

which yields

H(p^k) = - \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 p^k (\pi) \approx - \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 \left(2^{-n k}\right)

and, thus

H(p^k) \approx \underbrace{\left(\sum_{\pi \in \mathcal{S}_n} p^k (\pi)\right)}_{=1} n k = n k

for “small” k. We conclude that dovetail shuffling generates entropy at a rate of approximately n bits per shuffle, until the deck becomes saturated. Using the approximation H(p^k) \approx n k, we have that maximum entropy would be attained after approximately k^{\dagger} = H_{max} / n = \log_2(n!) / n shuffles.

In figures 1 and 2 we plot (in blue) the normalized entropy H(p^k) / H_{max} as a function of the number of dovetail shuffles, k, for decks of sizes n = 52 and n = 104, respectively. We plot also (in red) the approximation H(p^k) \approx n k, and we plot (in magenta) the maximum normalized entropy.

[ Figure 1: Normalized entropy as a function of the number of shuffles for a deck with n = 52 cards ]

[ Figure 2: Normalized entropy as a function of the number of shuffles for a deck with n = 104 cards ]

Note that, as predicted theoretically, the entropy increases in an almost linear fashion during the first few shuffles. In fact, the approximation H(p^k) \approx n k is surprisingly good until k = 3 for n = 52, and until k = 4 for n = 104. For n = 52, we have k^{\dagger} \approx 4.33, which is a rough estimate of the number of shuffles after which the deck becomes saturated.

__________

Approach to Uniformity

As shown by Aldous & Diaconis in [1], the stationary probability mass function of the Markov chain is the uniform one. Therefore, the entropy generated by dovetail shuffling converges to the maximum entropy. We measure how “close” to the maximum entropy we are after k shuffles using the Kullback-Leibler distance [3] between the probability mass functions p^k and u, as follows

\begin{array}{rl} D (p^k \| u) &= \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2\left[\frac{p^k (\pi)}{u (\pi)}\right]\\ &= \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \left(\log_2 p^k (\pi) - \log_2 u (\pi)\right)\\ &= \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 p^k (\pi) - \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 u (\pi)\\ &= \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 p^k (\pi) + \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 (n!)\\ &= - \underbrace{\left[- \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 p^k (\pi)\right]}_{= H(p^k)} + \underbrace{\left(\displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi)\right)}_{=1} \log_2 (n!)\end{array}

and, hence, we obtain

D (p^k \| u) = \log_2 (n!) - H(p^k).

Since 0 \leq H(p^k) \leq \log_2 (n!), we have that

0 \leq D (p^k \| u) \leq \log_2 (n!).

We can rewrite the equation above in the form

D (p^k \| u) = H(u) - H(p^k),

or in the form

D (p^k \| u) = \displaystyle\max_q H(q) - H(p^k),

because among all probability mass functions on \mathcal{S}_n, the one that maximizes the entropy is the uniform distribution u. Since the logarithm is a strictly concave function, we can use Jensen’s inequality [3] to obtain

\begin{array}{rl} - H(p^k) &= \displaystyle\sum_{\pi \in \mathcal{S}_n} p^k (\pi) \log_2 p^k (\pi)\\ &\leq \log_2 \left[\displaystyle\sum_{\pi \in \mathcal{S}_n} \left(p^k (\pi)\right)^2\right] = \log_2 \left(\|p^k\|_2^2\right)\end{array}

and, therefore, we have the following upper bound on the Kullback-Leibler distance

\begin{array}{rl} D (p^k \| u) &= \log_2 (n!) - H(p^k)\\ &\leq \log_2 (n!) + \log_2 \left(\|p^k\|_2^2\right)\\ &= \log_2 \left(n! \|p^k\|_2^2\right)\\ &= 2 \log_2 \left(\sqrt{n!} \|p^k\|_2\right)\end{array}.

Since \|u\|_2 = 1 / \sqrt{n!}, the inequality above can be rewritten as follows

D (p^k \| u) \leq 2 \log_2 \left(\frac{\|p^k\|_2}{\|u\|_2}\right).

Imagine that our initial probability mass function is a unit mass at permutation \sigma, i.e., p^0(\pi) = \delta_{\pi\sigma}, where \delta_{ij} is the Kronecker delta. If this is the case, then there is zero uncertainty and, hence, the entropy is H(p^0) = 0, while the 2-norm is maximum, \|p^0\|_2 = 1. At the other extreme, the uniform probability mass function has maximum entropy H_{max}, but minimum 2-norm, \|u\|_2 = 1 / \sqrt{n!}. Hence, we have that

1 / \sqrt{n!} = \|u\|_2 \leq \|p^k\|_2 \leq \|\delta_{\pi\sigma}\|_2 = 1.

Since p^k \rightarrow u as k \rightarrow \infty, then we conclude that \|p^k\|_2 \rightarrow \|u\|_2 and, from inequality D (p^k \| u) \leq 2 \log_2 \left(\|p^k\|_2 / \|u\|_2\right), we have that D (p^k \| u) \rightarrow 0. For k sufficiently large, the decay of \|p^k\|_2 is given by the second largest (in absolute value) eigenvalue of the transition probability matrix of the Markov chain. Hence, for large k there exist constants c \in \mathbb{R}^{+} and \mu \in (0,1), such that

\|u\|_2 \leq \|p^k\|_2 \leq \|u\|_2 \left(1 + \frac{c}{2} \mu^k\right)

and, thus, 1 \leq \|p^k\|_2 / \|u\|_2 \leq \left(1 + (c / 2) \mu^k\right). Applying logarithms, we obtain

0 \leq \log_2\left(\frac{\|p^k\|_2}{\|u\|_2}\right) \leq \log_2 \left(1 + \frac{c}{2} \mu^k\right) \leq \frac{c}{2} \mu^k

where we used the inequality \log_2(1+x) \leq x. From the inequality above and the inequality on the relative entropy, we conclude that D (p^k \| u) \leq c \mu^k for sufficiently large k. Hence, the Kullback-Leibler distance converges to zero exponentially.

__________

Concluding Remarks

We have concluded that during the first k shuffles, where 2^k < n, the dovetail shuffling process generates entropy at a rate of approximately n bits per shuffle. However, when 2^k > n, then \min\{2^k,n\} becomes n and the sum in the formula for the entropy “saturates” in the sense that the number of terms being summed no longer increases with k. For k > \log_2(n), the deck is saturated and entropy approaches the maximum entropy asymptotically, as D (p^k \| u) decays exponentially. This interesting phenomenon resembles a phase transition and was pointed out by Trefethen & Trefethen in [4], while in [5] Stark et al. present detailed analytical results.

__________

References

[1] David Aldous, Persi Diaconis, Shuffling cards and stopping times, The American Mathematical Monthly, Vol. 93, No. 5, pp. 333–348, Mathematical Association of America, 1986.

[2] Dave Bayer, Persi Diaconis, Trailing the Dovetail Shuffle to its Lair, The Annals of Applied Probability, Vol. 2, No. 2, pp. 294-313, May 1992.

[3] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, John Wiley & Sons 2006.

[4] L. N. Trefethen, L. M. Trefethen, How many shuffles to randomize a deck of cards?, Proceedings of the Royal Society of London, Vol. 456, No. 2002, pp. 2561-2568, 2000.

[5] Dudley Stark, A. Ganesh, Neil O’Connell, Information Loss in Card Shuffling, HP Laboratories Bristol, September 1999.

__________

Related:


Follow

Get every new post delivered to your Inbox.

Join 76 other followers