Imagine that you go on a road trip with some friends. One of your friends has a video camera and films the best moments of the trip. After the trip, your friend edits the footage and creates a movie with the very best moments. Your friend could then burn a CD (or a DVD) containing the movie, and send it to you via post-mail. However, that solution involves:

• wasting time to go to the mail office.
• spending money on CD’s, stamps, and envelopes.

We could do better than that: why not compress the movie into one or more ZIP files and upload them to a hosting website such as Rapidshare? Bandwidth is cheap these days, and downloading involves only a few mouseclicks. Your friend uploads the ZIP files to a hosting website and sends you the list of URL’s via email, including the size of each file next to each URL. Let us assume that:

• you need to download all ZIP files before you can unzip them and watch the movie your friend created.

You then think: “in which order should I download the files”? This is a decision problem under certainty. Let us try to model it!

__________

Mathematical Model

Let us say that we have a total of $n$ files to download, where $n \geq 2$. We consider the following variables and parameters (and respective units):

• $x_i$: size of the $i$-th file (unit: Megabytes).
• $T_i^D$: time it takes to download the $i$-th file (unit: seconds).
• $T_i^W$: time one needs to wait after downloading the $i$-th file (unit: seconds).
• $R$: downloading speed (unit: Megabytes per second)

for all $i \in \mathcal{I}$, where $\mathcal{I} = \{1, 2, \dots, n\}$. The time it takes to download the $i$-th file is

$T_i^D = \displaystyle\frac{x_i}{R}$

and the time one needs to wait after downloading the $i$-th file is

$T_i^W = 60 x_i$.

The total downloading time is the sum of each file’s downloading time plus the time one needs to wait after downloading each file. Note that after we download the last file, we can unzip the $n$ files and obtain the full-length movie. Therefore, when counting the total downloading time, we do not need to include the waiting time of the last downloaded file. Let us denote by $j \in \mathcal{I}$ the index of the last downloaded file. Thus, the total downloading time is

$T = \displaystyle\sum_{i \in \mathcal{I}} T_i^D + \displaystyle\sum_{i \in \mathcal{I} \setminus \{j\} } T_i^W$,

which can also be written as

$T = \displaystyle\sum_{i \in \mathcal{I}} \left(T_i^D + T_i^W\right) - T_j^W$,

or, then, as follows

$T = \left(60 + \frac{1}{R}\right)\displaystyle\sum_{i \in \mathcal{I}} x_i - 60 x_j$.

__________

Suppose that our objective is to minimize the total downloading time $T$. Let us introduce the cost function $C: \mathcal{I} \to \mathbb{R}_0^+$ defined by

$C(j) = T = \left(60 + \frac{1}{R}\right)\displaystyle\sum_{i \in \mathcal{I}} x_i - 60 x_j$.

Note that our decision variable is $j \in \mathcal{I}$, i.e., our goal is to find the value of $j \in \mathcal{I}$ that minimizes the total downloading time $T$. Note also that:

• $\left(60 + \frac{1}{R}\right)\displaystyle\sum_{i \in \mathcal{I}} x_i$ is the “fixed cost”.
• $\left(- 60 x_j \right)$ is the “variable cost”.

We cannot “control” the fixed cost, but we can try to minimize the variable cost. The minimal cost is thus

$C^* = \displaystyle\min_{j \in \mathcal{I}} C(j) = \left(60 + \frac{1}{R}\right)\displaystyle\sum_{i \in \mathcal{I}} x_i - \max_{j \in \mathcal{I}} \left\{60 x_j \right\}$.

Hence, we can conclude that the optimal $j \in \mathcal{I}$ is

$j^* = \displaystyle\arg\max_{j \in \mathcal{I}} \left\{ x_j \right\}$,

i.e., we should leave the biggest of all $n$ files for last! This is quite intuitive. Given that

• the downloading speed, $R$, is assumed to be constant.
• we need to download all the $n$ files.

we can conclude that the only way to minimize the total downloading time is to minimize the total waiting time, and the way to do that is to download the biggest file at the end.

Alternatively, we could have used the following cost function

$C(j) = \displaystyle\sum_{i \in \mathcal{I} \setminus \{j\} } T_i^W = \left(\displaystyle\sum_{i \in \mathcal{I}} 60 x_i\right) - 60 x_j$

where cost equals waiting time (i.e., we ignore the fixed cost). The optimal solution will be the same as before

$j^* = \displaystyle\arg\max_{j \in \mathcal{I}} \left\{ x_j \right\}$,

and the only difference will be that the minimum cost will now be the minimum waiting time, not the minimum total downloading time.

__________

Concluding Remarks

The solution we have computed only tells us which file to download in the $n$-th download. The solution tells us nothing on which files to download on the previous $n-1$ downloads. We can then conclude that:

• there are $n!$ distinct ways of downloading the files.
• out of those $n!$ ways, $(n-1)!$ of them are optimal.

If all the files have the same size (i.e., if $x_i = \bar{x}, \, \forall i \in \mathcal{I}$), then there are $n!$ optimal ways of downloading them. In other words, when all files have the same size, the order in which one downloads them is irrelevant, which is somewhat obvious.

This problem could perhaps be generalized. Any ideas?

__________

Tags:

1. Vishal Says:

Rod,

The first problem could, perhaps, also be solved using the Rearrangement Inequality. Though I don’t come from a (math) olympiad background, I do know this inequality is very much a part of olympiad training when learning the topic of inequalities. It can certainly be used to prove the (familiar) AM-GM, the Cauchy-Schwarz, and the Chebyshev inequalities.

Your solution is, indeed, simple to understand and intuitive!

2. Howard Yeend Says:

Great article. You’re right that the solution is very intuitive.

Regarding generalisation: do you mean what if you’re completely unable to determine file size prior to ending the download? Or what if there’s an extra step involved in determining file size? (loading a “file info” page on the host for example, a process which takes a fixed time and invokes no penalty)

Something about this is reminiscent of bin-packing, but the solution is really just a consequence of the fact that the final downloaded file invokes no penalty. Of course, if you change it such that the time penalty is applied before the download, then all strategies are optimal.

Interesting stuff though.

• Rod Carvalho Says:

Frankly, I have no idea what I was thinking back in Feb. 2008 when I suggested the possibility of generalizing the problem. I edited the last few lines of the post, in which I mentioned generalizations, because I really can’t understand what I meant when I wrote them.

The approach discussed in this post is, essentially, decision under certainty. It’s somewhat austere, as there’s no room at all for uncertainty. How could one generalize the problem under perfect certainty? What if the downloading speed would depend on how many Megabytes one had downloaded before? The more one downloaded, the slower the download, for instance. Or, what if there was a ceiling on how much one could download and, once that ceiling was exceeded, the new rate would become $R' = \alpha R$, where $\alpha \in (0,1)$.

If we introduce uncertainty, then things can get even more interesting. For instance, suppose that one does not know the size of the files and that, instead, one has a probability distribution $p_i$ for the size of each file. One could then try to minimize the expected downloading time. Is this mathematically interesting? Yes! Does this make sense from a practical viewpoint? I don’t think so. How would one obtain reliable estimates of the probability distributions, after all?