Distributed Averaging

By Rod Carvalho

1 – Problem Statement

Say we have a network of n agents where each agent takes measurements and communicates with a few other agents. The agents form a network. Each agent

  1. measures a given quantity
  2. stores the measurement in its internal state
  3. updates its internal state by replacing its own state with a weighted average of its neighboring agents’ states and its own state.

Our objective: we want each agent’s internal state to converge to the arithmetic average of all measurements.

What we want to know: which weights should each agent use to compute the weighted average?

Network of agents

What we want to devise is a sort of “network protocol” that allows a network of agents to compute the arithmetic average of all agents’ measurements. We want this protocol to be completely distributed, decentralized and non-hierarchical, i.e., all agents are “equal” (there is no hierarchy), and each agent follows the same rules.

The agents could compute the average in a centralized manner:

  1. the agents would elect a “leader” to be the network’s sink node
  2. the agents would pass messages to their neighbors and each node would route the messages to the sink node
  3. the sink node would collect all messages from all agents, and it would then compute the average of all measurements
  4. the sink node would broadcast the average value to its neighbors and the average would be diffused across the network.

Nevertheless, such a centralized approach has two major weaknesses:

  • lack of robustness: if the sink node is removed, nothing works. The nodes will be routing messages to a sink node that no longer exists, and the average will not be computed nor broadcast.
  • flooding: routing messages to the sink node will “flood” the network with messages. If we could somehow “combine” incoming messages at a given node and then pass just one message containing a linear combination of the incoming messages, we would save a lot of messages and thus avoid flooding the network. Some of you may think of this as a sort of Network Coding, and there are in fact similarities with Network Coding. However, in Network Coding one works with finite fields (otherwise the size of the alphabet would increase), while in this distributed averaging framework one works with the real field \mathbb{R}.

One can then conclude that it would be great to devise a distributed, decentralized, non-hierarchical approach in which all nodes are simultaneously source nodes, routers and sink nodes. [1]

__________

2 – The Network as a Graph

Let the network be modeled as an undirected connected graph \mathcal{G} = (\mathcal{N},\mathcal{E}), where:

  • \mathcal{N} = \{1,2, \ldots, n\} is the set of nodes.
  • \mathcal{E} = \{\{i,j\} \mid i, j \in \mathcal{N}, i \neq j\} is the set of edges. Graph  \mathcal{G} has no self-edges \{i,i\}.

The set of neighbors of node i (i.e., the neighborhood of node i) is denoted by

\mathcal{N}_i = \{j \mid \{i,j\} \in \mathcal{E}\}.

Each edge of undirected graph \mathcal{G} is represented by an unordered pair \{i,j\}, where i, j \in \mathcal{N}, i \in \mathcal{N}_j, and j \in \mathcal{N}_i. If the graph were directed, then each edge would be represented by an ordered pair (i,j), where node i would be the edge’s tail and node j would be the edge’s head.

The number of nodes and edges is given by the cardinality of sets \mathcal{N} and \mathcal{E}: we have |\mathcal{N}| = n nodes and |\mathcal{E}| = m edges, where m \geq n-1 and m \leq n (n-1) / 2. The extreme cases are:

  • m= n-1 is where graph \mathcal{G} is a tree.
  • m = \displaystyle\binom{n}{2} = \displaystyle\frac{n (n-1)}{2} is where graph \mathcal{G} is complete (i.e., every two distinct nodes are connected by an edge).

As mentioned before, graph \mathcal{G} = (\mathcal{N},\mathcal{E}) has no self-edges. Consider now the “augmented” graph \mathcal{G}_A = (\mathcal{N},\mathcal{E}_A), where the new (”augmented”) set of edges is

\mathcal{E}_A = \mathcal{E} \cup \{1, 1\} \cup \{2, 2\}\ldots \cup \{n, n\}.

Note that |\mathcal{E}_A| = |\mathcal{E}| + n. In other words, the augmented graph \mathcal{G}_A is graph \mathcal{G} plus n self-edges.

__________

3 – Distributed Averaging

Each agent, i.e., each network node i \in \mathcal{N} takes a measurement y_i \in \mathbb{R}, and stores it in its internal state x_i. The state x_i is thus initialized with the measurement, x_i(0) = y_i, and then the state is propagated forwards in time according to the recursion formula proposed by Lin Xiao and Stephen Boyd [2]

x_i(k+1) = W_{ii} x_i(k) + \displaystyle\sum_{j \in \mathcal{N}_i} W_{ij} x_j(k).

Note that this recursion can be interpreted as follows: node i \in \mathcal{N} initializes its internal state with the measurement y_i \in \mathbb{R}, and then updates its state by replacing it with the weighted average of its own and its neighboring agents’ states. Note that

  • W_{ii} is node i’s self weight.
  • W_{ij} is the weight used by node i to weigh x_j(k), where j \in \mathcal{N}_i.

Note that each agent can only pass a message to its neighboring agents. Since graph \mathcal{G} is connected, a message from any given node will eventually reach any other node in the network.

We will denote

\bar{y} = \displaystyle\frac{1}{n} \displaystyle\sum_{i=1}^{n} y_i = \frac{1}{n} \mathbf{1}_n^T y

where:

  • \mathbf{1}_n \in \mathbb{R}^n is the vector whose entries are all ones.
  • y \in \mathbb{R}^n is the vector of measurements.

Our goal is to find the set of weights \{W_{ij}\}_{i \in \mathcal{N}, j \in \mathcal{N}_i} and the set of self-weights \{W_{ii}\}_{i \in \mathcal{N}} so that all agents’ internal states converge to the average value of all agents’ measurements

\displaystyle\lim_{k \to \infty} x_i(k) = \bar{y}.

Let function f: \mathbb{R}^n \rightarrow \mathbb{R} be defined as

f(y) = \displaystyle\frac{1}{n} \mathbf{1}_n^T y,

then the aforementioned recursive equations give us an algorithm that (if some conditions are satisfied) will compute f(y) in a decentralized and distributed manner, where each node in the network uses only local information (the value of its neighbors’ states). If all nodes could use global information, i.e., if all nodes knew all other nodes’ measurements, then computing the average would be extremely easy. Imposing the constraint that each node can only use local information is what makes this problem interesting.

__________

4 – Geometric Interpretation

Let

x(k) = \left[\begin{array}{c}x_1(k)\\ x_2(k) \\ \vdots\\ x_n(k)\end{array}\right]

be the state vector at time step k \in \{0, 1, 2, \ldots\}, then the aforementioned recursion equation can be written more compactly as

x(k+1) = W x(k),

where:

  • W \in \mathbb{R}^{n \times n} is the weight matrix, whose sparsity pattern is imposed by graph \mathcal{G}_A.
  • the state vector is initialized with the measurement vector, x(0) = y.

Since x(0) = y, we have

x(k) = W^k y.

From a “geometric” point of view, our goal is to steer the state vector from x(0)=y to the desired / target state

x_{des} = \bar{y} \mathbf{1}_n = \displaystyle\frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T y

in the least possible time. We are thus faced with the problem of finding a weight matrix W with a sparsity pattern imposed by graph \mathcal{G}_A such that the state vector converges to the desired state x_{des}

\displaystyle\lim_{k \to \infty} x(k) = x_{des} \Rightarrow \displaystyle\lim_{k \to \infty} W^k y = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T y.

Following Lin Xiao and Stephen Boyd’s rationale [2], we would like the sum of the vector’s entries to be preserved

\mathbf{1}_n^T x(k+1) = \mathbf{1}_n^T x(k)

so that the arithmetic average is also preserved. Imposing the constraint that the sum / mean should be preserved is equivalent to imposing the constraint that the state vector x(\cdot) must lie on the hyperplane

\mathcal{P}(y) = \{z \in \mathbb{R}^n \mid \mathbf{1}_n^T z = \mathbf{1}_n^T y\}.

Note that hyperplane \mathcal{P}(y) is defined by the measurement vector y. Given two distinct measurement vectors y_1 and y_2, we will have two distinct (and parallel) hyperplanes \mathcal{P}(y_1) and \mathcal{P}(y_2).

In the next section, we will discuss some properties of the weight matrix.

__________

5 – The Weight Matrix

The weight matrix W \in \mathbb{R}^{n \times n} has a sparsity pattern imposed by the “structure” of the augmented graph \mathcal{G}_A:

  • W_{ij} \neq 0 (in general) if \{i, j\} \in \mathcal{E}
  • W_{ij} = 0 if \{i, j\} \notin \mathcal{E}.

The fewer edges graph \mathcal{G} has, the more sparse matrix W is. For example, if the graph is the one depicted below

Graph with n=5 nodes

then the weight matrix is

 W = \left[ \begin{array}{ccccc} W_{11}  & W_{12} & 0     & 0     & W_{15}\\ W_{21}  & W_{22} & W_{23} & W_{24} & W_{25}\\ 0       & W_{32} & W_{33} & W_{34} & 0\\ 0 & W_{42} & W_{43} & W_{44} & W_{45}\\ W_{51}  & W_{52} & 0     & W_{54} & W_{55}\\ \end{array}\right],

and the adjacency matrix is

A = \left[ \begin{array}{ccccc} 0 & 1 & 0 & 0 & 1\\ 1 & 0 & 1 & 1 & 1\\ 0 & 1 & 0 & 1 & 0\\ 0 & 1 & 1 & 0 & 1\\ 1 & 1 & 0 & 1 & 0\\ \end{array}\right].

Note that the sparsity of the adjacency matrix is imposed by graph \mathcal{G}, while the sparsity of the weight matrix is imposed by graph \mathcal{G}_A.

Let matrix \bar{W} be

\bar{W} =  \displaystyle\frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T,

then we have

x_{des} =  \bar{W} y.

Note that imposing the constraint that each node can only use local information is equivalent to imposing the constraint that the weight matrix W must have a sparsity pattern given by graph \mathcal{G}_A. If we relax this constraint and allow nodes to use global information, which is equivalent to allowing the weight matrix NOT to have a sparsity pattern imposed by graph \mathcal{G}_A, then the desired / target vector can be reached in one single iteration: x(0) = y and x(1) = x_{des}.

We want to impose the sparsity constraint on the weight matrix, and we want the state vector to converge to the desired / target state

\displaystyle\lim_{k \to \infty} x(k) = x_{des} \Rightarrow \displaystyle\lim_{k \to \infty} W^k y = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T y = \bar{W} y.

We thus have the following limit

\displaystyle\lim_{k \to \infty} W^k = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T =  \bar{W}.

Note also that imposing the condition \mathbf{1}_n^T x(k+1) = \mathbf{1}_n^T x(k) implies that

\mathbf{1}_n^T W x(k) = \mathbf{1}_n^T x(k)

and consequently

\mathbf{1}_n^T W = \mathbf{1}_n^T,

which means that the sum of the entries of each column of the weight matrix W must be 1

W_{ii} + \displaystyle\sum_{j \in \mathcal{N}_i} W_{ji} = 1.

If all weights are nonnegative, then W is a left stochastic matrix.

Once again following Lin Xiao and Stephen Boyd’s rationale [2], we would like x_{des} = \frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T y to be a fixed point of recursion x(k+1) = W x(k), which implies that

W x_{des} = x_{des},

which means that

\displaystyle\frac{1}{n} W \mathbf{1}_n \mathbf{1}_n^T y = \displaystyle\frac{1}{n} \mathbf{1}_n \mathbf{1}_n^T y,

and consequently

W \mathbf{1}_n = \mathbf{1}_n,

i.e., the sum of the entries of each row of the weight matrix W must be 1

W_{ii} + \displaystyle\sum_{j \in \mathcal{N}_i} W_{ij} = 1.

If all weights are nonnegative, then W is a right stochastic matrix.

Note that if the sum of the entries of each row and each column of the weight matrix W must be 1 then

W_{ii} + \displaystyle\sum_{j \in \mathcal{N}_i} W_{ij} = W_{ii} + \displaystyle\sum_{j \in \mathcal{N}_i} W_{ji},

which yields

\displaystyle\sum_{j \in \mathcal{N}_i} W_{ij} = \displaystyle\sum_{j \in \mathcal{N}_i} W_{ji},

and thus

\displaystyle\sum_{j \in \mathcal{N}_i} \left(W_{ij} - W_{ji}\right) = 0.

If W = W^T, then W_{ij} = W_{ji} and \sum_{j \in \mathcal{N}_i} \left(W_{ij} - W_{ji}\right) will necessarily be zero. In other words,

W = W^T \Rightarrow \displaystyle\sum_{j \in \mathcal{N}_i} \left(W_{ij} - W_{ji}\right) = 0,

but if \sum_{j \in \mathcal{N}_i} \left(W_{ij} - W_{ji}\right)  = 0 is true, W is not necessarily symmetric.

Summarizing, the weight matrix W should have the following properties:

  • sparsity pattern imposed by graph \mathcal{G}_A.
  • \mathbf{1}_n^T W  = \mathbf{1}_n^T to ensure sum / mean preservation.
  • W \mathbf{1}_n = \mathbf{1}_n to ensure that x_{des} is a fixed point of the recursion.

We now know the properties of matrix W. The difficulty is now to find a matrix with the desired properties. As I will show on a later occasion, finding such matrix is not particularly difficult. What is difficult is to find a weight matrix that yields fast convergence. In [2] Lin Xiao and Stephen Boyd use optimization methods to find the weight matrix that yields the higher rate of convergence.

For a more in-depth analysis and discussion, I recommend you take a look at Lin Xiao’s PhD thesis [3].

__________

6 – Concluding Remarks

In this post, we talked about an abstract network of “agents”. This network could be a wireless sensor network, a team of robotic vehicles or a social network (among other possibilities). Irrespective of what the agents are, the network is always represented by a connected graph, with “agents” at the nodes and communications links at the edges.

We showed that finding a distributed, decentralized, non-hierarchical averaging protocol boils down to finding a weight matrix W with certain desired properties. How to find such weight matrices will be addressed in future posts.

__________

7 – References

[1] Dimitri P. Bertsekas and John N. Tsitsiklis, “Parallel and Distributed Computation: Numerical Methods”, Prentice-Hall, Englewood Cliffs, New Jersey, 1989.

[2] Lin Xiao, Stephen Boyd, “Fast Linear Iterations for Distributed Averaging“, Systems and Control Letters, vol. 53, pp. 65-78, 2004.

[3] Lin Xiao, “Decomposition and Fast Distributed Iterations for Optimization of Networked Systems”, PhD thesis, Aeronautics and Astronautics Department, Stanford University, June 2004.

Tags: , ,

11 Responses to “Distributed Averaging”

  1. rod. Says:

    Gosh, this is the longest post I ever wrote. It took me several days to write it. IMHO it’s also the best post I ever wrote.

    Despite my late-night proof-reading, I am sure there are still typos. I will correct them as I find them.

    In the future, I might update this post slightly in case I find a better, more simple or more elegant way to present the ideas I addressed in here.

  2. Shubhendu Trivedi Says:

    Hey i enjoyed your post. It is the best post you have written in my opinion as well. I enjoy your posts otherwise.

    But i generally like posts that involve a lot of personal research. I like posts that are like research essays themselves! I like writing in this style myself whenever i get the time. It is a different case that i have only put 2 or 4 in my personal page so far only!

    Don’t worry about typos.
    Keep it up man. FANTASTIC post. FANTASTIC!

  3. Shubhendu Trivedi Says:

    As you pointed out, networks with a central “guiding brain” have a lot of problems. Like one point of operation means one point of failure. Also if the network size is increased many times. Say a thousand times, a lot of BW would be lost in the various nodes communicating their information to the central node. And global information is frequently out of date!So decentralized systems counter that very well. Giving a very robust alternative!

    Something that came to my mind while reading this post were some papers on adaptive routing in MANETs, which (by definition) dynamically form a temporary network, without using any existing network infrastructure or centralized administration. Among the many advantages you counted on because of decentralisation. One more could be of power saving as we can then work out most used and least used routes in the network better as well.

    A very good post indeed! The part on the weighted matrix especially.

  4. Yaroslav Bulatov Says:

    I don’t quite get how the observations enter into the picture — after all, if the matrix is nonnegative, irreducible, aperiodic, the initial state is forgotten when your recursion is x_{k+1}=W x_k.

    My first impression is that the recursion should be x_{k+1} = A x_k + y. Then the task would be to find a matrix with given sparsity pattern and large spectral gap which generates \bar{y} as the solution of that fixed point equation

  5. David Says:

    How long do you have to compute the average and to what precision?

  6. rod. Says:

    @ Yaroslav

    My knowledge of matrix theory is kind of rusty right now, but I will be thinking on your suggestions. The recursion that I presented in this post is the one I have seen in most papers in the area of average consensus / distributed averaging. And there’s a reason for it: the recursion x(k+1) = W x(k) can be regarded as a gradient descent method in which we want to minimize a convex, quadratic “cost” function. I will write more about it in my next post. It’s quite interesting.

    @ David

    The graph is connected, and therefore in a number of iterations equal to the graph’s diameter, each node should be able to compute the exact average of all measurements. If the network is too large and we can tolerate a small error in the computation of the average, then fewer recursions will be needed.

  7. Carnival of Mathematics 29 « Quomodocumque Says:

    [...] of Reasonable Deviations explains how the natural problem Say we have a network of agents where each agent takes measurements and [...]

  8. Aurelie Says:

    That post was a great read! We need more people blogging about their research! Keep up the good work.

  9. Randall at Starbucks « Reasonable Deviations Says:

    [...] I returned to Pasadena last Saturday. Yesterday (Sunday) I was having a drink and working on the distributed averaging problem at the Starbucks on South Lake Avenue and East California Boulevard. While I sipped my [...]

  10. rod. Says:

    @ Aurelie

    Thanks for the kudos :-)

    I must say I am surprised that anyone had the “courage” (i.e., time and patience) to read the whole post, eh eh eh

  11. Art Says:

    Hey, nice post. I was searching the net for info on distributed averaging.

    Has anybody ever worked on the problem where each agent’s measurement is not fixed but changes as a function of the agent’s current belief in the global average?

Leave a Reply