## Archive for January, 2009

### Distance between two words

January 31, 2009

Consider a finite set $\Sigma$ with $|\Sigma|$ distinct elements which we will call symbols. We will call set $\Sigma$ an alphabet. A word (also known as string) of length $n$ over set $\Sigma$ is formed by constructing an $n$-tuple of symbols from alphabet $\Sigma$

$w = (w_1, w_2, w_3, \ldots, w_n)$

where $w_i \in \Sigma$ for all $i \in \{1, 2, \ldots, n\}$. A string of length $n$ will be called an $n$-string. The set of all $n$-strings is given by the cartesian product

$\Sigma^n = \Sigma \times \Sigma \times \ldots \times \Sigma = \displaystyle\prod_{i=1}^n \Sigma$.

For example, in coding theory one usually works with binary words, i.e., words over alphabet $\mathbb{F}_2 = \{0,1\}$. The set of all binary words of length $n$ is $\mathbb{F}_2^n = \{0,1\}^n$, whose cardinality is $| \mathbb{F}_2^n | = 2^n$.

In this post we will work only with words over the alphabet

$\Sigma = \{\text{a} ,\text{b} ,\text{c} ,\text{d} ,\text{e} ,\text{f} ,\text{g} ,\text{h} ,\text{i} ,\text{j} ,\text{k} ,\text{l} ,\text{m} ,\text{n} ,\text{o} ,\text{p} ,\text{q} ,\text{r} ,\text{s} ,\text{t} ,\text{u} ,\text{v} ,\text{w} ,\text{x} ,\text{y} ,\text{z}\}$

which is known as the Latin alphabet, and whose 26 symbols are known as letters. For example,  $\textbf{cat}$ and $\textbf{car}$ are represented by the 3-strings $s_1 = (\text{c}, \text{a}, \text{t})$ and $s_2 = (\text{c}, \text{a}, \text{r})$, respectively, where $s_1, s_2 \in \Sigma^3$.

__________

Hamming distance

Given two strings of equal length, the Hamming distance between them is the number of positions for which the corresponding symbols are different. In other words, the Hamming distance between two strings of equal length is the minimum number of symbol substitutions required to change one string into the other.

Let function $d_{H} : \Sigma^n \times \Sigma^n \rightarrow \mathbb{Z}_0^{+}$ be defined by

$d_{H} (v, w) = \displaystyle \sum_{i=1}^n \mu \left( v_i, w_i \right)$

where function $\mu : \Sigma \times \Sigma \rightarrow \{0,1\}$ is defined by

$\mu (a, b) = \displaystyle\left\{\begin{array}{rl} 0 & \text{if} \quad{} a = b\\ 1 & \text{if} \quad{} a \neq b\\ \end{array}\right.$

then $d_H (s_1, s_2)$ is the Hamming distance between $n$-strings $s_1$ and $s_2$. For example, if $s_1 = (\text{c}, \text{a}, \text{t})$ and $s_2 = (\text{c}, \text{a}, \text{r})$, then we have $d_H (s_1, s_2) = 1$ because $s_1$ and $s_2$ only differ in one symbol (the last one).

Suppose we are given an $n$-string over the Latin alphabet $\Sigma$. If we replace the letter at position $i$ with another letter in $\Sigma$, where $i \in \{1, 2, \ldots, n\}$, we can obtain a total of $|\Sigma| - 1$ new $n$-strings, each within Hamming distance $1$ of the original $n$-string. Let us call these $n$-strings neighbors of the original $n$-string. Thus, an $n$-string has a total of $(|\Sigma| - 1) n$ neighbors.

__________

Word graphs

If two $n$-strings are neighbors of one another when their Hamming distance is equal to $1$, then a graphical approach would be the natural way to go. Let $G = (N,E)$ be an undirected graph whose set of nodes is $N = \Sigma^n$ (set of all $n$-strings over alphabet $\Sigma$), and whose set of edges is $E \subset \Sigma^n \times \Sigma^n$. Two nodes are connected by an edge if their Hamming distance is equal to $1$. Hence the total number of nodes in $G$ is $|N| = |\Sigma^n| = 26^n$, and the total number of edges in $G$ is

$|E| = \displaystyle\frac{n | \Sigma^n|}{2} \left(| \Sigma| - 1\right) = \displaystyle\frac{n}{2} | \Sigma|^n\left(| \Sigma| - 1\right)$,

because each node has $(|\Sigma| - 1) n$ neighbors, there are $|\Sigma^n|$ nodes, and each edge is counted twice so we need to divide the total by $2$. Note that $G$ is a regular graph (of degree $|\Sigma| - 1$) because all nodes have the same number of neighbors.

For example, let $n=3$ and let us consider the $3$-string $(\text{c}, \text{a}, \text{r})$, which has $(|\Sigma| - 1) n = 75$ neighbors. Out of these $75$ words, I could count $21$ words which actually mean something in the English language:

bar, ear, far, gar, jar, mar, oar, par, tar, war, yar, cor, cur, cab, cad, cam, can, cap, cat, caw, cay.

Please let me know if I forgot to include any word in this list. These words form a sub-graph of the graph of all $3$-strings. Note that the words

bar, car, ear, far, gar, jar, mar, oar, par, tar, war, yar

are all neighbors of one another, and hence, they form a clique of size $12$. The words

car, cor, cur

form a clique of size $3$. The words

cab, cad, cam, can, cap, car, cat, caw, cay

are all neighbors of one another as well, and thus they form a clique of size $9$. The subgraph formed by the word “car” and its neighbors which make sense in the English language is illustrated below, where the three cliques are easy to identify

[ graph generated with Graphviz ]

It is not practical to depict the whole graph of $3$-strings over alphabet $\Sigma$ since it contains $|\Sigma^3| = |\Sigma|^3 = 26^3 = 17576$ nodes and $|E| = 659100$ edges!

__________

Python code

In case you are wondering: no, I did not type the 26 symbols of the Latin alphabet in $\LaTeX$. That would have been way too cumbersome. All it took was one single line of Python code:


", ".join([("\text{" + chr(i) + "}") for i in range(ord('a'),ord('z')+1)])



Suppose we would like to find all strings within Hamming distance of $1$ of string $(\text{c}, \text{a}, \text{t})$. The following Python script automates the process:

def MutateString(s, alphabet, index):
li = []
for ch in alphabet:
if ch != s[index]:
li.append(s[:index] + ch + s[index+1:])
return li

# build Latin alphabet (lowercase symbols only)
Sigma = [chr(i) for i in range(ord('a'),ord('z')+1)]

# obtain list of mutated strings
print "Mutate 1st symbol:\n" + ", ".join(MutateString('cat', Sigma, 0)) + '\n'
print "Mutate 2nd symbol:\n" + ", ".join(MutateString('cat', Sigma, 1)) + '\n'
print "Mutate 3rd symbol:\n" + ", ".join(MutateString('cat', Sigma, 2)) + '\n'


The graph of the word “car” and its neighbors was generated also by a Python script using the GvGen library.

def HammingDistance(s1, s2):
"""Computes the Hamming distance between strings s1 and s2"""
assert len(s1) == len(s2)
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

import gvgen

# list of words
words = ['car', 'bar', 'ear', 'far', 'gar', 'jar', 'mar', 'oar', 'par', 'tar', 'war', 'yar', 'cor', 'cur', 'cab', 'cad', 'cam', 'can', 'cap', 'cat', 'caw', 'cay']

# creates the new graph instance
graph = gvgen.GvGen()
graph.smart_mode = 1

# creates graph nodes
nodes = []
for elem in words:
nodes.append(graph.newItem(elem))

# creates graph edges
for i in nodes:
for j in nodes:

# extracts the labels of each node
node1 = i['properties']['label']
node2 = j['properties']['label']

# links nodes i and j if the Hamming
# distance between their labels is 1
if HammingDistance(node1,node2) == 1:

# outputs the Graphviz code to a .dot file
filename = "test.dot"
FileOut = open(filename,"w")
graph.dot(FileOut)
FileOut.close()


The script outputs a DOT file which can be read by Graphviz. In particular, I used GVedit (with layout engine = circo) to generate the diagram depicted above.

Remark: these snippets of code worked with Python 2.5.2. I can not guarantee that they work with other versions.

### Colored Water Figures

January 1, 2009

Fotoopa is a photographer from Belgium who has shot amazing high speed photos of dancing and splashing colored water drops. Here’s an example of a droplet captured in mid-splash in vivid detail:

[ photo courtesy of fotoopa ]

Apart from the standard photographic equipment (camera, lenses, multiple flashes), fotoopa uses some high-tech hardware more commonly found on a physics research lab: lasers, LEDs, homemade electronics, power amplifiers, microcontrollers, FPGAs. It’s very, very impressive indeed.