Archive for January, 2009

Distance between two words

January 31, 2009

Consider a finite set \Sigma with |\Sigma| distinct elements which we will call symbols. We will call set \Sigma an alphabet. A word (also known as string) of length n over set \Sigma is formed by constructing an n-tuple of symbols from alphabet \Sigma

w = (w_1, w_2, w_3, \ldots, w_n)

where w_i \in \Sigma for all i \in \{1, 2, \ldots, n\}. A string of length n will be called an n-string. The set of all n-strings is given by the cartesian product

\Sigma^n = \Sigma \times \Sigma \times \ldots \times \Sigma = \displaystyle\prod_{i=1}^n \Sigma.

For example, in coding theory one usually works with binary words, i.e., words over alphabet \mathbb{F}_2 = \{0,1\}. The set of all binary words of length n is \mathbb{F}_2^n = \{0,1\}^n, whose cardinality is | \mathbb{F}_2^n | = 2^n.

In this post we will work only with words over the alphabet

\Sigma = \{\text{a} ,\text{b} ,\text{c} ,\text{d} ,\text{e} ,\text{f} ,\text{g} ,\text{h} ,\text{i} ,\text{j} ,\text{k} ,\text{l} ,\text{m} ,\text{n} ,\text{o} ,\text{p} ,\text{q} ,\text{r} ,\text{s} ,\text{t} ,\text{u} ,\text{v} ,\text{w} ,\text{x} ,\text{y} ,\text{z}\}

which is known as the Latin alphabet, and whose 26 symbols are known as letters. For example,  \textbf{cat} and \textbf{car} are represented by the 3-strings s_1 = (\text{c}, \text{a}, \text{t}) and s_2 = (\text{c}, \text{a}, \text{r}), respectively, where s_1, s_2 \in \Sigma^3.

__________

Hamming distance

Given two strings of equal length, the Hamming distance between them is the number of positions for which the corresponding symbols are different. In other words, the Hamming distance between two strings of equal length is the minimum number of symbol substitutions required to change one string into the other.

Let function d_{H} : \Sigma^n \times \Sigma^n \rightarrow \mathbb{Z}_0^{+} be defined by

d_{H} (v, w) = \displaystyle \sum_{i=1}^n \mu \left( v_i, w_i \right)

where function \mu : \Sigma \times \Sigma \rightarrow \{0,1\} is defined by

\mu (a, b) = \displaystyle\left\{\begin{array}{rl} 0 & \text{if} \quad{} a = b\\ 1 & \text{if} \quad{} a \neq b\\ \end{array}\right.

then d_H (s_1, s_2) is the Hamming distance between n-strings s_1 and s_2. For example, if s_1 = (\text{c}, \text{a}, \text{t}) and s_2 = (\text{c}, \text{a}, \text{r}), then we have d_H (s_1, s_2) = 1 because s_1 and s_2 only differ in one symbol (the last one).

Suppose we are given an n-string over the Latin alphabet \Sigma. If we replace the letter at position i with another letter in \Sigma, where i \in \{1, 2, \ldots, n\}, we can obtain a total of |\Sigma| - 1 new n-strings, each within Hamming distance 1 of the original n-string. Let us call these n-strings neighbors of the original n-string. Thus, an n-string has a total of (|\Sigma| - 1) n neighbors.

__________

Word graphs

If two n-strings are neighbors of one another when their Hamming distance is equal to 1, then a graphical approach would be the natural way to go. Let G = (N,E) be an undirected graph whose set of nodes is N = \Sigma^n (set of all n-strings over alphabet \Sigma), and whose set of edges is E \subset \Sigma^n \times \Sigma^n. Two nodes are connected by an edge if their Hamming distance is equal to 1. Hence the total number of nodes in G is |N| = |\Sigma^n| = 26^n, and the total number of edges in G is

|E| = \displaystyle\frac{n | \Sigma^n|}{2} \left(| \Sigma| - 1\right) = \displaystyle\frac{n}{2} | \Sigma|^n\left(| \Sigma| - 1\right),

because each node has (|\Sigma| - 1) n neighbors, there are |\Sigma^n| nodes, and each edge is counted twice so we need to divide the total by 2. Note that G is a regular graph (of degree |\Sigma| - 1) because all nodes have the same number of neighbors.

For example, let n=3 and let us consider the 3-string (\text{c}, \text{a}, \text{r}), which has (|\Sigma| - 1) n = 75 neighbors. Out of these 75 words, I could count 21 words which actually mean something in the English language:

bar, ear, far, gar, jar, mar, oar, par, tar, war, yar, cor, cur, cab, cad, cam, can, cap, cat, caw, cay.

Please let me know if I forgot to include any word in this list. These words form a sub-graph of the graph of all 3-strings. Note that the words

bar, car, ear, far, gar, jar, mar, oar, par, tar, war, yar

are all neighbors of one another, and hence, they form a clique of size 12. The words

car, cor, cur

form a clique of size 3. The words

cab, cad, cam, can, cap, car, cat, caw, cay

are all neighbors of one another as well, and thus they form a clique of size 9. The subgraph formed by the word “car” and its neighbors which make sense in the English language is illustrated below, where the three cliques are easy to identify

Undirected Word Graph with 3 Cliques

[ graph generated with Graphviz ]

It is not practical to depict the whole graph of 3-strings over alphabet \Sigma since it contains |\Sigma^3| = |\Sigma|^3 = 26^3 = 17576 nodes and |E| = 659100 edges!

__________

Python code

In case you are wondering: no, I did not type the 26 symbols of the Latin alphabet in \LaTeX. That would have been way too cumbersome. All it took was one single line of Python code:


", ".join([("\text{" + chr(i) + "}") for i in range(ord('a'),ord('z')+1)])

Suppose we would like to find all strings within Hamming distance of 1 of string (\text{c}, \text{a}, \text{t}). The following Python script automates the process:

def MutateString(s, alphabet, index):
    li = []
    for ch in alphabet:
        if ch != s[index]:
            li.append(s[:index] + ch + s[index+1:])
    return li

# build Latin alphabet (lowercase symbols only)
Sigma = [chr(i) for i in range(ord('a'),ord('z')+1)]

# obtain list of mutated strings
print "Mutate 1st symbol:\n" + ", ".join(MutateString('cat', Sigma, 0)) + '\n'
print "Mutate 2nd symbol:\n" + ", ".join(MutateString('cat', Sigma, 1)) + '\n'
print "Mutate 3rd symbol:\n" + ", ".join(MutateString('cat', Sigma, 2)) + '\n'

The graph of the word “car” and its neighbors was generated also by a Python script using the GvGen library.

def HammingDistance(s1, s2):
    """Computes the Hamming distance between strings s1 and s2"""
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

import gvgen

# list of words
words = ['car', 'bar', 'ear', 'far', 'gar', 'jar', 'mar', 'oar', 'par', 'tar', 'war', 'yar', 'cor', 'cur', 'cab', 'cad', 'cam', 'can', 'cap', 'cat', 'caw', 'cay']

# creates the new graph instance
graph = gvgen.GvGen()
graph.smart_mode = 1

# creates graph nodes
nodes = []
for elem in words:
    nodes.append(graph.newItem(elem))

# creates graph edges
for i in nodes:
    for j in nodes:

        # extracts the labels of each node
        node1 = i['properties']['label']
        node2 = j['properties']['label']

        # links nodes i and j if the Hamming
        # distance between their labels is 1
        if HammingDistance(node1,node2) == 1:
            graph.newLink(i,j)

# outputs the Graphviz code to a .dot file
filename = "test.dot"
FileOut = open(filename,"w")
graph.dot(FileOut)
FileOut.close()

The script outputs a DOT file which can be read by Graphviz. In particular, I used GVedit (with layout engine = circo) to generate the diagram depicted above.

Remark: these snippets of code worked with Python 2.5.2. I can not guarantee that they work with other versions.

Colored Water Figures

January 1, 2009

Fotoopa is a photographer from Belgium who has shot amazing high speed photos of dancing and splashing colored water drops. Here’s an example of a droplet captured in mid-splash in vivid detail:

Colored Waterfigure - by Fotoopa

[ photo courtesy of fotoopa ]

Apart from the standard photographic equipment (camera, lenses, multiple flashes), fotoopa uses some high-tech hardware more commonly found on a physics research lab: lasers, LEDs, homemade electronics, power amplifiers, microcontrollers, FPGAs. It’s very, very impressive indeed.


Follow

Get every new post delivered to your Inbox.

Join 47 other followers