Consider a finite set with
distinct elements which we will call symbols. We will call set
an alphabet. A word (also known as string) of length
over set
is formed by constructing an
-tuple of symbols from alphabet
where for all
. A string of length
will be called an
-string. The set of all
-strings is given by the cartesian product
.
For example, in coding theory one usually works with binary words, i.e., words over alphabet . The set of all binary words of length
is
, whose cardinality is
.
In this post we will work only with words over the alphabet
which is known as the Latin alphabet, and whose 26 symbols are known as letters. For example, and
are represented by the 3-strings
and
, respectively, where
.
__________
Hamming distance
Given two strings of equal length, the Hamming distance between them is the number of positions for which the corresponding symbols are different. In other words, the Hamming distance between two strings of equal length is the minimum number of symbol substitutions required to change one string into the other.
Let function be defined by
where function is defined by
then is the Hamming distance between
-strings
and
. For example, if
and
, then we have
because
and
only differ in one symbol (the last one).
Suppose we are given an -string over the Latin alphabet
. If we replace the letter at position
with another letter in
, where
, we can obtain a total of
new
-strings, each within Hamming distance
of the original
-string. Let us call these
-strings neighbors of the original
-string. Thus, an
-string has a total of
neighbors.
__________
Word graphs
If two -strings are neighbors of one another when their Hamming distance is equal to
, then a graphical approach would be the natural way to go. Let
be an undirected graph whose set of nodes is
(set of all
-strings over alphabet
), and whose set of edges is
. Two nodes are connected by an edge if their Hamming distance is equal to
. Hence the total number of nodes in
is
, and the total number of edges in
is
,
because each node has neighbors, there are
nodes, and each edge is counted twice so we need to divide the total by
. Note that
is a regular graph (of degree
) because all nodes have the same number of neighbors.
For example, let and let us consider the
-string
, which has
neighbors. Out of these
words, I could count
words which actually mean something in the English language:
bar, ear, far, gar, jar, mar, oar, par, tar, war, yar, cor, cur, cab, cad, cam, can, cap, cat, caw, cay.
Please let me know if I forgot to include any word in this list. These words form a sub-graph of the graph of all -strings. Note that the words
bar, car, ear, far, gar, jar, mar, oar, par, tar, war, yar
are all neighbors of one another, and hence, they form a clique of size . The words
car, cor, cur
form a clique of size . The words
cab, cad, cam, can, cap, car, cat, caw, cay
are all neighbors of one another as well, and thus they form a clique of size . The subgraph formed by the word “car” and its neighbors which make sense in the English language is illustrated below, where the three cliques are easy to identify
[ graph generated with Graphviz ]
It is not practical to depict the whole graph of -strings over alphabet
since it contains
nodes and
edges!
__________
Python code
In case you are wondering: no, I did not type the 26 symbols of the Latin alphabet in . That would have been way too cumbersome. All it took was one single line of Python code:
", ".join([("\text{" + chr(i) + "}") for i in range(ord('a'),ord('z')+1)])
Suppose we would like to find all strings within Hamming distance of of string
. The following Python script automates the process:
def MutateString(s, alphabet, index):
li = []
for ch in alphabet:
if ch != s[index]:
li.append(s[:index] + ch + s[index+1:])
return li
# build Latin alphabet (lowercase symbols only)
Sigma = [chr(i) for i in range(ord('a'),ord('z')+1)]
# obtain list of mutated strings
print "Mutate 1st symbol:\n" + ", ".join(MutateString('cat', Sigma, 0)) + '\n'
print "Mutate 2nd symbol:\n" + ", ".join(MutateString('cat', Sigma, 1)) + '\n'
print "Mutate 3rd symbol:\n" + ", ".join(MutateString('cat', Sigma, 2)) + '\n'
The graph of the word “car” and its neighbors was generated also by a Python script using the GvGen library.
def HammingDistance(s1, s2):
"""Computes the Hamming distance between strings s1 and s2"""
assert len(s1) == len(s2)
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
import gvgen
# list of words
words = ['car', 'bar', 'ear', 'far', 'gar', 'jar', 'mar', 'oar', 'par', 'tar', 'war', 'yar', 'cor', 'cur', 'cab', 'cad', 'cam', 'can', 'cap', 'cat', 'caw', 'cay']
# creates the new graph instance
graph = gvgen.GvGen()
graph.smart_mode = 1
# creates graph nodes
nodes = []
for elem in words:
nodes.append(graph.newItem(elem))
# creates graph edges
for i in nodes:
for j in nodes:
# extracts the labels of each node
node1 = i['properties']['label']
node2 = j['properties']['label']
# links nodes i and j if the Hamming
# distance between their labels is 1
if HammingDistance(node1,node2) == 1:
graph.newLink(i,j)
# outputs the Graphviz code to a .dot file
filename = "test.dot"
FileOut = open(filename,"w")
graph.dot(FileOut)
FileOut.close()
The script outputs a DOT file which can be read by Graphviz. In particular, I used GVedit (with layout engine = circo) to generate the diagram depicted above.
Remark: these snippets of code worked with Python 2.5.2. I can not guarantee that they work with other versions.

