Consider a finite set with
distinct elements which we will call symbols. We will call set
an alphabet. A word (also known as string) of length
over set
is formed by constructing an
-tuple of symbols from alphabet
where for all
. A string of length
will be called an
-string. The set of all
-strings is given by the cartesian product
.
For example, in coding theory one usually works with binary words, i.e., words over alphabet . The set of all binary words of length
is
, whose cardinality is
.
In this post we will work only with words over the alphabet
which is known as the Latin alphabet, and whose 26 symbols are known as letters. For example, and
are represented by the 3-strings
and
, respectively, where
.
__________
Hamming distance
Given two strings of equal length, the Hamming distance between them is the number of positions for which the corresponding symbols are different. In other words, the Hamming distance between two strings of equal length is the minimum number of symbol substitutions required to change one string into the other.
Let function be defined by
where function is defined by
then is the Hamming distance between
-strings
and
. For example, if
and
, then we have
because
and
only differ in one symbol (the last one).
Suppose we are given an -string over the Latin alphabet
. If we replace the letter at position
with another letter in
, where
, we can obtain a total of
new
-strings, each within Hamming distance
of the original
-string. Let us call these
-strings neighbors of the original
-string. Thus, an
-string has a total of
neighbors.
__________
Word graphs
If two -strings are neighbors of one another when their Hamming distance is equal to
, then a graphical approach would be the natural way to go. Let
be an undirected graph whose set of nodes is
(set of all
-strings over alphabet
), and whose set of edges is
. Two nodes are connected by an edge if their Hamming distance is equal to
. Hence the total number of nodes in
is
, and the total number of edges in
is
,
because each node has neighbors, there are
nodes, and each edge is counted twice so we need to divide the total by
. Note that
is a regular graph (of degree
) because all nodes have the same number of neighbors.
For example, let and let us consider the
-string
, which has
neighbors. Out of these
words, I could count
words which actually mean something in the English language:
bar, ear, far, gar, jar, mar, oar, par, tar, war, yar, cor, cur, cab, cad, cam, can, cap, cat, caw, cay.
Please let me know if I forgot to include any word in this list. These words form a sub-graph of the graph of all -strings. Note that the words
bar, car, ear, far, gar, jar, mar, oar, par, tar, war, yar
are all neighbors of one another, and hence, they form a clique of size . The words
car, cor, cur
form a clique of size . The words
cab, cad, cam, can, cap, car, cat, caw, cay
are all neighbors of one another as well, and thus they form a clique of size . The subgraph formed by the word “car” and its neighbors which make sense in the English language is illustrated below, where the three cliques are easy to identify
[ graph generated with Graphviz ]
It is not practical to depict the whole graph of -strings over alphabet
since it contains
nodes and
edges!
__________
Python code
In case you are wondering: no, I did not type the 26 symbols of the Latin alphabet in . That would have been way too cumbersome. All it took was one single line of Python code:
", ".join([("\text{" + chr(i) + "}") for i in range(ord('a'),ord('z')+1)])
Suppose we would like to find all strings within Hamming distance of of string
. The following Python script automates the process:
def MutateString(s, alphabet, index):
li = []
for ch in alphabet:
if ch != s[index]:
li.append(s[:index] + ch + s[index+1:])
return li
# build Latin alphabet (lowercase symbols only)
Sigma = [chr(i) for i in range(ord('a'),ord('z')+1)]
# obtain list of mutated strings
print "Mutate 1st symbol:\n" + ", ".join(MutateString('cat', Sigma, 0)) + '\n'
print "Mutate 2nd symbol:\n" + ", ".join(MutateString('cat', Sigma, 1)) + '\n'
print "Mutate 3rd symbol:\n" + ", ".join(MutateString('cat', Sigma, 2)) + '\n'
The graph of the word “car” and its neighbors was generated also by a Python script using the GvGen library.
def HammingDistance(s1, s2):
"""Computes the Hamming distance between strings s1 and s2"""
assert len(s1) == len(s2)
return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))
import gvgen
# list of words
words = ['car', 'bar', 'ear', 'far', 'gar', 'jar', 'mar', 'oar', 'par', 'tar', 'war', 'yar', 'cor', 'cur', 'cab', 'cad', 'cam', 'can', 'cap', 'cat', 'caw', 'cay']
# creates the new graph instance
graph = gvgen.GvGen()
graph.smart_mode = 1
# creates graph nodes
nodes = []
for elem in words:
nodes.append(graph.newItem(elem))
# creates graph edges
for i in nodes:
for j in nodes:
# extracts the labels of each node
node1 = i['properties']['label']
node2 = j['properties']['label']
# links nodes i and j if the Hamming
# distance between their labels is 1
if HammingDistance(node1,node2) == 1:
graph.newLink(i,j)
# outputs the Graphviz code to a .dot file
filename = "test.dot"
FileOut = open(filename,"w")
graph.dot(FileOut)
FileOut.close()
The script outputs a DOT file which can be read by Graphviz. In particular, I used GVedit (with layout engine = circo) to generate the diagram depicted above.
Remark: these snippets of code worked with Python 2.5.2. I can not guarantee that they work with other versions.
Tags: Graphs, Graphviz, GvGen, Hamming distance, Python

February 1, 2009 at 07:29 |
Off the top of my head – “gar”.
( see http://en.wikipedia.org/wiki/Gar )
February 1, 2009 at 07:33 |
‘cur’ is also a neighbor of ‘car’.
You’ll need a third clique.
February 1, 2009 at 11:56 |
@ efrique
Thanks for the input! I had no idea that gar and cur existed. Using one of the Python scripts above, one can generate a list of all the neighbors of the word car. Here they are:
I have updated the post’s text, the Python script which generates the DOT file, and the graph’s diagram. Hopefully, there aren’t any more words unaccounted for!
February 1, 2009 at 12:36 |
More words that had been forgotten: mar, cer, cad, cap. Post, script and graph have been updated once again! If you guys spot any more words missing, please let me know.
February 10, 2009 at 05:54 |
Chambers dictionary lists 15 three letter words having the pattern ?ar, namely:
bar, car, ear, far, gar
jar, lar, mar, oar, par,
sa’r, tar, var, war, yar;
three having the pattern c?r: car, cor, cur,
and 11 having the pattern ca?, namely:
caa’, cab, cad, cam, can,
cap, car, ca’s, cat, caw, cay.
It doesn’t list cer, which I can only find as an abbreviation of Conditioned Emotional Response.
Chambers is the crossword solvers dictionary, and has an easy online interface (http://www.chambersharrap.co.uk/chambers/puzzles/word_wizards/wwizards.py/main ). I use Firefox’s keyword functionality to automatically submit words/word-patterns to it (i.e. I type ‘chambers ca?’ in the URL bar to automatically search for all three letters beginning ca) so you may be able to automate checking for words from your Python scripts.
February 11, 2009 at 19:28 |
GregB,
Thanks for the new list of words! I think I did find “cer” in some online dictionary and that it meant something in Old English. However, I can’t find it right now in other online dictionaries, so I will remove it from the list since abbreviations don’t count.
Given your new list, there should be a clique “car“, “cor“, “cur“. However, I could not find these words on the dictionary “var” (value-at-risk?), “sa’r” (synthetic aperture radar?), “caa’“, “ca’s“. Where did you find them? I will update the graph as soon as I have a confirmation that these words do indeed exist.
February 12, 2009 at 00:44 |
Unfortunately I’m in Malta at the moment so don’t have access to Chambers to check out those words. They may indeed be abbreviations. However, var is a unit of electrical power.
Sorting out what is a word and what isn’t a word is a remarkably difficult task — one of the things I’ve discovered from doing cryptic crosswords is that the most unlikely looking things can turn out to be real words, albeit never used outside of crosswords!
February 14, 2009 at 15:53 |
@ GregB
I do wholeheartedly agree that sorting out what is and what is not a word is difficult! Some rules are needed. In my book, abbreviations don’t count as words. I don’t consider “var” a word since it’s kind of an abbreviation. Super-technical terms are out, too. Old English is OK just as long as it does not sound too awkward.
From you list, I added four new words to my list: “yar“, “cor“, “caw“, “cay“. I updated the post, the graph, and the 3rd Python script. The clique sizes are now 12, 3 and 9.
February 25, 2009 at 18:44 |
Note that the Latin alphabet could also have been built using the following lines of code:
June 30, 2010 at 23:38 |
Levenshtein distance would yield an even bigger graph :) and be much harder to compute.
You could modify your python to do the whole job though; input string length n and have it generate all candidate strings and filter by a dictionary file and generate the graph. (obviously there are optimisations; you wouldn’t want the entire set sitting in memory before you filtered it, you’d filter as you went along. In fact that may be quite an interesting excercise in writing optimised code)
July 1, 2010 at 22:21 |
Unsurprisingly, my ultimate goal is to compute the Levenshtein distance, and automatically plot the corresponding graphs using GraphViz. However, since the graphs would be much, much bigger, visualizing them would also be much, much harder.
In fact, what got me interested in the Levenshtein distance was the following. Suppose you want to search for info on the Levenshtein distance, and that you google Levenshtien. Google will most likely realize that you made a typo and point you to Levenshtein. That rather primitive form of AI is interesting.
July 2, 2010 at 12:00 |
Thanks for the reply. Your blog is very impressive.
As an aside, I read once about using Levenshtein on a metaphone as well as on the raw word itself to produce better quality typo correction.
July 26, 2010 at 16:44 |
An interesting analysis of text-distance matching algorithms has been done by Cohen et al. of Carnegie Mellon University.
Their research suggests that the Jaro-Winkler text-distance algorithm is more practical than the hamming distance algorithm for most real-world applications measuring edit-distance, in part due to its ability to reasonably measure text-distance of words of variable lengths.
There are a great number of alternative text-distance algorithms. It would be nice to see more empirical research on the subject.
January 13, 2012 at 09:00 |
>>> import string
>>> string.lowercase
‘abcdefghijklmnopqrstuvwxyz’