Distance between two words

Consider a finite set \Sigma with |\Sigma| distinct elements which we will call symbols. We will call set \Sigma an alphabet. A word (also known as string) of length n over set \Sigma is formed by constructing an n-tuple of symbols from alphabet \Sigma

w = (w_1, w_2, w_3, \ldots, w_n)

where w_i \in \Sigma for all i \in \{1, 2, \ldots, n\}. A string of length n will be called an n-string. The set of all n-strings is given by the cartesian product

\Sigma^n = \Sigma \times \Sigma \times \ldots \times \Sigma = \displaystyle\prod_{i=1}^n \Sigma.

For example, in coding theory one usually works with binary words, i.e., words over alphabet \mathbb{F}_2 = \{0,1\}. The set of all binary words of length n is \mathbb{F}_2^n = \{0,1\}^n, whose cardinality is | \mathbb{F}_2^n | = 2^n.

In this post we will work only with words over the alphabet

\Sigma = \{\text{a} ,\text{b} ,\text{c} ,\text{d} ,\text{e} ,\text{f} ,\text{g} ,\text{h} ,\text{i} ,\text{j} ,\text{k} ,\text{l} ,\text{m} ,\text{n} ,\text{o} ,\text{p} ,\text{q} ,\text{r} ,\text{s} ,\text{t} ,\text{u} ,\text{v} ,\text{w} ,\text{x} ,\text{y} ,\text{z}\}

which is known as the Latin alphabet, and whose 26 symbols are known as letters. For example,  \textbf{cat} and \textbf{car} are represented by the 3-strings s_1 = (\text{c}, \text{a}, \text{t}) and s_2 = (\text{c}, \text{a}, \text{r}), respectively, where s_1, s_2 \in \Sigma^3.

__________

Hamming distance

Given two strings of equal length, the Hamming distance between them is the number of positions for which the corresponding symbols are different. In other words, the Hamming distance between two strings of equal length is the minimum number of symbol substitutions required to change one string into the other.

Let function d_{H} : \Sigma^n \times \Sigma^n \rightarrow \mathbb{Z}_0^{+} be defined by

d_{H} (v, w) = \displaystyle \sum_{i=1}^n \mu \left( v_i, w_i \right)

where function \mu : \Sigma \times \Sigma \rightarrow \{0,1\} is defined by

\mu (a, b) = \displaystyle\left\{\begin{array}{rl} 0 & \text{if} \quad{} a = b\\ 1 & \text{if} \quad{} a \neq b\\ \end{array}\right.

then d_H (s_1, s_2) is the Hamming distance between n-strings s_1 and s_2. For example, if s_1 = (\text{c}, \text{a}, \text{t}) and s_2 = (\text{c}, \text{a}, \text{r}), then we have d_H (s_1, s_2) = 1 because s_1 and s_2 only differ in one symbol (the last one).

Suppose we are given an n-string over the Latin alphabet \Sigma. If we replace the letter at position i with another letter in \Sigma, where i \in \{1, 2, \ldots, n\}, we can obtain a total of |\Sigma| - 1 new n-strings, each within Hamming distance 1 of the original n-string. Let us call these n-strings neighbors of the original n-string. Thus, an n-string has a total of (|\Sigma| - 1) n neighbors.

__________

Word graphs

If two n-strings are neighbors of one another when their Hamming distance is equal to 1, then a graphical approach would be the natural way to go. Let G = (N,E) be an undirected graph whose set of nodes is N = \Sigma^n (set of all n-strings over alphabet \Sigma), and whose set of edges is E \subset \Sigma^n \times \Sigma^n. Two nodes are connected by an edge if their Hamming distance is equal to 1. Hence the total number of nodes in G is |N| = |\Sigma^n| = 26^n, and the total number of edges in G is

|E| = \displaystyle\frac{n | \Sigma^n|}{2} \left(| \Sigma| - 1\right) = \displaystyle\frac{n}{2} | \Sigma|^n\left(| \Sigma| - 1\right),

because each node has (|\Sigma| - 1) n neighbors, there are |\Sigma^n| nodes, and each edge is counted twice so we need to divide the total by 2. Note that G is a regular graph (of degree |\Sigma| - 1) because all nodes have the same number of neighbors.

For example, let n=3 and let us consider the 3-string (\text{c}, \text{a}, \text{r}), which has (|\Sigma| - 1) n = 75 neighbors. Out of these 75 words, I could count 21 words which actually mean something in the English language:

bar, ear, far, gar, jar, mar, oar, par, tar, war, yar, cor, cur, cab, cad, cam, can, cap, cat, caw, cay.

Please let me know if I forgot to include any word in this list. These words form a sub-graph of the graph of all 3-strings. Note that the words

bar, car, ear, far, gar, jar, mar, oar, par, tar, war, yar

are all neighbors of one another, and hence, they form a clique of size 12. The words

car, cor, cur

form a clique of size 3. The words

cab, cad, cam, can, cap, car, cat, caw, cay

are all neighbors of one another as well, and thus they form a clique of size 9. The subgraph formed by the word “car” and its neighbors which make sense in the English language is illustrated below, where the three cliques are easy to identify

Undirected Word Graph with 3 Cliques

[ graph generated with Graphviz ]

It is not practical to depict the whole graph of 3-strings over alphabet \Sigma since it contains |\Sigma^3| = |\Sigma|^3 = 26^3 = 17576 nodes and |E| = 659100 edges!

__________

Python code

In case you are wondering: no, I did not type the 26 symbols of the Latin alphabet in \LaTeX. That would have been way too cumbersome. All it took was one single line of Python code:


", ".join([("\text{" + chr(i) + "}") for i in range(ord('a'),ord('z')+1)])

Suppose we would like to find all strings within Hamming distance of 1 of string (\text{c}, \text{a}, \text{t}). The following Python script automates the process:

def MutateString(s, alphabet, index):
    li = []
    for ch in alphabet:
        if ch != s[index]:
            li.append(s[:index] + ch + s[index+1:])
    return li

# build Latin alphabet (lowercase symbols only)
Sigma = [chr(i) for i in range(ord('a'),ord('z')+1)]

# obtain list of mutated strings
print "Mutate 1st symbol:\n" + ", ".join(MutateString('cat', Sigma, 0)) + '\n'
print "Mutate 2nd symbol:\n" + ", ".join(MutateString('cat', Sigma, 1)) + '\n'
print "Mutate 3rd symbol:\n" + ", ".join(MutateString('cat', Sigma, 2)) + '\n'

The graph of the word “car” and its neighbors was generated also by a Python script using the GvGen library.

def HammingDistance(s1, s2):
    """Computes the Hamming distance between strings s1 and s2"""
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

import gvgen

# list of words
words = ['car', 'bar', 'ear', 'far', 'gar', 'jar', 'mar', 'oar', 'par', 'tar', 'war', 'yar', 'cor', 'cur', 'cab', 'cad', 'cam', 'can', 'cap', 'cat', 'caw', 'cay']

# creates the new graph instance
graph = gvgen.GvGen()
graph.smart_mode = 1

# creates graph nodes
nodes = []
for elem in words:
    nodes.append(graph.newItem(elem))

# creates graph edges
for i in nodes:
    for j in nodes:

        # extracts the labels of each node
        node1 = i['properties']['label']
        node2 = j['properties']['label']

        # links nodes i and j if the Hamming
        # distance between their labels is 1
        if HammingDistance(node1,node2) == 1:
            graph.newLink(i,j)

# outputs the Graphviz code to a .dot file
filename = "test.dot"
FileOut = open(filename,"w")
graph.dot(FileOut)
FileOut.close()

The script outputs a DOT file which can be read by Graphviz. In particular, I used GVedit (with layout engine = circo) to generate the diagram depicted above.

Remark: these snippets of code worked with Python 2.5.2. I can not guarantee that they work with other versions.

Tags: , , , ,

14 Responses to “Distance between two words”

  1. efrique Says:

    Off the top of my head – “gar”.

    ( see http://en.wikipedia.org/wiki/Gar )

  2. efrique Says:

    ‘cur’ is also a neighbor of ‘car’.

    You’ll need a third clique.

  3. Rod Carvalho Says:

    @ efrique

    Thanks for the input! I had no idea that gar and cur existed. Using one of the Python scripts above, one can generate a list of all the neighbors of the word car. Here they are:

    Mutate 1st symbol:
    aar, bar, dar, ear, far, gar, har, iar, jar, kar, lar, mar, nar, oar, par, qar, rar, sar, tar, uar, var, war, xar, yar, zar
    
    Mutate 2nd symbol:
    cbr, ccr, cdr, cer, cfr, cgr, chr, cir, cjr, ckr, clr, cmr, cnr, cor, cpr, cqr, crr, csr, ctr, cur, cvr, cwr, cxr, cyr, czr
    
    Mutate 3rd symbol:
    caa, cab, cac, cad, cae, caf, cag, cah, cai, caj, cak, cal, cam, can, cao, cap, caq, cas, cat, cau, cav, caw, cax, cay, caz
    

    I have updated the post’s text, the Python script which generates the DOT file, and the graph’s diagram. Hopefully, there aren’t any more words unaccounted for!

  4. Rod Carvalho Says:

    More words that had been forgotten: mar, cer, cad, cap. Post, script and graph have been updated once again! If you guys spot any more words missing, please let me know.

  5. GregB Says:

    Chambers dictionary lists 15 three letter words having the pattern ?ar, namely:
    bar, car, ear, far, gar
    jar, lar, mar, oar, par,
    sa’r, tar, var, war, yar;
    three having the pattern c?r: car, cor, cur,
    and 11 having the pattern ca?, namely:
    caa’, cab, cad, cam, can,
    cap, car, ca’s, cat, caw, cay.

    It doesn’t list cer, which I can only find as an abbreviation of Conditioned Emotional Response.
    Chambers is the crossword solvers dictionary, and has an easy online interface (http://www.chambersharrap.co.uk/chambers/puzzles/word_wizards/wwizards.py/main ). I use Firefox’s keyword functionality to automatically submit words/word-patterns to it (i.e. I type ‘chambers ca?’ in the URL bar to automatically search for all three letters beginning ca) so you may be able to automate checking for words from your Python scripts.

  6. Rod Carvalho Says:

    GregB,

    Thanks for the new list of words! I think I did find “cer” in some online dictionary and that it meant something in Old English. However, I can’t find it right now in other online dictionaries, so I will remove it from the list since abbreviations don’t count.

    Given your new list, there should be a clique “car“, “cor“, “cur“. However, I could not find these words on the dictionary “var” (value-at-risk?), “sa’r” (synthetic aperture radar?), “caa’“, “ca’s“. Where did you find them? I will update the graph as soon as I have a confirmation that these words do indeed exist.

  7. GregB Says:

    Unfortunately I’m in Malta at the moment so don’t have access to Chambers to check out those words. They may indeed be abbreviations. However, var is a unit of electrical power.

    Sorting out what is a word and what isn’t a word is a remarkably difficult task — one of the things I’ve discovered from doing cryptic crosswords is that the most unlikely looking things can turn out to be real words, albeit never used outside of crosswords!

  8. Rod Carvalho Says:

    @ GregB

    I do wholeheartedly agree that sorting out what is and what is not a word is difficult! Some rules are needed. In my book, abbreviations don’t count as words. I don’t consider “var” a word since it’s kind of an abbreviation. Super-technical terms are out, too. Old English is OK just as long as it does not sound too awkward.

    From you list, I added four new words to my list: “yar“, “cor“, “caw“, “cay“. I updated the post, the graph, and the 3rd Python script. The clique sizes are now 12, 3 and 9.

  9. Rod Carvalho Says:

    Note that the Latin alphabet could also have been built using the following lines of code:

    import string
    Sigma = [ch for ch in string.ascii_lowercase]
    
  10. Howard Yeend Says:

    Levenshtein distance would yield an even bigger graph :) and be much harder to compute.

    You could modify your python to do the whole job though; input string length n and have it generate all candidate strings and filter by a dictionary file and generate the graph. (obviously there are optimisations; you wouldn’t want the entire set sitting in memory before you filtered it, you’d filter as you went along. In fact that may be quite an interesting excercise in writing optimised code)

    • Rod Carvalho Says:

      Unsurprisingly, my ultimate goal is to compute the Levenshtein distance, and automatically plot the corresponding graphs using GraphViz. However, since the graphs would be much, much bigger, visualizing them would also be much, much harder.

      In fact, what got me interested in the Levenshtein distance was the following. Suppose you want to search for info on the Levenshtein distance, and that you google Levenshtien. Google will most likely realize that you made a typo and point you to Levenshtein. That rather primitive form of AI is interesting.

      • Howard Yeend Says:

        Thanks for the reply. Your blog is very impressive.

        As an aside, I read once about using Levenshtein on a metaphone as well as on the raw word itself to produce better quality typo correction.

  11. Scott S. McCoy Says:

    An interesting analysis of text-distance matching algorithms has been done by Cohen et al. of Carnegie Mellon University.

    Their research suggests that the Jaro-Winkler text-distance algorithm is more practical than the hamming distance algorithm for most real-world applications measuring edit-distance, in part due to its ability to reasonably measure text-distance of words of variable lengths.

    There are a great number of alternative text-distance algorithms. It would be nice to see more empirical research on the subject.

  12. Matt Williamson Says:

    >>> import string
    >>> string.lowercase
    ‘abcdefghijklmnopqrstuvwxyz’

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 47 other followers