Posts Tagged ‘Python’

Experiments with the Python Imaging Library

July 14, 2012

Back in February 2009, I spent some time playing with the Python Imaging Library (PIL). I even started a blog post on my experiments with PIL, but I never finished it. I will finish it now. Be warned: this post will have a very experimental flavor.

Since we are all acquainted with Lena, as a test image I will use the following photograph of the late Ruslana Korshunova (1987-2008):

The flowers and the hair create plenty of high-frequency content. Let us now use the Python Imaging Library (PIL) to convert the image to B&W, invert the image, and apply several elementary filters to it.

__________

Python script

The following Python script opens the image above and performs the aforementioned operations on it:

import Image
import ImageChops
import ImageFilter

# open image and print format
Im = Image.open('Ruslana.jpg')
print Im.format, Im.size, Im.mode

# convert to black & white
ImBW = Im.convert('1')
ImBW.save('Ruslana_BW.bmp',"BMP")

# invert image (obtain negative)
ImInv = ImageChops.invert(Im)
ImInv.save('Ruslana_Inv.jpg',"JPEG")

# apply BLUR filter
ImBlur = Im.filter(ImageFilter.BLUR)
ImBlur.save('Ruslana_BLUR.jpg',"JPEG")

# apply CONTOUR filter
ImContour = Im.filter(ImageFilter.CONTOUR)
ImContour.save('Ruslana_CONTOUR.jpg',"JPEG")

# apply DETAIL filter
ImDetail = Im.filter(ImageFilter.DETAIL)
ImDetail.save('Ruslana_DETAIL.jpg',"JPEG")

# apply EDGE_ENHANCE filter
ImEH = Im.filter(ImageFilter.EDGE_ENHANCE)
ImEH.save('Ruslana_EDGE_ENHANCE.jpg',"JPEG")

# apply EDGE_ENHANCE_MORE filter
ImEHM = Im.filter(ImageFilter.EDGE_ENHANCE_MORE)
ImEHM.save('Ruslana_EHM.jpg',"JPEG")

# apply EMBOSS filter
ImEmb = Im.filter(ImageFilter.EMBOSS)
ImEmb.save('Ruslana_EMBOSS.jpg',"JPEG")

# apply FIND_EDGES filter
ImEdges = Im.filter(ImageFilter.FIND_EDGES)
ImEdges = ImEdges.save('Ruslana_FIND_EDGES.jpg',"JPEG")

# apply SMOOTH filter
ImSmooth = Im.filter(ImageFilter.SMOOTH)
ImSmooth = ImSmooth.save('Ruslana_SMOOTH.jpg',"JPEG")

# apply SMOOTH_MORE filter
ImSmoothMore = Im.filter(ImageFilter.SMOOTH_MORE)
ImSmoothMore = ImSmoothMore.save('Ruslana_SMOOTH_MORE.jpg',"JPEG")

# apply SHARPEN filter
ImSharp = Im.filter(ImageFilter.SHARPEN)
ImSharp = ImSharp.save('Ruslana_SHARPEN.jpg',"JPEG")

Let us now see how the processed images look like.

__________

Results

Here’s the B&W image:

The script produced a monochromatic (1 bit per pixel) BMP file. Since WordPress.com does not allow one to upload BMP files, I converted it to JPEG, which greatly increased the size of the file. How ironic! In this case, lossy compression actually led to expansion!

The inverted (i.e., negative) image looks quite interesting:

It looks as though the EDGE_ENHANCE_MORE filter performs some form of amplification of the high-frequency content of the image:

Look how funny the flowers and her hair look! If we want to extract the actual edges, we can use the FIND_EDGES filter:

The other images produced by the Python script are not particularly interesting and, therefore, I have not posted them here.

__________

Concluding remarks

The Python Imaging Library (PIL) offers the capability to perform some very basic image processing. However, given Python’s lack of a matrix data structure, one cannot do that much. I believe the PIL is useful to process large collections of photos (to generate thumbnails, for example), not to perform 2-dimensional signal processing.

__________

Related:

Deciding the convexity of semialgebraic sets II

June 23, 2012

Let S_3 \subset \mathbb{R}^{3 \times 3} denote the set of real symmetric 3 \times 3 matrices. We introduce a matrix-valued function A: \mathbb{R}^3 \to S_3 defined by

A (x) = \left[\begin{array}{ccc} 1 & x_1 & x_2\\ x_1 & 1 & x_3\\ x_2 & x_3 & 1\end{array}\right]

The elliptope E_3 \subset \mathbb{R}^3 is defined by

E_3 = \left\{ x \in \mathbb{R}^3 \mid A (x) \succeq 0 \right\}

where A (x) \succeq 0 means that matrix A (x) is positive semidefinite. Since the elliptope E_3 is defined by the linear matrix inequality (LMI) A (x) \succeq 0, we can conclude that E_3 is a spectrahedron and, therefore, we have that E_3 is convex.

The E_3 elliptope (also known as “inflated tetrahedron”) is the yellow part (surface and interior) of the Cayley’s cubic surface plotted below

[ source ]

__________

E_3 is a semialgebraic set

Saying that A (x) is positive semidefinite is equivalent to saying that the 2^3 - 1 = 7 principal minors of A (x) are nonnegative. If A (x) \succeq 0, then the three 1st order principal minors (which are the diagonal entries of A (x)) lead to the following three (admittedly uninteresting) inequalities

1 \geq 0,     1 \geq 0,     1 \geq 0

which we can discard. The three 2nd order principal minors (determinants of 2 \times 2 submatrices of A (x)) and the 3rd order principal minor (the determinant of A (x)) can be computed using the following Python 2.5 + SymPy script:

from sympy import *

# symbolic variables
x1 = Symbol('x1')
x2 = Symbol('x2')
x3 = Symbol('x3')

# build matrix A(x)
A = Matrix([[1,x1,x2],
            [x1,1,x3],
            [x2,x3,1]])

print "\nMatrix A(x):"
print A

print "\n2nd order principal minors:"
print A.extract([0,1],[0,1]).det()
print A.extract([0,2],[0,2]).det()
print A.extract([1,2],[1,2]).det()

print "\n3rd order principal minor:"
print A.det()

which produces the following output:

Matrix A(x):
[ 1, x1, x2]
[x1,  1, x3]
[x2, x3,  1]

2nd order principal minors:
-x1**2 + 1
-x2**2 + 1
-x3**2 + 1

3rd order principal minor:
-x1**2 + 2*x1*x2*x3 - x2**2 - x3**2 + 1

Hence, the 2nd order principal minors yield the following inequalities

1 - x_1^2 \geq 0,     1 - x_2^2 \geq 0,     1 - x_3^2 \geq 0

which define the 3-dimensional cube [-1,1]^3 \subset \mathbb{R}^3. The 3rd order principal minor (i.e., the determinant) yields the inequality

1 + 2 x_1 x_2 x_3 - (x_1^2 + x_2^2 + x_3^3) \geq 0.

We can thus conclude that E_3 is a basic closed semialgebraic set.

__________

Deciding the convexity of E_3

In my last post, I used quantifier elimination to decide the convexity of two semialgebraic sets in \mathbb{R}^2. Let us now decide the convexity of E_3, which is of higher “complexity” than the sets I considered last time. What is the goal? The goal is to find out whether REDLOG can decide the convexity of E_3 in less than 10 minutes.

Based on the scripts in my previous post, we write the following REDUCE + REDLOG script to decide the convexity of E_3:

% decide the convexity of the elliptope E3

load_package redlog;
rlset ofsf;

% define predicates in antecedent
p1 := (1-x1^2>=0) and (1-x2^2>=0) and (1-x3^2>=0) and 
      (1-x1^2-x2^2-x3^2+2*x1*x2*x3>=0);
p2 := (1-y1^2>=0) and (1-y2^2>=0) and (1-y3^2>=0) and 
      (1-y1^2-y2^2-y3^2+2*y1*y2*y3>=0);
p3 := (theta>=0) and (theta<=1);

% define predicate in the consequent
z1 := theta*x1+(1-theta)*y1;
z2 := theta*x2+(1-theta)*y2;
q  := (1-z1^2>=0) and (1-z2^2>=0) and (1-z3^2>=0) and 
      (1-z1^2-z2^2-z3^2+2*z1*z2*z3>=0);

% define universal formula
phi := all({x1,x2,x3,y1,y2,y3,theta}, (p1 and p2 and p3) impl q);

% perform quantifier elimination
rlqe phi;

end;

After 30 minutes (!!!) waiting for the decision procedure to terminate, I killed the reduce.exe process (which was eating all my machine’s CPU cycles and RAM memory). This is disappointing…

Fair play

March 31, 2012

Last February’s Ponder This challenge was the following:

To decide who would play first in a human vs. computer game, the human suggested to flip a coin it produced from its pocket, stating “heads you win, tails I win.” The computer replied, “Humans are very devious. I know that your coin is not fair, and that the probability of heads on that coin is slightly less than one half. To make this perfectly fair, flip the coin repeatedly, and I win if the coin first shows heads on toss number 1, 14, 15, 19, 20, or 23. You win if it first shows heads on any other toss.”

If the computer is correct and p is the probability that the coin shows heads, give the best rational approximation to p with a denominator having fewer than 10 digits.

Here is the solution.

__________

My solution #1:

Let \mathbb{P} (H) = p be the probability that the coin will land with the heads side up. If the coin is biased in favor of the human player, then p < 1/2. The probability that the coin toss will first produce “heads” on the n-th trial (where n \geq 1) is (1-p)^{n-1} p. Let us define \mathcal{I} := \{1, 14, 15, 19, 20, 23\}. The probability that the computer will win is thus

\displaystyle \sum_{n \in \mathcal{I}} (1-p)^{n-1} p.

We now assume that the computer knows the exact value of p and, thus, is able to compute its probability of winning. Since the computer is interested in making this game “perfectly fair”, then his probability of winning should equal the human’s probability of winning, i.e., it should equal 1/2. Hence, we obtain the equation

p + (1-p)^{13} p + (1-p)^{14} p + (1-p)^{18} p + (1-p)^{19} p +(1-p)^{22} p = \frac{1}{2}

which we can solve using the following Python + SymPy script:

from sympy import *

# create symbolic variable
p = Symbol('p')

# solve polyomial equation
f = p + p*(1-p)**13+p*(1-p)**14+p*(1-p)**18+p*(1-p)**19+p*(1-p)**22
Solutions = solve(f-0.5,p)

# find (purely) real solutions
RealSols = filter(lambda z: im(z)==0, Solutions)

# print solutions
print "\nAll solutions:"
for sol in Solutions:
    print sol

# print real solutions
print "\nReal solutions:"
for sol in RealSols:
	print sol

This script outputs the following list of solutions:

All solutions:
0.334938061741437 - 1.03291350457235*I
0.334938061741437 + 1.03291350457235*I
1.97217279050061 + 0.143849563303172*I
1.50662666735084 - 0.797180194029411*I
1.92419089868505 + 0.454390584787364*I
0.0780115260218319 + 0.464383156442821*I
0.00197767511752256 - 0.101887310197153*I
1.27721993255011 + 0.951650156864179*I
1.27721993255184 - 0.951650156863701*I
0.247036637335883 - 0.624592516289942*I
0.625456509765409 - 0.845921918004056*I
0.0780115260218319 - 0.464383156442821*I
0.00197767511752256 + 0.101887310197153*I
0.991590970537258 - 0.923668626671437*I
1.79082570892033 - 0.599036471538201*I
1.92419089870934 - 0.454390584808249*I
1.79082570891289 + 0.599036471546724*I
0.499905242928390
0.6254565097654 + 0.845921918004058*I
0.991590970537348 + 0.923668626671824*I
0.247036637335883 + 0.624592516289942*I
1.50662666734992 + 0.797180194030292*I
1.97217279048996 - 0.14384956326771*I

Real solutions:
0.499905242928390

Hence, we have that the probability of heads is

p = 0.499905242928390.

We can convert this number to a ratio of integers in Haskell using function approxRational in the library Data.Ratio:

import Data.Ratio

-- probability of heads
p = 0.499905242928390

-- create (infinite) list of epsilons
epsilons = map (10**) [-1,-2..]

-- create (infinite) list of ratios
ratios = map (approxRational p) epsilons

-- list of ratios whose denominators have less than 10 digits
ratios' = takeWhile (\x -> denominator x < 10^10) ratios

Running the script above in GHCi, we obtain a list of ratios:

*Main Data.Ratio> ratios'
[1 % 2,1 % 2,1 % 2,1 % 2,2386 % 4773,2611 % 5223,2636 % 5273,
2638 % 5277,13189 % 26383,44843 % 89703,166183 % 332429,
393036 % 786221,1405961 % 2812455,6243733 % 12489833,
14906352 % 29818355,69693988 % 139414397]

and we finally obtain the following rational approximation

p \approx \displaystyle\frac{69693988}{139414397}

whose denominator has nine digits. According to the webmaster, the rational approximation I presented above is not the best one. Hence, let us now try to obtain a better one.

__________

My solution #2:

Most likely, the reason why I did not obtain the best rational approximation is that the real solution I obtained was not precise enough. Hence, let us use the following MATLAB command to solve the polynomial equation:

solve('p + p*(1-p)^13+p*(1-p)^14+p*(1-p)^18+p*(1-p)^19+p*(1-p)^22 = 0.5')

This command produces the following list of solutions:

ans =

 .19776751175225606651411342151624e-2-.10188731019715301414082803578298*i
 .19776751175225606651411342151624e-2+.10188731019715301414082803578298*i
 .78011526021831885333432975572559e-1-.46438315644282071005250209738747*i
 .78011526021831885333432975572559e-1+.46438315644282071005250209738747*i
    .24703663733588276238861533742100-.62459251628994218281562354280746*i
    .24703663733588276238861533742100+.62459251628994218281562354280746*i
    .33493806174143698737516495548579-1.0329135045723528674007841220342*i
    .33493806174143698737516495548579+1.0329135045723528674007841220342*i
                                        .49990524292838950452962194522516
    .62545650976540488339104798827398-.84592191800405765193777444233924*i
    .62545650976540488339104798827398+.84592191800405765193777444233924*i
    .99159097053730697830960408610308-.92366862667162608535284662817351*i
    .99159097053730697830960408610308+.92366862667162608535284662817351*i
    1.2772199325509563050241902871670-.95165015686389448554291677056666*i
    1.2772199325509563050241902871670+.95165015686389448554291677056666*i
    1.5066266673486985232104204448828-.79718019402868789409470957216463*i
    1.5066266673486985232104204448828+.79718019402868789409470957216463*i
    1.7908257089204728277307542114304-.59903647154154201043833983995659*i
    1.7908257089204728277307542114304+.59903647154154201043833983995659*i
    1.9241908987061350990877558129099-.45439058479360516453156947604463*i
    1.9241908987061350990877558129099+.45439058479360516453156947604463*i
    1.9721727904901564352190617939256-.14384956327517564484908725760533*i
    1.9721727904901564352190617939256+.14384956327517564484908725760533*i

We now obtain a much more precise real solution

p =0.49990524292838950452962194522516.

We modify the Haskell script, using the new value of p and a much finer list of epsilons (the precisions):

import Data.Ratio

-- probability of heads
p = 0.49990524292838950452962194522516

-- create (infinite) list of epsilons
epsilons = map ((1.005)**) [-1,-2..]

-- create (infinite) list of ratios
ratios = map (approxRational p) epsilons

-- list of ratios whose denominators have less than 10 digits
ratios' = takeWhile (\x -> denominator x < 10^10) ratios

Which yields the following rational approximation

p \approx \displaystyle\frac{61031369}{122085875}

whose denominator has nine digits. Although this rational approximation does not match the one in the official solution, I still got my name on the “people who answered correctly” list. Surprisingly, the value of p which I obtained using MATLAB is quite different from the value of p used by the webmaster

p = 0.499905242928506776611191007553.

Which value of p is “more correct”? I do not know. Do you?

Wythoff sequences

August 21, 2011

On November 20, 2007, the Putnam problem of the day was:

The set of all positive integers is the union of two disjoint subsets \{f(1), f(2), f(3), \dots\}, \{g(1), g(2), g(3), \dots\}, where f(1) < f(2) < f(3) < \dots, and g(1) < g(2) < g(3) < \dots, and g(n) = f(f(n))+1 for n=1, 2, 3,\dots. Determine f(240).

Back in November 2007 I could not solve the problem. Now, almost four years later, I will give it another try.

__________

My “solution”:

Let \mathbb{N} := \{1, 2, 3, \dots\} be the set of natural numbers (positive integers). We define F := \{f(n)\}_{n \in \mathbb{N}} and G := \{g(n)\}_{n \in \mathbb{N}}. Note that we must have F \cup G = \mathbb{N} and F \cap G = \emptyset.

Let f(1) = 1. Then, we obtain

g(1) = f(f(1)) + 1 = f(1) + 1 = 2.

Next, let f(2) = 3 and f(3) = 4. Then,

g(2) = f(f(2)) + 1 = f(3) + 1 = 5.

Let f(4) = 6. Then, we obtain

g(3) = f(f(3)) + 1 = f(4) + 1 = 7.

Next, let f(5) = 8 and f(6) = 9. Then,

g(4) = f(f(4)) + 1 = f(6) + 1 = 10.

Next, let f(7) = 11 and f(8) = 12. Then,

g(5) = f(f(5)) + 1 = f(8) + 1 = 13.

Let f(9) = 14. Then, we obtain

g(6) = f(f(6)) + 1 = f(9) + 1 = 15.

Next, let f(10) = 16 and f(11) = 17. Then,

g(7) = f(f(7)) + 1 = f(11) + 1 = 18.

Let f(12) = 19. Then, we obtain

g(8) = f(f(8)) + 1 = f(12) + 1 = 20.

Next, let f(13) = 21 and f(14) = 22. Then,

g(9) = f(f(9)) + 1 = f(14) + 1 = 23.

Let us stop iterating here. So far, we have

\left(f(n)\right)_{n \in \mathbb{N}} = \left(1, 3, 4, 6, 8, 9, 11, 12, 14, 16, 17, 19, 21, 22, \dots\right)

\left(g(n)\right)_{n \in \mathbb{N}} = \left(2, 5, 7, 10, 13, 15, 18, 20, 23, \dots\right).

Consulting the On-Line Encyclopedia of Integer Sequences (OEIS), we find out that \left(f(n)\right)_{n \in \mathbb{N}} and \left(g(n)\right)_{n \in \mathbb{N}} are the lower Wythoff (A000201) and upper Wythoff (A001950) sequences, respectively. These two are Beatty sequences generated by the golden ratio \varphi := \frac{1 + \sqrt{5}}{2}, as follows

f (n) = \lfloor n \varphi \rfloor,     g (n) = \lfloor n \varphi^2 \rfloor

where \lfloor \cdot \rfloor is the floor function. Recall that \varphi^2 = \varphi + 1, which allows us to write g (n) = \lfloor n (\varphi + 1) \rfloor.

The following Python script produces the first twenty terms of the lower and upper Wythoff sequences:

from math import floor
from math import sqrt

# golden ratio
phi = (1 + sqrt(5)) / 2.0

# define function f
def f (n):
    return int(floor(phi * n))

# define function g
def g (n):
    return int(floor((phi+1) * n))

print "Lower and upper Wythoff sequences:\n"
for n in range(1,21):
    print "f(%d) = %d \t g(%d) = %d" % (n, f(n), n, g(n))

This script produces the output:

Lower and upper Wythoff sequences:

f(1) = 1         g(1) = 2
f(2) = 3         g(2) = 5
f(3) = 4         g(3) = 7
f(4) = 6         g(4) = 10
f(5) = 8         g(5) = 13
f(6) = 9         g(6) = 15
f(7) = 11        g(7) = 18
f(8) = 12        g(8) = 20
f(9) = 14        g(9) = 23
f(10) = 16       g(10) = 26
f(11) = 17       g(11) = 28
f(12) = 19       g(12) = 31
f(13) = 21       g(13) = 34
f(14) = 22       g(14) = 36
f(15) = 24       g(15) = 39
f(16) = 25       g(16) = 41
f(17) = 27       g(17) = 44
f(18) = 29       g(18) = 47
f(19) = 30       g(19) = 49
f(20) = 32       g(20) = 52

Finally, we have f (240) = \lfloor 240 \varphi \rfloor = 388.

__________

I am not happy with this “solution”. I obviously cheated when I consulted the OEIS. Supposing one is not acquainted with Beatty sequences, is there any hope of solving the problem? If one happens to be acquainted with Beatty sequences and postulates that

f (n) = \lfloor n r \rfloor,     g (n) = \lfloor n s \rfloor

where \frac{1}{r} + \frac{1}{s} = 1, how does one arrive at the conclusion that r = \varphi and s = \varphi^2? In other words, what would a proper solution look like?

Distance between two words

January 31, 2009

Consider a finite set \Sigma with |\Sigma| distinct elements which we will call symbols. We will call set \Sigma an alphabet. A word (also known as string) of length n over set \Sigma is formed by constructing an n-tuple of symbols from alphabet \Sigma

w = (w_1, w_2, w_3, \ldots, w_n)

where w_i \in \Sigma for all i \in \{1, 2, \ldots, n\}. A string of length n will be called an n-string. The set of all n-strings is given by the cartesian product

\Sigma^n = \Sigma \times \Sigma \times \ldots \times \Sigma = \displaystyle\prod_{i=1}^n \Sigma.

For example, in coding theory one usually works with binary words, i.e., words over alphabet \mathbb{F}_2 = \{0,1\}. The set of all binary words of length n is \mathbb{F}_2^n = \{0,1\}^n, whose cardinality is | \mathbb{F}_2^n | = 2^n.

In this post we will work only with words over the alphabet

\Sigma = \{\text{a} ,\text{b} ,\text{c} ,\text{d} ,\text{e} ,\text{f} ,\text{g} ,\text{h} ,\text{i} ,\text{j} ,\text{k} ,\text{l} ,\text{m} ,\text{n} ,\text{o} ,\text{p} ,\text{q} ,\text{r} ,\text{s} ,\text{t} ,\text{u} ,\text{v} ,\text{w} ,\text{x} ,\text{y} ,\text{z}\}

which is known as the Latin alphabet, and whose 26 symbols are known as letters. For example,  \textbf{cat} and \textbf{car} are represented by the 3-strings s_1 = (\text{c}, \text{a}, \text{t}) and s_2 = (\text{c}, \text{a}, \text{r}), respectively, where s_1, s_2 \in \Sigma^3.

__________

Hamming distance

Given two strings of equal length, the Hamming distance between them is the number of positions for which the corresponding symbols are different. In other words, the Hamming distance between two strings of equal length is the minimum number of symbol substitutions required to change one string into the other.

Let function d_{H} : \Sigma^n \times \Sigma^n \rightarrow \mathbb{Z}_0^{+} be defined by

d_{H} (v, w) = \displaystyle \sum_{i=1}^n \mu \left( v_i, w_i \right)

where function \mu : \Sigma \times \Sigma \rightarrow \{0,1\} is defined by

\mu (a, b) = \displaystyle\left\{\begin{array}{rl} 0 & \text{if} \quad{} a = b\\ 1 & \text{if} \quad{} a \neq b\\ \end{array}\right.

then d_H (s_1, s_2) is the Hamming distance between n-strings s_1 and s_2. For example, if s_1 = (\text{c}, \text{a}, \text{t}) and s_2 = (\text{c}, \text{a}, \text{r}), then we have d_H (s_1, s_2) = 1 because s_1 and s_2 only differ in one symbol (the last one).

Suppose we are given an n-string over the Latin alphabet \Sigma. If we replace the letter at position i with another letter in \Sigma, where i \in \{1, 2, \ldots, n\}, we can obtain a total of |\Sigma| - 1 new n-strings, each within Hamming distance 1 of the original n-string. Let us call these n-strings neighbors of the original n-string. Thus, an n-string has a total of (|\Sigma| - 1) n neighbors.

__________

Word graphs

If two n-strings are neighbors of one another when their Hamming distance is equal to 1, then a graphical approach would be the natural way to go. Let G = (N,E) be an undirected graph whose set of nodes is N = \Sigma^n (set of all n-strings over alphabet \Sigma), and whose set of edges is E \subset \Sigma^n \times \Sigma^n. Two nodes are connected by an edge if their Hamming distance is equal to 1. Hence the total number of nodes in G is |N| = |\Sigma^n| = 26^n, and the total number of edges in G is

|E| = \displaystyle\frac{n | \Sigma^n|}{2} \left(| \Sigma| - 1\right) = \displaystyle\frac{n}{2} | \Sigma|^n\left(| \Sigma| - 1\right),

because each node has (|\Sigma| - 1) n neighbors, there are |\Sigma^n| nodes, and each edge is counted twice so we need to divide the total by 2. Note that G is a regular graph (of degree |\Sigma| - 1) because all nodes have the same number of neighbors.

For example, let n=3 and let us consider the 3-string (\text{c}, \text{a}, \text{r}), which has (|\Sigma| - 1) n = 75 neighbors. Out of these 75 words, I could count 21 words which actually mean something in the English language:

bar, ear, far, gar, jar, mar, oar, par, tar, war, yar, cor, cur, cab, cad, cam, can, cap, cat, caw, cay.

Please let me know if I forgot to include any word in this list. These words form a sub-graph of the graph of all 3-strings. Note that the words

bar, car, ear, far, gar, jar, mar, oar, par, tar, war, yar

are all neighbors of one another, and hence, they form a clique of size 12. The words

car, cor, cur

form a clique of size 3. The words

cab, cad, cam, can, cap, car, cat, caw, cay

are all neighbors of one another as well, and thus they form a clique of size 9. The subgraph formed by the word “car” and its neighbors which make sense in the English language is illustrated below, where the three cliques are easy to identify

Undirected Word Graph with 3 Cliques

[ graph generated with Graphviz ]

It is not practical to depict the whole graph of 3-strings over alphabet \Sigma since it contains |\Sigma^3| = |\Sigma|^3 = 26^3 = 17576 nodes and |E| = 659100 edges!

__________

Python code

In case you are wondering: no, I did not type the 26 symbols of the Latin alphabet in \LaTeX. That would have been way too cumbersome. All it took was one single line of Python code:


", ".join([("\text{" + chr(i) + "}") for i in range(ord('a'),ord('z')+1)])

Suppose we would like to find all strings within Hamming distance of 1 of string (\text{c}, \text{a}, \text{t}). The following Python script automates the process:

def MutateString(s, alphabet, index):
    li = []
    for ch in alphabet:
        if ch != s[index]:
            li.append(s[:index] + ch + s[index+1:])
    return li

# build Latin alphabet (lowercase symbols only)
Sigma = [chr(i) for i in range(ord('a'),ord('z')+1)]

# obtain list of mutated strings
print "Mutate 1st symbol:\n" + ", ".join(MutateString('cat', Sigma, 0)) + '\n'
print "Mutate 2nd symbol:\n" + ", ".join(MutateString('cat', Sigma, 1)) + '\n'
print "Mutate 3rd symbol:\n" + ", ".join(MutateString('cat', Sigma, 2)) + '\n'

The graph of the word “car” and its neighbors was generated also by a Python script using the GvGen library.

def HammingDistance(s1, s2):
    """Computes the Hamming distance between strings s1 and s2"""
    assert len(s1) == len(s2)
    return sum(ch1 != ch2 for ch1, ch2 in zip(s1, s2))

import gvgen

# list of words
words = ['car', 'bar', 'ear', 'far', 'gar', 'jar', 'mar', 'oar', 'par', 'tar', 'war', 'yar', 'cor', 'cur', 'cab', 'cad', 'cam', 'can', 'cap', 'cat', 'caw', 'cay']

# creates the new graph instance
graph = gvgen.GvGen()
graph.smart_mode = 1

# creates graph nodes
nodes = []
for elem in words:
    nodes.append(graph.newItem(elem))

# creates graph edges
for i in nodes:
    for j in nodes:

        # extracts the labels of each node
        node1 = i['properties']['label']
        node2 = j['properties']['label']

        # links nodes i and j if the Hamming
        # distance between their labels is 1
        if HammingDistance(node1,node2) == 1:
            graph.newLink(i,j)

# outputs the Graphviz code to a .dot file
filename = "test.dot"
FileOut = open(filename,"w")
graph.dot(FileOut)
FileOut.close()

The script outputs a DOT file which can be read by Graphviz. In particular, I used GVedit (with layout engine = circo) to generate the diagram depicted above.

Remark: these snippets of code worked with Python 2.5.2. I can not guarantee that they work with other versions.


Follow

Get every new post delivered to your Inbox.

Join 84 other followers