Generating all words over an alphabet

An alphabet \Sigma is a finite set of symbols. A word w of length n over an alphabet \Sigma is a sequence of symbols from \Sigma, i.e.,

w : \{0,1,\dots,n-1\} \to \Sigma

Alternatively, we can view a word w of length n over the alphabet \Sigma as an n-tuple (i.e., an ordered list with n elements) overĀ \Sigma, i.e., w \in \Sigma^n. The set of all finite words over \Sigma, including the empty word \varepsilon, is denoted by \Sigma^*, where * is the Kleene star.

We thus encounter the following problem:

Problem: Given an alphabet \Sigma, how do we generate all the words over \Sigma? In other words, given \Sigma, how do we generate the Kleene closure \Sigma^*?

The following Haskell script solves this problem:

g :: [a] -> [[a]] -> [[a]]
g alphabet = concat . map (\xs -> [ xs ++ [s] | s <- alphabet])

allwords :: [a] -> [[a]]
allwords alphabet = concat $ iterate (g alphabet) [[]]

where alphabet must be a finite list, otherwise the execution of the script won’t ever terminate. The script above, although short, uses a lot of machinery: list comprehension, higher-order functions map and iterate, concatenation of lists of lists, and partial function application. Let us test this script. First, we load it into GHCi. Then we can run a GHCi session:

*Main> -- define alphabet
*Main> let alphabet = [0,1]
*Main> -- define function f
*Main> let f = g alphabet
*Main> -- check type
*Main> :t f
f :: [[Integer]] -> [[Integer]]
*Main> -- apply f to several lists of lists
*Main> f [[]]
[[0],[1]]
*Main> (f . f) [[]]
[[0,0],[0,1],[1,0],[1,1]]
*Main> (f . f . f) [[]]
[[0,0,0],[0,0,1],[0,1,0],[0,1,1],[1,0,0],[1,0,1],[1,1,0],[1,1,1]]

So far, so good. Suppose that we would like to find all words over alphabet \Sigma_2 := \{0,1\} whose length is less than or equal to 4. How many such words are there? There are |\Sigma_2|^k = 2^k words of length k. Hence, there are

\displaystyle\sum_{k=0}^n 2^k = \frac{2^{n+1} -1}{2-1} = 2^{n+1} - 1

binary words of length less than or equal to n. Hence, if we make n = 4, then we obtain 2^{n+1}-1 = 31. Continuing our GHCi session:

*Main> take 31 $ allwords alphabet
[[],[0],[1],
[0,0],[0,1],[1,0],[1,1],
[0,0,0],[0,0,1],[0,1,0],[0,1,1],
[1,0,0],[1,0,1],[1,1,0],[1,1,1],
[0,0,0,0],[0,0,0,1],[0,0,1,0],[0,0,1,1],
[0,1,0,0],[0,1,0,1],[0,1,1,0],[0,1,1,1],
[1,0,0,0],[1,0,0,1],[1,0,1,0],[1,0,1,1],
[1,1,0,0],[1,1,0,1],[1,1,1,0],[1,1,1,1]]

Very nice! We could use this in Coding Theory! For example, suppose that we would like to find all binary words of length 8 whose Hamming weight is equal to 5. There are

\displaystyle\binom{8}{5} = \frac{8!}{3! 5!} = 56

such words. We can find them as follows:

*Main> -- take all words of length less than or equal to 8
*Main>  let words = take 511 $ allwords alphabet
*Main> -- filter out binary words of length 8
*Main> let wordsL8 = filter (\xs -> length xs == 8) words
*Main> length wordsL8
256
*Main> -- filter out binary words of length 8
*Main> -- and also of Hamming weight equal to 5
*Main> let wordsL8H5 = filter (\xs -> sum xs == 5) wordsL8
*Main> length wordsL8H5
56
*Main> wordsL8H5
[[0,0,0,1,1,1,1,1],[0,0,1,0,1,1,1,1],[0,0,1,1,0,1,1,1],
 [0,0,1,1,1,0,1,1],[0,0,1,1,1,1,0,1],[0,0,1,1,1,1,1,0],
 [0,1,0,0,1,1,1,1],[0,1,0,1,0,1,1,1],[0,1,0,1,1,0,1,1],
 [0,1,0,1,1,1,0,1],[0,1,0,1,1,1,1,0],[0,1,1,0,0,1,1,1],
 [0,1,1,0,1,0,1,1],[0,1,1,0,1,1,0,1],[0,1,1,0,1,1,1,0],
 [0,1,1,1,0,0,1,1],[0,1,1,1,0,1,0,1],[0,1,1,1,0,1,1,0],
 [0,1,1,1,1,0,0,1],[0,1,1,1,1,0,1,0],[0,1,1,1,1,1,0,0],
 [1,0,0,0,1,1,1,1],[1,0,0,1,0,1,1,1],[1,0,0,1,1,0,1,1],
 [1,0,0,1,1,1,0,1],[1,0,0,1,1,1,1,0],[1,0,1,0,0,1,1,1],
 [1,0,1,0,1,0,1,1],[1,0,1,0,1,1,0,1],[1,0,1,0,1,1,1,0],
 [1,0,1,1,0,0,1,1],[1,0,1,1,0,1,0,1],[1,0,1,1,0,1,1,0],
 [1,0,1,1,1,0,0,1],[1,0,1,1,1,0,1,0],[1,0,1,1,1,1,0,0],
 [1,1,0,0,0,1,1,1],[1,1,0,0,1,0,1,1],[1,1,0,0,1,1,0,1],
 [1,1,0,0,1,1,1,0],[1,1,0,1,0,0,1,1],[1,1,0,1,0,1,0,1],
 [1,1,0,1,0,1,1,0],[1,1,0,1,1,0,0,1],[1,1,0,1,1,0,1,0],
 [1,1,0,1,1,1,0,0],[1,1,1,0,0,0,1,1],[1,1,1,0,0,1,0,1],
 [1,1,1,0,0,1,1,0],[1,1,1,0,1,0,0,1],[1,1,1,0,1,0,1,0],
 [1,1,1,0,1,1,0,0],[1,1,1,1,0,0,0,1],[1,1,1,1,0,0,1,0],
 [1,1,1,1,0,1,0,0],[1,1,1,1,1,0,0,0]]

where we used function take (again) and higher-order function filter. The two predicates were built using anonymous functions.

I am happy with this script. Please let me know in case you are not.

__________

Related:

About these ads

Tags: , , ,

3 Responses to “Generating all words over an alphabet”

  1. Rod Carvalho Says:

    Suppose now that we’re doing Molecular Biology and would like to generate all DNA single strands. The alphabet is now \Sigma = \{A,C,G,T\}. The following script is a modification of the one I presented in the post above:

    data Nucleotide = A | C | G | T deriving (Show, Eq)
    
    
    alphabet :: [Nucleotide]
    alphabet = [A,C,G,T]
    
    
    g :: [[Nucleotide]] -> [[Nucleotide]]
    g = concat . map (\xs -> [ xs ++ [s] | s <- alphabet])
    
    
    allwords :: [[Nucleotide]]
    allwords = concat $ iterate g [[]]
    

    Suppose that we want to generate all DNA single strands of length less than or equal to 3. Here’s a GHCi session:

    *Main> take 85 allwords
    [[],[A],[C],[G],[T],
    [A,A],[A,C],[A,G],[A,T],[C,A],[C,C],[C,G],[C,T],
    [G,A],[G,C],[G,G],[G,T],[T,A],[T,C],[T,G],[T,T],
    [A,A,A],[A,A,C],[A,A,G],[A,A,T],[A,C,A],[A,C,C],
    [A,C,G],[A,C,T],[A,G,A],[A,G,C],[A,G,G],[A,G,T],
    [A,T,A],[A,T,C],[A,T,G],[A,T,T],[C,A,A],[C,A,C],
    [C,A,G],[C,A,T],[C,C,A],[C,C,C],[C,C,G],[C,C,T],
    [C,G,A],[C,G,C],[C,G,G],[C,G,T],[C,T,A],[C,T,C],
    [C,T,G],[C,T,T],[G,A,A],[G,A,C],[G,A,G],[G,A,T],
    [G,C,A],[G,C,C],[G,C,G],[G,C,T],[G,G,A],[G,G,C],
    [G,G,G],[G,G,T],[G,T,A],[G,T,C],[G,T,G],[G,T,T],
    [T,A,A],[T,A,C],[T,A,G],[T,A,T],[T,C,A],[T,C,C],
    [T,C,G],[T,C,T],[T,G,A],[T,G,C],[T,G,G],[T,G,T],
    [T,T,A],[T,T,C],[T,T,G],[T,T,T]]
    
  2. vfp15 Says:

    In 1953 Arthur C. Clarke wrote a story about why I don’t like your algorithm. It’s called The Nine Billion Names of God.

    :)

  3. AA Says:

    I have been approaching similar questions by inverting regular expressions.

    For example [ACGT]{0,3} or [01]{0,3}

    Of course the vectors that are generated in this way are not “computable” (“1010″ is not dec(10)) but they could be turned into more computable forms.

    You could find implementations in different languages by searching for xeger (the name emerged in the java community but has been used across languages to denote such generators)

    My favourite has been Python purely because the re module contains the sre_parse function that can take a regexp and return its “parsing tree”. Walking this tree makes it possible to generate the codes as well as count them.

    Here is a nice little module that demonstrates some of these points:
    https://github.com/asciimoo/exrex

    The discrepancy in Python (and i suppose other languages as well) is that the operator * actually translates to {0,65536}.

    I really liked the Haskell examples presented here as well though.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

Join 77 other followers

%d bloggers like this: