An alphabet is a finite set of symbols. A word
of length
over an alphabet
is a sequence of symbols from
, i.e.,
Alternatively, we can view a word of length
over the alphabet
as an
-tuple (i.e., an ordered list with
elements) overĀ
, i.e.,
. The set of all finite words over
, including the empty word
, is denoted by
, where
is the Kleene star.
We thus encounter the following problem:
Problem: Given an alphabet , how do we generate all the words over
? In other words, given
, how do we generate the Kleene closure
?
The following Haskell script solves this problem:
g :: [a] -> [[a]] -> [[a]] g alphabet = concat . map (\xs -> [ xs ++ [s] | s <- alphabet]) allwords :: [a] -> [[a]] allwords alphabet = concat $ iterate (g alphabet) [[]]
where alphabet must be a finite list, otherwise the execution of the script won’t ever terminate. The script above, although short, uses a lot of machinery: list comprehension, higher-order functions map and iterate, concatenation of lists of lists, and partial function application. Let us test this script. First, we load it into GHCi. Then we can run a GHCi session:
*Main> -- define alphabet *Main> let alphabet = [0,1] *Main> -- define function f *Main> let f = g alphabet *Main> -- check type *Main> :t f f :: [[Integer]] -> [[Integer]] *Main> -- apply f to several lists of lists *Main> f [[]] [[0],[1]] *Main> (f . f) [[]] [[0,0],[0,1],[1,0],[1,1]] *Main> (f . f . f) [[]] [[0,0,0],[0,0,1],[0,1,0],[0,1,1],[1,0,0],[1,0,1],[1,1,0],[1,1,1]]
So far, so good. Suppose that we would like to find all words over alphabet whose length is less than or equal to
. How many such words are there? There are
words of length
. Hence, there are
binary words of length less than or equal to . Hence, if we make
, then we obtain
. Continuing our GHCi session:
*Main> take 31 $ allwords alphabet [[],[0],[1], [0,0],[0,1],[1,0],[1,1], [0,0,0],[0,0,1],[0,1,0],[0,1,1], [1,0,0],[1,0,1],[1,1,0],[1,1,1], [0,0,0,0],[0,0,0,1],[0,0,1,0],[0,0,1,1], [0,1,0,0],[0,1,0,1],[0,1,1,0],[0,1,1,1], [1,0,0,0],[1,0,0,1],[1,0,1,0],[1,0,1,1], [1,1,0,0],[1,1,0,1],[1,1,1,0],[1,1,1,1]]
Very nice! We could use this in Coding Theory! For example, suppose that we would like to find all binary words of length whose Hamming weight is equal to
. There are
such words. We can find them as follows:
*Main> -- take all words of length less than or equal to 8 *Main> let words = take 511 $ allwords alphabet *Main> -- filter out binary words of length 8 *Main> let wordsL8 = filter (\xs -> length xs == 8) words *Main> length wordsL8 256 *Main> -- filter out binary words of length 8 *Main> -- and also of Hamming weight equal to 5 *Main> let wordsL8H5 = filter (\xs -> sum xs == 5) wordsL8 *Main> length wordsL8H5 56 *Main> wordsL8H5 [[0,0,0,1,1,1,1,1],[0,0,1,0,1,1,1,1],[0,0,1,1,0,1,1,1], [0,0,1,1,1,0,1,1],[0,0,1,1,1,1,0,1],[0,0,1,1,1,1,1,0], [0,1,0,0,1,1,1,1],[0,1,0,1,0,1,1,1],[0,1,0,1,1,0,1,1], [0,1,0,1,1,1,0,1],[0,1,0,1,1,1,1,0],[0,1,1,0,0,1,1,1], [0,1,1,0,1,0,1,1],[0,1,1,0,1,1,0,1],[0,1,1,0,1,1,1,0], [0,1,1,1,0,0,1,1],[0,1,1,1,0,1,0,1],[0,1,1,1,0,1,1,0], [0,1,1,1,1,0,0,1],[0,1,1,1,1,0,1,0],[0,1,1,1,1,1,0,0], [1,0,0,0,1,1,1,1],[1,0,0,1,0,1,1,1],[1,0,0,1,1,0,1,1], [1,0,0,1,1,1,0,1],[1,0,0,1,1,1,1,0],[1,0,1,0,0,1,1,1], [1,0,1,0,1,0,1,1],[1,0,1,0,1,1,0,1],[1,0,1,0,1,1,1,0], [1,0,1,1,0,0,1,1],[1,0,1,1,0,1,0,1],[1,0,1,1,0,1,1,0], [1,0,1,1,1,0,0,1],[1,0,1,1,1,0,1,0],[1,0,1,1,1,1,0,0], [1,1,0,0,0,1,1,1],[1,1,0,0,1,0,1,1],[1,1,0,0,1,1,0,1], [1,1,0,0,1,1,1,0],[1,1,0,1,0,0,1,1],[1,1,0,1,0,1,0,1], [1,1,0,1,0,1,1,0],[1,1,0,1,1,0,0,1],[1,1,0,1,1,0,1,0], [1,1,0,1,1,1,0,0],[1,1,1,0,0,0,1,1],[1,1,1,0,0,1,0,1], [1,1,1,0,0,1,1,0],[1,1,1,0,1,0,0,1],[1,1,1,0,1,0,1,0], [1,1,1,0,1,1,0,0],[1,1,1,1,0,0,0,1],[1,1,1,1,0,0,1,0], [1,1,1,1,0,1,0,0],[1,1,1,1,1,0,0,0]]
where we used function take (again) and higher-order function filter. The two predicates were built using anonymous functions.
I am happy with this script. Please let me know in case you are not.
__________
Related:
- Distance between two words (2009)
Tags: Haskell, Higher-Order Functions, List Processing, Stringology
October 1, 2012 at 23:14 |
Suppose now that we’re doing Molecular Biology and would like to generate all DNA single strands. The alphabet is now
. The following script is a modification of the one I presented in the post above:
Suppose that we want to generate all DNA single strands of length less than or equal to
. Here’s a GHCi session:
October 2, 2012 at 03:43 |
In 1953 Arthur C. Clarke wrote a story about why I don’t like your algorithm. It’s called The Nine Billion Names of God.
:)
October 2, 2012 at 04:14 |
I have been approaching similar questions by inverting regular expressions.
For example [ACGT]{0,3} or [01]{0,3}
Of course the vectors that are generated in this way are not “computable” (“1010″ is not dec(10)) but they could be turned into more computable forms.
You could find implementations in different languages by searching for xeger (the name emerged in the java community but has been used across languages to denote such generators)
My favourite has been Python purely because the re module contains the sre_parse function that can take a regexp and return its “parsing tree”. Walking this tree makes it possible to generate the codes as well as count them.
Here is a nice little module that demonstrates some of these points:
https://github.com/asciimoo/exrex
The discrepancy in Python (and i suppose other languages as well) is that the operator * actually translates to {0,65536}.
I really liked the Haskell examples presented here as well though.