A number of fundamental computing science problems have been studied since the 1950s and 1960s. For each of these problems, numerous solutions (in the form of algorithms) have been developed over the years. In these collections of solutions, we can identify the following three deficiencies:- Algorithms solving the same problem are difficult to compare to one another. This could be due to the use of different programming languages, paradigms, styles of presentation | or simply the addition of unnecessary details.
- Collections of algorithm implementations solving a problem are difficult to find. Some of the algorithms are presented in a relatively obsolete manner, either using a now-defunct notation or obsolete programming language, making it difficult to either implement the algorithm or find an existing implementation.
- Little is known about the comparative practical running time performance of the algorithms. The lack of existing implementations in one and the same framework, especially of some of the older algorithms, has made it difficult to determine the running time characteristics of the algorithms. Selection of an algorithm must then be made on the basis of the theoretical running time, or simply by guessing.
In this dissertation, a solution to each of these deficiencies is presented. To present the solutions, we use the following three fundamental computing science problems:- Keyword pattern matching in strings. Given a finite non-empty set of keywords (the patterns) and an input string, find the set of all occurrences of a keyword as a substring of the input string.
- Finite automata (FA) construction. Given a regular expression, construct a finite automaton which accepts the language denoted by the regular expression.
- Deterministic finite automata (DFA) minimization. Given a DFA, construct the unique minimal DFA accepting the same language.
In the following paragraphs, we will outline the solutions presented for each of the deficiencies.
The difficulty in comparing algorithms is overcome by creating a taxonomy of algorithms for a given problem. Each of the algorithms is rewritten in a common notation and is examined to determine the essential ingredients that distinguish it from any other algorithm. These ingredients (known as details) can take the form of problem details (a restriction on the class of problems solved), or algorithm details (some correctness-preserving change to the algorithm to improve efficiency). Once each algorithm has been reduced in such a way, it can be characterized by its set of details. In presenting the taxonomy, the common details of several algorithms can be factored and presented together. In this fashion, a 'family tree' of the algorithms is constructed, showing clearly what any two algorithms have in common and where they differ. Because the root of the family tree is a nave, and correct, algorithm and the details are applied in a correctness-preserving manner, the correctness argument for each of the algorithms is implicit in the taxonomy.
The common notation and presentations in the taxonomies enable us to implement the algorithms uniformly, in the form of a class library (also known as a toolkit). The factoring of essential details, inherent in the taxonomies, leads to factoring of common components in the inheritance hierarchy of the class library. Object-oriented concepts, such as virtual (or deferred) member functions, inheritance and template classes, prove to be useful in presenting a coherent class library, the structure of which reflects the taxonomy from which it was created. For the first time, most (if not all) solutions are presented in a single class library, giving clients of the library a large choice of objects and functions.
With a class library that contains most of the known solutions, we are finally able to gather data on the performance of th
e algorithms in practice. Since the algorithms are taken from a single class library which was implemented by one person, and the quality of implementation of the class library is homogeneous, the relative performance data gathered (comparing algorithms) is not biased by the implementation. This performance data allows software engineers to make more informed decisions (based upon their needs, and the characteristics of their input data) concerning which algorithm and therefore which objects and functions to use in their applications.
The development of the taxonomies has not been without its spin-offs. In each of the three taxonomies presented, significant new algorithms have been developed. These algorithms are also implemented in the corresponding class libraries. The techniques developed in the taxonomy of pattern matching algorithms proved to be particularly useful in deriving an algorithm for regular expression pattern matching, and in doing so, answering an open question posed by A.V. Aho in [Aho80, p. 342].
[Aho80] A.V. Aho. Pattern matching in strings. In R.V. Book, ed., Formal Language Theory: Perspectives and Open Problems, pages 325-347, Academic Press, New York, 1980. |