An Introduction


H. Joachim Neuhaus

1. Textual information systems

The design for the Shakespeare Database project is part of an ongoing research programme to develop integrated textual information systems. Instead of textual information systems we could also speak of philological information systems in order to stress a certain continuity of scholarly methods and goals in dealing with texts, even though the methodological potential is much greater and will certainly cause far-reaching changes in the field. No doubt, in the future there will still be a legitimate interest in literary texts, which pursues the single case, the crux, and many philologists may continue in the Notes and Queries tradition. But there will be an electronic docuverse to integrate and transport these efforts.

Such new information systems go beyond traditional electronic text retrieval and take full advantage of the more recent tools of database management, expert systems and the technical possibilities of hypermedia applications. The Shakespeare texts are sufficiently complex objects to make simple demo-solutions quite impractical. But they are still manageable in terms of size and storage space. And the Shakespeare texts have probably been more thoroughly studied, annotated and edited than any other modern text. There is a vast amount of factual information and critical assessment for knowledge-based systems to be built upon.

Another incentive for developing such integrated information systems has been a growing dissatisfaction both with conventional full-text retrieval software and with standard database query languages. Such query languages are often much too complicated as a database interface, not only for ordinary users but also for experienced users who are using systems as heuristic devices. Full-text retrieval, even if enhanced by wildcard options, context specifications, and proximity parameters in the sense of regular expressions, is much too weak a tool for serious linguistic or literary information retrieval. At best it may be used for getting clues, and it requires professional background knowledge to be employed to advantage. It should not be surprising that the impact of full-text retrieval on modern editing or literary criticism has so far been only marginal. There has not yet been an electronic revolution in editing comparable to the revolution brought about by the methods and results of the new bibliography some decades ago. For many authors, as well as for Shakespeare, we are still using commentaries, grammars, and dictionaries first compiled and edited in the 19th century as standard reference works. Quite often these reference works, such as E. A. Abbott, A Shakespearian Grammar (first published in 1869), are nowadays part of an electronic bundle of disparate sources published as a CD-ROM product. In spite of numerous published concordances and dozens of electronic retrieval packages, the electronic text still plays a conspicuously secondary role. There is no published electronic Shakespeare edition which tries to demonstrate the potential of the new philology; all publications are based on prior conventional editions published in book format.
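The weakness of wildcard-style retrieval can be illustrated with a minimal sketch. The sample sentences and the search pattern below are invented for this illustration and are not taken from the Shakespeare texts or from any retrieval package. The query tries to collect forms of the verb "lie" (to recline) by pattern alone.

```python
import re

# Invented sample text (an assumption for illustration, not a Shakespeare quotation).
text = "He lies abed. She told no lies. They lay down. The lieutenant was lying in wait."

# Wildcard-style query: any word beginning with "lie" or "ly".
pattern = re.compile(r"\bl(?:ie|y)\w*", re.IGNORECASE)

hits = pattern.findall(text)
print(hits)  # ['lies', 'lies', 'lieutenant', 'lying']

# Over-matching: the noun "lies" (falsehoods) and "lieutenant" are returned.
# Under-matching: the past-tense form "lay" is missed entirely.
missed = [word for word in ("lay",) if word not in hits]
print(missed)  # ['lay']
```

The pattern has no access to lexical entries or grammatical categories, so the user must already know which hits to discard and which forms to add by hand. That is the professional background knowledge such a search presupposes.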

The problem with textual databases is less prominent simply because there are as yet so few examples. The word database itself is still widely used in the sense of a machine-readable textual archive and not in the established technical sense of an entity-relationship structure. The main disadvantage of these systems in textual applications seems to be that at present the database designer is probably the only really successful database user. This is due to the fact that database queries in conventional systems presuppose a complete understanding of the database architecture, the database entities, their attributes, and the database relations. At a time when for many philologists and literary critics a word is still a word, it seems a bit frivolous to presuppose a structural knowledge of linguistic lemmatization relations versus type-token relations, or of phrasal units versus morphological units, to give just two minor examples of possible database entities. In dealing with Shakespeare there is still some confusion in this matter. The popular BBC television series The Story of English reinforced a cliché by stating "Shakespeare had one of the largest vocabularies of any English writer, some 30,000 words." (McCrum, R. (1986) 102. London: Faber). A prominent Shakespearean, Stanley Wells, mentions Spevack's Concordance under his entry "Vocabulary" and writes: "Spevack's Concordance lists 29,066 different words in Shakespeare's works, and 884,647 words altogether." (Wells, S. (rev. edn. 1985), Shakespeare, An Illustrated Dictionary, 185. Oxford: University Press). "Different words" here probably means text-types (i.e. sets of equiform text-tokens). Shakespeare's vocabulary in the sense of lexical entries (lemmata) is noticeably smaller: fewer than 20,000 lemmata according to Shakespeare Database statistics.
Prior to the use of textual database architectures, such basic questions as the size of an author's vocabulary did not have reliable answers. The database concept introduces a new level of rigour, consistency, and completeness into textual philology, since it presupposes clear definitions of all entities and relations and at the same time enforces these definitions for each and every case. The Shakespeare Database has built such a consistent information structure, and is able to answer conventional questions as well as new kinds of questions which traditionally could not be answered. This is the true new potential for the editorial and critical enterprise.
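The distinction between text-tokens, text-types, and lemmata can be made concrete in a small entity-relationship sketch. The table names, column names, and the six sample tokens below are hypothetical and chosen only for illustration; they do not reproduce the actual Shakespeare Database schema. Each token is linked to exactly one type (its surface form) and one lemma (its lexical entry), so the three counts fall out of three separate queries.

```python
import sqlite3

# Hypothetical toy data: (surface form, lemma) pairs invented for illustration.
tagged_tokens = [
    ("go", "go"), ("goes", "go"), ("went", "go"), ("go", "go"),
    ("love", "love"), ("loves", "love"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lemma (id INTEGER PRIMARY KEY, headword TEXT UNIQUE NOT NULL);
    CREATE TABLE word_type (id INTEGER PRIMARY KEY, form TEXT UNIQUE NOT NULL);
    CREATE TABLE token (
        position INTEGER PRIMARY KEY,
        type_id  INTEGER NOT NULL REFERENCES word_type(id),
        lemma_id INTEGER NOT NULL REFERENCES lemma(id)
    );
""")

for position, (form, headword) in enumerate(tagged_tokens, start=1):
    # Types and lemmata are stored once; repeated forms are ignored on insert.
    conn.execute("INSERT OR IGNORE INTO word_type (form) VALUES (?)", (form,))
    conn.execute("INSERT OR IGNORE INTO lemma (headword) VALUES (?)", (headword,))
    conn.execute(
        """INSERT INTO token (position, type_id, lemma_id)
           VALUES (?, (SELECT id FROM word_type WHERE form = ?),
                      (SELECT id FROM lemma WHERE headword = ?))""",
        (position, form, headword),
    )

def count(sql):
    """Run a single-value COUNT query and return the number."""
    return conn.execute(sql).fetchone()[0]

n_tokens = count("SELECT COUNT(*) FROM token")
n_types = count("SELECT COUNT(DISTINCT type_id) FROM token")
n_lemmata = count("SELECT COUNT(DISTINCT lemma_id) FROM token")
print(n_tokens, n_types, n_lemmata)  # 6 5 2
```

In this toy sample there are six tokens, five types, and two lemmata. The same three-way distinction separates the 884,647 words, the 29,066 different words, and the fewer than 20,000 lemmata cited above. Linking the lemma to the token rather than to the type is a design choice of this sketch: it lets equiform tokens belong to different lexical entries.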

The tremendous success of database technology in recent years can be found in applications outside of textual studies where there is an obvious or clearly defined internal structure. An accounting system, a subscription system for a journal, an inventory system, or a component system for a construction plant: all these database systems have a clearly defined structure, a typical user has straightforward queries, and there is a routine profile of database transactions. This is generally not yet the case for textual databases. But in contrast to full-text retrieval systems, the dissatisfaction with database systems is clearly not due to inherent limitations of the database concept itself. It is the user interface which is currently much too ineffectual and unsatisfactory for these applications.

The SDB project has been using dedicated computer installations since 1990. A testing database was first implemented on a PR1ME™ minicomputer running CODASYL compliant network database software. The database was then systematically amended and successfully migrated to a VAXstation 3100 SPX™ workstation cluster and Digital's rdb™ database software. In 1999 it was implemented on Digital's ALPHA™ processor hardware and Oracle™ database software. Currently it is hosted on a CentOS Linux server installation and accessed mostly using Oracle Application Express (Oracle APEX™) as a rapid web application development tool.