## Overview ##
Most generally, a text corpus management system stores texts in their archival format, and operates on those archival texts to prepare standard kinds of output.
In traditional collections of print or other physical resources, archives can store unpublished material. Digital resources, on the other hand, must be published in various formats, including their underlying archival format. The two most fundamental forms of access to archived texts are:
Publishing and managing dependencies on texts in archival formats is no different from publishing and managing code or documentation libraries, and can be implemented with a repository manager. The simplest form of export for use in a CTS is the tabular data format supported by the CHS implementation of Canonical Text Services. We can implement the two requirements above as:
How the texts in a corpus should be further processed depends on both the content (including its language and markup schema) and the intended uses of the processed data. Many batch processing operations are therefore best left to specific applications, but some are so generic that they are worth building into a corpus management system. Those generic operations include:
Requirements are:
Based on the language identifier for the text, tokens should be classified. For Greek and Latin texts, classification categories should include:
All parsed string tokens should be mapped to one or more possible lexical entity identifiers. These lexical entities should be referred to by a CITE Object URN identifying the lexical entity within a collection of entities for a given language. The end result of stemming therefore will be a mapping of CTS URNs to CITE Object URNs.
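
The mapping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the system's actual implementation: the `LEXICON` table, the `map_tokens` function, the `urn:cite:example:latlex` collection, and the token-level CTS URNs are all hypothetical names invented for the example. Each token is identified by a CTS URN and looked up by its surface form, and every candidate lexical entity is kept so that ambiguous forms map to more than one CITE Object URN.

```python
# Hypothetical stem table: surface form -> candidate lexical entity URNs.
# The "urn:cite:example:latlex" collection is invented for illustration.
LEXICON = {
    "arma": ["urn:cite:example:latlex.lex1"],
    "virum": ["urn:cite:example:latlex.lex2"],
    # An ambiguous form keeps every candidate analysis.
    "cano": ["urn:cite:example:latlex.lex3", "urn:cite:example:latlex.lex4"],
}


def map_tokens(tokens, lexicon):
    """Map each token's CTS URN to a list of CITE Object URNs.

    `tokens` is an iterable of (cts_urn, surface_form) pairs. Tokens with
    no known analysis are kept with an empty list so that gaps in coverage
    stay visible in the result.
    """
    mapping = {}
    for cts_urn, form in tokens:
        # Copy the candidate list so callers cannot mutate the lexicon.
        mapping[cts_urn] = list(lexicon.get(form.lower(), []))
    return mapping


# Assumed token-level CTS URNs for the first line of a Latin text.
TOKENS = [
    ("urn:cts:latinLit:phi0690.phi003:1.1@arma[1]", "Arma"),
    ("urn:cts:latinLit:phi0690.phi003:1.1@virum[1]", "virum"),
    ("urn:cts:latinLit:phi0690.phi003:1.1@cano[1]", "cano"),
    ("urn:cts:latinLit:phi0690.phi003:1.1@Troiae[1]", "Troiae"),
]

STEM_INDEX = map_tokens(TOKENS, LEXICON)
```

A real stemmer would replace the dictionary lookup with a morphological parser, but the shape of the result would be the same: CTS URNs as keys, and one or more CITE Object URNs as values.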
While recognition of named entities can be incorporated into language-specific token classification, disambiguation of named entities may be best left for later processing, which might approach disambiguation differently depending on specific contexts within a large corpus.
For supported schemas, the system should also support export of individual texts or schemas as: