At the center of the hocuspocus library is the Corpus object, representing a corpus of citable texts. Construct it with two File objects: a CTS TextInventory file, and the root directory of the XML texts. E.g.,
Corpus c = new Corpus(new File("textInventory.xml"), new File("xmlDirectory"))
These will now be available as c.inventory and c.baseDirectory. Several methods allow you to inspect and verify the contents of your digital corpus.
filesInArchive(): lists all XML files contained within c.baseDirectory and its subdirectories (recursively).
filesInInventory(): lists all file names appearing in online@docname values in the text inventory.
urnsInInventory(): lists CTS URNs for all texts identified as “online” in the text inventory.
validateInventory(): validates the text inventory against the published RNG schema.
filesAndInventoryMatch(): returns true if the files listed in the text inventory and the archival XML files match one-to-one.
inventoriedMissingFromArchive(): lists documents marked in the corpus text inventory as online but not appearing in the archive.
filesMissingFromInventory(): lists .xml files in the archive lacking a corresponding “online” entry in the corpus TextInventory.

tabulateRepository(java.io.File outputDir): converts all XML source files in the inventory to OHCO2-equivalent tabular format, and writes output to outputDir.
tokenizeRepository(TokenizationSystem tokenSystem, java.io.File outputDir): creates a two-column text file where each line consists of a token (identified by CTS URN, including subreference) and a type. Both the token and the value of the type depend on the tokenization system selected.
turtleizeRepository(java.io.File outputDir): generates a representation of the entire edited corpus in a single TTL file in outputDir.

EditionGenerator class. A repository-wide method will be added to the Corpus class.
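
To make the archive and inventory checks listed above more concrete, here is a minimal, standalone Java sketch of the kind of comparison that filesAndInventoryMatch(), inventoriedMissingFromArchive() and filesMissingFromInventory() describe. It does not call hocuspocus and is not its implementation: the class ArchiveInventoryCheck, its methods, and the sample file names are all hypothetical, and it assumes that the inventory's file names are paths relative to the base directory.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

/** Standalone illustration only; none of these names belong to hocuspocus. */
public class ArchiveInventoryCheck {

    /** Recursively collects .xml files under baseDir, as paths relative to baseDir. */
    static Set<String> xmlFilesIn(Path baseDir) {
        try (Stream<Path> paths = Files.walk(baseDir)) {
            Set<String> found = new TreeSet<>();
            paths.filter(Files::isRegularFile)
                 .filter(p -> p.getFileName().toString().endsWith(".xml"))
                 // Normalize separators so results compare equal on any platform
                 .forEach(p -> found.add(baseDir.relativize(p).toString().replace('\\', '/')));
            return found;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the elements of left that are absent from right. */
    static Set<String> difference(Set<String> left, Set<String> right) {
        Set<String> result = new TreeSet<>(left);
        result.removeAll(right);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Build a small throwaway archive: one inventoried file, one stray file
        Path baseDir = Files.createTempDirectory("xmlDirectory");
        Files.createDirectories(baseDir.resolve("groupA"));
        Files.createFile(baseDir.resolve("groupA/text1.xml"));
        Files.createFile(baseDir.resolve("stray.xml"));

        // Hypothetical stand-in for the file names recorded in a text inventory
        Set<String> inventoried =
            new TreeSet<>(Arrays.asList("groupA/text1.xml", "groupB/text2.xml"));
        Set<String> archived = xmlFilesIn(baseDir);

        System.out.println("1-1 match: " + archived.equals(inventoried));
        System.out.println("Inventoried but missing from archive: "
            + difference(inventoried, archived));
        System.out.println("In archive but missing from inventory: "
            + difference(archived, inventoried));
    }
}
```

In this sample the two sets differ in both directions, so the match is false, groupB/text2.xml is reported as missing from the archive, and stray.xml is reported as missing from the inventory. Treating both sides as sets of relative paths keeps the three checks down to one equality test and two set differences.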