Specifications for hocuspocus, version 0.12.6

Basic contents of a corpus of texts

Constructing a corpus

An archival corpus is made up of a set of text files, and an inventory documenting the citable structure of each document.

Example

We can use this TextInventory file with files in this root directory to construct a Corpus.

The inventory

When serialized as XML, the inventory validates against a Relax NG schema.

For a given corpus, we can determine:

Examples

In the example corpus defined above, the inventory contains entries for 3 files.

The file names are:

Index  File path
0      Iliad-A.xml
1      Iliad-Butler.xml
2      tier2/Iliad-B.xml

Their URNs are:

Index  URN
0      urn:cts:greekLit:tlg0012.tlg001.butler:
1      urn:cts:greekLit:tlg0012.tlg001.msA:
2      urn:cts:greekLit:tlg0012.tlg001.msB:

The archive of files

For a given corpus, we can determine:

Examples

In the example corpus defined above, the inventory contains entries for 3 files.

These files are found in the file system:

Index  File path
0      Iliad-A.xml
1      Iliad-Butler.xml
2      tier2/Iliad-B.xml
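
A listing like the one above could be produced by walking the root directory. The following is a minimal Python sketch, not hocuspocus code: the function name `list_xml_files` is hypothetical, and it assumes that archival files are exactly those with an `.xml` extension and that paths are reported relative to the root, in sorted order.

```python
from pathlib import Path


def list_xml_files(root):
    """Return sorted paths of .xml files under root, relative to root.

    Hypothetical helper for illustration; paths use forward slashes
    so that nested files read as, e.g., 'tier2/Iliad-B.xml'.
    """
    root_path = Path(root)
    return sorted(
        p.relative_to(root_path).as_posix()
        for p in root_path.rglob("*.xml")  # search subdirectories too
        if p.is_file()
    )
```

Sorting makes the index of each file stable across runs, which is what allows an example to refer to "item 0" of a list.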

Validating a corpus

We can determine whether the files listed in the inventory have a one-to-one relation to the XML files in the directory hierarchy. We can also get the names of documents identified in the inventory but not found on disk, and the names of files found on disk but not identified in the inventory.
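
The comparison described here amounts to two set differences. The sketch below illustrates the idea in Python; it is not the hocuspocus API, and the function name `compare_inventory_to_disk` and the keys of the returned dictionary are assumptions made for this example.

```python
def compare_inventory_to_disk(inventory_files, disk_files):
    """Compare file names listed in an inventory with file names on disk.

    Hypothetical helper for illustration. Both arguments are iterables
    of relative path strings; duplicates are collapsed by the set
    conversion.
    """
    inventory = set(inventory_files)
    on_disk = set(disk_files)
    return {
        # True only when both sides name exactly the same files
        "matches": inventory == on_disk,
        # listed in the inventory but not found on disk
        "missing_from_disk": sorted(inventory - on_disk),
        # found on disk but not listed in the inventory
        "missing_from_inventory": sorted(on_disk - inventory),
    }
```

For instance, assuming an inventory that lists only `Iliad-A.xml` (an assumption for illustration) against the three files on disk:

```python
disk = ["Iliad-A.xml", "Iliad-Butler.xml", "tier2/Iliad-B.xml"]
result = compare_inventory_to_disk(["Iliad-A.xml"], disk)
# result["matches"] is False
# result["missing_from_inventory"] is ["Iliad-Butler.xml", "tier2/Iliad-B.xml"]
```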

Examples

One-to-one match. In the example corpus defined above, the files and inventory do match (have a one-to-one correspondence).

Files on disk missing from inventory. If we use this TextInventory file with the same set of archival files, we can construct a valid Corpus, even though it contains only 1 entry for an online file. We can verify that files listed in the inventory and files on disk do not match, and can determine that 2 files in the file system do not appear in the inventory, and that the first item (item 0) in the list of missing files is Iliad-Butler.xml.

Files in inventory not found in file system. If, with the same set of archival files, we use a TextInventory listing additional files as online, we can still construct a valid Corpus, even though it contains 3 entries. We can verify that files listed in the inventory and files on disk do not match, and can determine that 1 file listed in the inventory does not appear in the file system, and that the first item in the list of missing files is Iliad-C.xml.