A more detailed walkthrough

Throughout these examples, we'll use a small sample file called hmtextract.cex in the test/data directory of this repository.

Since the file is in CEX format, we'll use the fromcex function from CitableBase to create different kinds of objects in the CitableCollection package.

using CitableBase, CitableCollection
# `root` is assumed to hold the path to the root directory of this repository.
f = joinpath(root, "test", "data", "hmtextract.cex")

Reading a catalog of collections

We can read data in citecollections blocks into a single catalog comprising all collections cataloged in the CEX source.

catalog = fromcex(f, CiteCollectionCatalog, FileReader, delimiter = "#")
Catalog  of 2 citable collections

The catalog is a citable collection. Let's get an idea of what's in it.

for coll in catalog
    println(coll)
end
CITE data models
Escorial Y 1.1 manuscript

Reading tables of data

Strict parsing

We can read data from citedata blocks into a series of RawDataCollections. By default, the fromcex function looks for property definitions in the citeproperties blocks of the CEX source, and requires that each column in each table have a corresponding entry. It then uses the information from citeproperties to create an appropriate schema for the resulting table.

strictly = fromcex(f, RawDataCollection, FileReader, delimiter = "#")
2-element Vector{RawDataCollection}:
 Citable collection of 5 items with schema specified from `citeproperties` settings.
 Citable collection of 5 items with schema specified from `citeproperties` settings.
Return types of `fromcex`

Note that while using fromcex to instantiate data for a CiteCollectionCatalog always returns a single CiteCollectionCatalog object, instantiating data for a RawDataCollection returns a Vector of RawDataCollections, since each collection could have a different schema.
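
The requirement that strict parsing places on each table, that every column have a corresponding citeproperties entry, can be pictured with a minimal sketch. The function undefinedcolumns and the sample header line are hypothetical illustrations, not part of CitableCollection and not copied from hmtextract.cex.

```julia
# Hypothetical check, not the package's implementation: list the columns of a
# citedata header line that have no matching property definition.
function undefinedcolumns(header::AbstractString, propertynames; delimiter = "#")
    columns = Symbol.(split(header, delimiter))
    setdiff(columns, propertynames)
end

# Property names as they appear in the schema shown in this walkthrough.
defined = [:sequence, :image, :urn, :rv, :label]

# Assumed header lines for illustration only.
complete = undefinedcolumns("sequence#image#urn#rv#label", defined)
incomplete = undefinedcolumns("sequence#image#urn#rv#label#caption", defined)
```

Under these assumptions, an empty result corresponds to a table that strict parsing can accept, while any leftover column name marks a table that would need strict = false.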

We can use the Tables package to examine the schema of a table.

using Tables
Tables.schema(strictly[2])
Tables.Schema:
 :sequence  Int64
 :image     String
 :urn       CitableObject.Cite2Urn
 :rv        InlineStrings.String7
 :label     InlineStrings.String31

The metadata in the citeproperties block looks like this:

#!citeproperties
Property#Label#Type#Authority list
urn:cite2:hmt:e3pages.v1.sequence:#Page sequence#Number#
urn:cite2:hmt:e3pages.v1.image:#TBS image#Cite2Urn#
urn:cite2:hmt:e3pages.v1.urn:#URN#Cite2Urn#
urn:cite2:hmt:e3pages.v1.rv:#Recto or Verso#String#recto,verso
urn:cite2:hmt:e3pages.v1.label:#Label#String#

Notice that fromcex chooses appropriate Julia types for the generic Number and String type indications, and converts the CEX data to URN types where they are indicated.
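
To make the structure of the citeproperties block concrete, here is a minimal sketch that splits each line into its four delimited fields using only Base Julia. The function parseproperties and its field names are hypothetical; this is not how CitableCollection reads the block, only an illustration of what information each line carries.

```julia
# The citeproperties lines shown above, without the block label.
block = """
Property#Label#Type#Authority list
urn:cite2:hmt:e3pages.v1.sequence:#Page sequence#Number#
urn:cite2:hmt:e3pages.v1.image:#TBS image#Cite2Urn#
urn:cite2:hmt:e3pages.v1.urn:#URN#Cite2Urn#
urn:cite2:hmt:e3pages.v1.rv:#Recto or Verso#String#recto,verso
urn:cite2:hmt:e3pages.v1.label:#Label#String#
"""

# Hypothetical helper: turn each property line into a NamedTuple.
function parseproperties(block::AbstractString; delimiter = "#")
    # Drop empty lines, then skip the header line.
    datalines = filter(!isempty, split(block, "\n"))[2:end]
    map(datalines) do ln
        cols = split(ln, delimiter)
        # The column name is the last period-separated piece of the
        # property URN, without its trailing colon.
        propname = Symbol(rstrip(last(split(cols[1], ".")), ':'))
        # An empty fourth field means no controlled vocabulary.
        authlist = isempty(cols[4]) ? String[] : String.(split(cols[4], ","))
        (name = propname, label = String(cols[2]), cextype = String(cols[3]), authlist = authlist)
    end
end

props = parseproperties(block)
```

Each resulting entry pairs a column name with its declared CEX type, which is the information strict parsing needs in order to build a schema.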

Lazy parsing

The CEX standard says that any single CEX block constitutes a valid CEX source. If you have a CEX source including citedata blocks, but no corresponding citeproperties blocks, you can still create RawDataCollections from them by setting the strict parameter to false.

lazily = fromcex(f, RawDataCollection, FileReader, delimiter = "#", strict = false)
2-element Vector{RawDataCollection}:
 Citable collection of 5 items with automatically inferred schema.
 Citable collection of 5 items with automatically inferred schema.

When parsing lazily, fromcex converts the contents of a column named urn to type Cite2Urn; for all other columns, it chooses types based on the column contents alone. Notice that as a result, URN values in any other column (here, image) are treated as string data.

Tables.schema(lazily[2])
Tables.Schema:
 :sequence  Int64
 :image     String
 :urn       CitableObject.Cite2Urn
 :rv        InlineStrings.String7
 :label     InlineStrings.String31
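
To see why content-based inference leaves URN values as strings, consider this minimal sketch. The function infercolumn is hypothetical and is not how CitableCollection infers types; it only illustrates the idea of choosing a type from a column's contents alone.

```julia
# Hypothetical sketch: use Int if every value parses as an integer,
# otherwise keep the column as strings.
function infercolumn(values::Vector{<:AbstractString})
    parsed = tryparse.(Int, values)
    any(isnothing, parsed) ? String.(values) : Int.(parsed)
end

# Values in the style of the sequence and image columns shown above.
sequence = infercolumn(["1", "2", "3"])
image = infercolumn(["urn:cite2:hmt:e3bifolio.v1:E3_1v_2r", "urn:cite2:hmt:e3bifolio.v1:E3_2v_3r"])
```

Nothing in the text of a URN distinguishes it from any other string, so without a citeproperties entry declaring the column as Cite2Urn, inference of this kind has no basis for converting it.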

Reading cataloged collections from a CEX source

If your CEX source includes both citeproperties data for the schema of each collection and a catalog of metadata for your collections, you can create a Vector of CatalogedCollections from the CEX.

cclist = fromcex(f, CatalogedCollection, FileReader, delimiter = "#")
2-element Vector{CatalogedCollection}:
 CITE data models
A cataloged collection containing 5 citable objects
 Escorial Y 1.1 manuscript
A cataloged collection containing 5 citable objects

Each CatalogedCollection has both a unique catalog entry and a raw data collection with a schema derived from its citeproperties information.

In other words, the schema is produced by strict parsing.

Tables.schema(cclist[2])
Tables.Schema:
 :sequence  Int64
 :image     String
 :urn       CitableObject.Cite2Urn
 :rv        InlineStrings.String7
 :label     InlineStrings.String31

The associated catalog information makes the CatalogedCollection a citable object.

label(cclist[2])
"Escorial Y 1.1 manuscript"
urn(cclist[2])
urn:cite2:hmt:e3pages.v1:

Querying collections

The CatalogedCollection is also a citable collection, so you can filter it using URN logic or by directly applying filter, map, or other generic Julia functions to it.

We could select data based on a version-agnostic URN, for example:

genericurn = dropversion(urn(cclist[2]))
urncontains(genericurn, cclist[2])
5-element Vector{NamedTuple{(:sequence, :image, :urn, :rv, :label), Tuple{Int64, String, CitableObject.Cite2Urn, InlineStrings.String7, InlineStrings.String31}}}:
 (sequence = 1, image = "urn:cite2:hmt:e3bifolio.v1:E3_1v_2r", urn = urn:cite2:hmt:e3pages.v1:1v, rv = "verso", label = "Escorial Y 1.1, folio 1 verso")
 (sequence = 2, image = "urn:cite2:hmt:e3bifolio.v1:E3_1v_2r", urn = urn:cite2:hmt:e3pages.v1:2r, rv = "recto", label = "Escorial Y 1.1, folio 2 recto")
 (sequence = 3, image = "urn:cite2:hmt:e3bifolio.v1:E3_2v_3r", urn = urn:cite2:hmt:e3pages.v1:2v, rv = "verso", label = "Escorial Y 1.1, folio 2 verso")
 (sequence = 4, image = "urn:cite2:hmt:e3bifolio.v1:E3_2v_3r", urn = urn:cite2:hmt:e3pages.v1:3r, rv = "recto", label = "Escorial Y 1.1, folio 3 recto")
 (sequence = 5, image = "urn:cite2:hmt:e3bifolio.v1:E3_3v_4r", urn = urn:cite2:hmt:e3pages.v1:3v, rv = "verso", label = "Escorial Y 1.1, folio 3 verso")
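
The idea behind this version-agnostic query can be sketched at the string level. The functions dropversion_sketch and contains_sketch below are hypothetical stand-ins written for illustration; the real dropversion and urncontains operate on Cite2Urn objects, and their exact matching rules may differ from this simplified version.

```julia
# Split a CITE2 URN string into its five colon-delimited components.
components(urn::AbstractString) = split(urn, ":")

# Hypothetical stand-in for `dropversion`: remove the version from the
# collection component ("e3pages.v1" becomes "e3pages").
function dropversion_sketch(urn::AbstractString)
    parts = components(urn)
    collection = first(split(parts[4], "."))
    join([parts[1:3]..., collection, parts[5]], ":")
end

# Hypothetical stand-in for URN containment: the namespace must match, every
# collection level the container specifies must match, and an empty object
# component in the container matches any object.
function contains_sketch(container::AbstractString, urn::AbstractString)
    c, u = components(container), components(urn)
    c[3] == u[3] || return false
    clevels, ulevels = split(c[4], "."), split(u[4], ".")
    n = length(clevels)
    n <= length(ulevels) || return false
    clevels == ulevels[1:n] || return false
    isempty(c[5]) || c[5] == u[5]
end

pages = ["urn:cite2:hmt:e3pages.v1:1v", "urn:cite2:hmt:e3pages.v1:2r"]
generic = dropversion_sketch("urn:cite2:hmt:e3pages.v1:")
matches = filter(u -> contains_sketch(generic, u), pages)
```

In this sketch the versionless URN matches pages in any version of the e3pages collection, but not the image URNs, which belong to the e3bifolio collection.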

Since we have already examined the schema, we could use that knowledge to select only recto pages.

filter(r -> r.rv == "recto", cclist[2])
2-element Vector{Any}:
 (sequence = 2, image = "urn:cite2:hmt:e3bifolio.v1:E3_1v_2r", urn = urn:cite2:hmt:e3pages.v1:2r, rv = InlineStrings.String7("recto"), label = InlineStrings.String31("Escorial Y 1.1, folio 2 recto"))
 (sequence = 4, image = "urn:cite2:hmt:e3bifolio.v1:E3_2v_3r", urn = urn:cite2:hmt:e3pages.v1:3r, rv = InlineStrings.String7("recto"), label = InlineStrings.String31("Escorial Y 1.1, folio 3 recto"))

RawDataCollections (and therefore also CatalogedCollections) support all functions that can be applied to the TypedTables.Table type. You can work with them directly in for loops, apply operations like group and reduce, or use higher-level packages like Query.jl.
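
As a minimal sketch of this row-oriented style, the following uses plain NamedTuples, with values copied from the output above, as a stand-in for the rows of a collection. It relies only on Base Julia; the assumption is that the same loops and reductions apply to the rows of a RawDataCollection, since they are shown above as NamedTuples with these fields.

```julia
# Stand-in rows with a subset of the fields shown in the query results above.
rows = [
    (sequence = 1, rv = "verso", label = "Escorial Y 1.1, folio 1 verso"),
    (sequence = 2, rv = "recto", label = "Escorial Y 1.1, folio 2 recto"),
    (sequence = 3, rv = "verso", label = "Escorial Y 1.1, folio 2 verso"),
    (sequence = 4, rv = "recto", label = "Escorial Y 1.1, folio 3 recto"),
    (sequence = 5, rv = "verso", label = "Escorial Y 1.1, folio 3 verso"),
]

# A for loop that counts pages on each side of a folio.
sidecounts = Dict{String, Int}()
for r in rows
    sidecounts[r.rv] = get(sidecounts, r.rv, 0) + 1
end

# A reduction over one field: the highest sequence number.
lastsequence = mapreduce(r -> r.sequence, max, rows)

# A comprehension that selects labels of recto pages.
rectolabels = [r.label for r in rows if r.rv == "recto"]
```

The hand-written counting loop does the work that a grouping operation would do in a single call; it is spelled out here only to keep the sketch free of additional packages.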