Utilities for working with citable collections

In the CEX format, citedata blocks contain tables of data. Metadata about the structure of these data can be added in citeproperties blocks. Semantic models for those structures can be specified in datamodels blocks. The CexUtils submodule simplifies coordinating these three facets of a collection: data, structural metadata, semantic model.

Data used in these examples

The utility functions documented here are designed to work identically with multiple kinds of CEX sources: pure CEX strings, blocks of CEX data parsed into the CiteEXchange module's Block structure, files with CEX data, or CEX content located at a URL.
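
One way this kind of source-independence could be arranged in Julia is multiple dispatch on a reader type. The following is a minimal sketch of that pattern only, not the package's implementation: `ToyReader`, `ToyStringReader`, `ToyFileReader`, `cextext` and `countblocks` are hypothetical names invented here, and the CEX snippet is made up.

```julia
# Hypothetical marker types, invented for this sketch.
abstract type ToyReader end
struct ToyStringReader <: ToyReader end
struct ToyFileReader <: ToyReader end

# Normalize each kind of source to a single String of CEX text.
cextext(src::AbstractString, ::Type{ToyStringReader}) = String(src)
cextext(path::AbstractString, ::Type{ToyFileReader}) = read(path, String)

# A toy utility written once against the normalized text:
# it counts block headings, i.e. lines starting with "#!".
function countblocks(src::AbstractString, reader::Type{<:ToyReader} = ToyStringReader)
    count(startswith("#!"), split(cextext(src, reader), "\n"))
end

# Made-up CEX content, used both as a string and as a file.
cex = "#!citedata\nurn|label\n#!citeproperties\nx|y\n"
tmp = tempname()
write(tmp, cex)

countblocks(cex) == countblocks(tmp, ToyFileReader)
```

Because every method funnels into the same normalized text, the utility itself never needs to know where the data came from.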

The examples in the following pages are illustrated using the hmt-2022k.cex published release of the Homer Multitext project, a complex CEX source with roughly 18 MB of plain-text data. A copy of that file is in the test/data directory of this repository.

# `root` is assumed to hold the path to the top-level directory of this repository.
f = joinpath(root, "test", "data", "hmt-2022k.cex")
s = read(f, String)
u = "https://raw.githubusercontent.com/cite-architecture/CitableObject.jl/main/test/data/hmt-2022k.cex"
using CiteEXchange
blks = blocks(s)
# Length in characters:
length(s)
17992785

Find properties of a citable collection

The URN urn:cite2:hmt:msB.v1: identifies a collection of images of a particular manuscript. Let's use the CEX string to find its structural metadata, that is, the properties of that collection. (Concretely, that means finding the relevant content of all citeproperties blocks in the CEX source.)

using CitableObject
using CitableObject.CexUtils
msbimgs = Cite2Urn("urn:cite2:hmt:msB.v1:")
sprops = properties(s, msbimgs)
5-element Vector{SubString{String}}:
 "urn:cite2:hmt:msB.v1.sequence:|Page sequence|Number|"
 "urn:cite2:hmt:msB.v1.urn:|URN|Cite2Urn|"
 "urn:cite2:hmt:msB.v1.rv:|Recto or Verso|String|recto,verso"
 "urn:cite2:hmt:msB.v1.label:|Label|String|"
 "urn:cite2:hmt:msB.v1.image:|TBS image|Cite2Urn|"

We get the same result if we read a Vector of Blocks.

bprops = properties(blks, msbimgs)
sprops == bprops
true

We can also read from files or URL sources.

using CitableBase: FileReader
fprops = properties(f, msbimgs, FileReader)

using CitableBase: UrlReader
uprops = properties(u, msbimgs, UrlReader)

fprops == uprops == sprops
true
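
Each property definition returned above is a pipe-delimited string. Judging from that output, the four columns appear to be a property URN, a label, a type, and an optional comma-separated list of allowed values. A minimal sketch of splitting such lines into named fields follows; `parseproperty` and the field names are hypothetical choices made for this illustration, not part of the package.

```julia
# Split one property definition into named fields.
# The field names are labels chosen for this sketch.
function parseproperty(line::AbstractString)
    cols = split(line, "|")
    length(cols) == 4 || throw(ArgumentError("expected 4 columns, got $(length(cols))"))
    # An empty fourth column means no controlled vocabulary.
    vocab = isempty(cols[4]) ? String[] : String.(split(cols[4], ","))
    (urn = String(cols[1]), label = String(cols[2]),
     datatype = String(cols[3]), vocabulary = vocab)
end

# Two of the lines shown in the output above.
lines = [
    "urn:cite2:hmt:msB.v1.sequence:|Page sequence|Number|",
    "urn:cite2:hmt:msB.v1.rv:|Recto or Verso|String|recto,verso",
]
props = parseproperty.(lines)
props[2].vocabulary
```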

Find data lines of a citable collection

Now let's find the data for the same collection (that is, the relevant content of all citedata blocks in the CEX source).

We can use a pure string of CEX data.

sdata = collectiondata(s, msbimgs)
length(sdata)
683

Or again, we can read from files, URLs, or lists of Blocks.

bdata = collectiondata(blks, msbimgs)
fdata = collectiondata(f, msbimgs, FileReader)
udata = collectiondata(u, msbimgs, UrlReader)

length(sdata) == length(bdata) == length(fdata) == length(udata)
true
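
The data lines are plain delimited strings, so pairing them with column names is a natural next step. The sketch below assumes the columns follow the order of the property listing shown earlier (sequence, urn, rv, label, image); in practice the order should be checked against the source. The function `torecord` is hypothetical, and the sample record is made up rather than taken from hmt-2022k.cex.

```julia
# Column names, assumed here to follow the order of the property listing.
header = ["sequence", "urn", "rv", "label", "image"]

# A made-up record in the same five-column shape.
line = "1|urn:cite2:example:pages.v1:1r|recto|Example page, folio 1 recto|urn:cite2:example:imgs.v1:img1"

# Pair each column name with the corresponding value of one data line.
function torecord(header, line; delimiter = "|")
    cols = split(line, delimiter)
    length(cols) == length(header) || throw(ArgumentError("column count mismatch"))
    Dict(zip(header, String.(cols)))
end

rec = torecord(header, line)
rec["rv"]
```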

Find collections implementing a datamodel

Like everything else in the CITE architecture, data models are identified by a URN. The Homer Multitext project defines a data model for the structure of a manuscript (or codex).

model = Cite2Urn("urn:cite2:hmt:datamodels.v1:codexmodel")
urn:cite2:hmt:datamodels.v1:codexmodel
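
To make the shape of these identifiers concrete, here is a minimal sketch that splits a CITE2 URN string into its colon-delimited parts using plain string handling. It is an illustration only: `cite2parts` and its field names are hypothetical, and the package's own Cite2Urn type is the appropriate tool for real work.

```julia
# Split a CITE2 URN string into namespace, collection and object parts.
function cite2parts(urn::AbstractString)
    parts = split(urn, ":")
    length(parts) == 5 && parts[1] == "urn" && parts[2] == "cite2" ||
        throw(ArgumentError("not a 5-part CITE2 URN: $urn"))
    (namespace = String(parts[3]), collection = String(parts[4]), object = String(parts[5]))
end

# A URN with an object identifier.
cite2parts("urn:cite2:hmt:datamodels.v1:codexmodel")

# A collection-level URN ends with a colon, so its object part is empty.
cite2parts("urn:cite2:hmt:msB.v1:")
```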

We can find Cite2Urns for all collections implementing this model.

implementations(s, model)
7-element Vector{Cite2Urn}:
 urn:cite2:citebl:burney86pages.v1:
 urn:cite2:hmt:e3pages.v1:
 urn:cite2:hmt:e4pages.v1:
 urn:cite2:citelaur:laur32pages.v1:
 urn:cite2:hmt:u4pages.v1:
 urn:cite2:hmt:msA.v1:
 urn:cite2:hmt:msB.v1:

The results are the same no matter what kind of source we read from.

bmodelurns = implementations(blks, model)
fmodelurns = implementations(f, model, FileReader)
umodelurns = implementations(u, model, UrlReader)

bmodelurns == fmodelurns == umodelurns
true

Find collection data for a datamodel

We could manually collect URNs for collections implementing a data model, then find data for each collection, but with the data_for_model function we can collect all data for all collections implementing a model in a single step.

s_implementingdata = data_for_model(s, model)
4254-element Vector{SubString{String}}:
 "2|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 1 recto"
 "3|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 1 verso"
 "4|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 2 recto"
 "5|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 2 verso"
 "6|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 3 recto"
 "7|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 3 verso"
 "8|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 4 recto"
 "9|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 4 verso"
 "10|urn:cite2:citebl:burney86img" ⋯ 77 bytes ⋯ "brary, Burney 86, folio 5 recto"
 "11|urn:cite2:citebl:burney86img" ⋯ 77 bytes ⋯ "brary, Burney 86, folio 5 verso"
 ⋮
 "671|urn:cite2:hmt:msB.v1:336r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_335v_336r"
 "672|urn:cite2:hmt:msB.v1:336v|v" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_336v_337r"
 "673|urn:cite2:hmt:msB.v1:337r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_336v_337r"
 "674|urn:cite2:hmt:msB.v1:337v|v" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_337v_338r"
 "675|urn:cite2:hmt:msB.v1:338r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_337v_338r"
 "676|urn:cite2:hmt:msB.v1:338v|v" ⋯ 81 bytes ⋯ "vbbifolio.pending:338v_backflyr"
 "677|urn:cite2:hmt:msB.v1:backfl" ⋯ 89 bytes ⋯ "vbbifolio.pending:338v_backflyr"
 "678|urn:cite2:hmt:msB.v1:backfl" ⋯ 94 bytes ⋯ "olio.pending:backflyv_backcover"
 "679|urn:cite2:hmt:msB.v1:backco" ⋯ 92 bytes ⋯ "olio.pending:backflyv_backcover"

Of course this works with any CEX source.

b_implementingdata = data_for_model(blks, model)
f_implementingdata = data_for_model(f, model, FileReader)
u_implementingdata = data_for_model(u, model, UrlReader)
s_implementingdata == b_implementingdata == f_implementingdata == u_implementingdata
true
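
For comparison, the manual two-step approach mentioned at the start of this section can be sketched as a small higher-order function. This is an illustration of the idea, not the package's implementation: `manual_data_for_model` is a hypothetical name, and the toy dictionaries hold made-up identifiers and lines so that the sketch runs on its own.

```julia
# `find_implementations` maps a model to collection identifiers;
# `find_data` maps one collection identifier to its data lines.
function manual_data_for_model(find_implementations, find_data, model)
    reduce(vcat, [find_data(c) for c in find_implementations(model)]; init = String[])
end

# Toy stand-ins (made-up identifiers and lines).
toy_impls = Dict("model1" => ["coll1", "coll2"])
toy_data = Dict("coll1" => ["a|1", "a|2"], "coll2" => ["b|1"])

combined = manual_data_for_model(m -> toy_impls[m], c -> toy_data[c], "model1")
```

With the package loaded and `s` and `model` defined as in the examples above, the same function could presumably be called as `manual_data_for_model(m -> implementations(s, m), c -> collectiondata(s, c), model)`. Whether the result matches the output of data_for_model line for line is not something this sketch guarantees.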

Find a human-readable label for a collection

The cataloglabel function finds the description property of the catalog entry for a given collection. (If the collection is not cataloged, it generates a generic label.)

msb = Cite2Urn("urn:cite2:hmt:msB.v1:")
s_label = cataloglabel(s, msb)
"Venetus B manuscript"

Or from any other source:

b_label = cataloglabel(blocks(s), msb)
f_label = cataloglabel(f, msb, FileReader)
u_label = cataloglabel(u, msb, UrlReader)
s_label == b_label == f_label == u_label
true
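
The fallback behavior described above (a generic label when a collection is not cataloged) follows a common lookup-with-default pattern. A minimal sketch, assuming a made-up catalog held in a dictionary; `toylabel` is a hypothetical name and the wording of the generic label is invented here, not the package's.

```julia
# Made-up catalog: collection URN string => description.
catalog = Dict("urn:cite2:example:pages.v1:" => "Example manuscript pages")

# Look up a description, falling back to a generic label
# when the URN is not cataloged. The fallback wording is invented.
toylabel(catalog, urn) = get(catalog, urn, "Citable collection $(urn)")

cataloged = toylabel(catalog, "urn:cite2:example:pages.v1:")
uncataloged = toylabel(catalog, "urn:cite2:example:missing.v1:")
```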

Find URN/label pairs for collections implementing a data model