Utilities for working with citable collections
In the CEX format, citedata blocks contain tables of data. Metadata about the structure of these data can be added in citeproperties blocks. Semantic models for those structures can be specified in datamodels blocks. The CexUtils submodule simplifies coordinating these three facets of a collection: data, structural metadata, semantic model.
Data used in these examples
The utility functions documented here are designed to work identically with multiple kinds of CEX sources: pure CEX strings, blocks of CEX data parsed into the CiteEXchange module's Block structure, files with CEX data, or CEX content located at a URL.
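One way a single function can accept several kinds of source is Julia's multiple dispatch on a reader type passed as a final argument. The following is a minimal sketch of that general pattern, not the CexUtils implementation. The types `SketchReader`, `SketchStringReader` and `SketchFileReader` and the functions `sourcetext` and `nonemptylines` are hypothetical names invented for this illustration.

```julia
# Hypothetical marker types standing in for reader types (illustrative names only).
abstract type SketchReader end
struct SketchStringReader <: SketchReader end
struct SketchFileReader <: SketchReader end

# One method per kind of source; each returns the raw text as a String.
sourcetext(src::AbstractString, ::Type{SketchStringReader}) = String(src)
sourcetext(src::AbstractString, ::Type{SketchFileReader}) = read(src, String)

# A utility written once against the text. The reader type defaults to a plain string.
function nonemptylines(src::AbstractString, reader::Type{<:SketchReader} = SketchStringReader)
    filter(!isempty, split(sourcetext(src, reader), "\n"))
end

# Usage: the same function handles a string and a file holding the same text.
text = "#!citedata\nurn|label\n"
path, io = mktemp()
write(io, text)
close(io)
fromstring = nonemptylines(text)
fromfile = nonemptylines(path, SketchFileReader)
```

Because only `sourcetext` depends on the kind of source, supporting another kind of source in this sketch would mean adding one more method rather than changing `nonemptylines`.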
The examples on the following pages use the hmt-2022k.cex published release of the Homer Multitext project, a complex CEX source with roughly 18 MB of plain-text data. A copy of that file is in the test/data directory of this repository.
# `root` is assumed to hold the path to the root directory of this repository
f = joinpath(root, "test", "data", "hmt-2022k.cex")
s = read(f, String)
u = "https://raw.githubusercontent.com/cite-architecture/CitableObject.jl/main/test/data/hmt-2022k.cex"
using CiteEXchange
blks = blocks(s)
# Length in characters:
length(s)
17992785
Find properties of a citable collection
The URN urn:cite2:hmt:msB.v1: identifies a collection of pages of a particular manuscript. Let's use the CEX string to find the structural metadata for that collection, that is, its properties. (Concretely, this means finding the relevant content of all citeproperties blocks in the CEX source.)
using CitableObject
using CitableObject.CexUtils
msbimgs = Cite2Urn("urn:cite2:hmt:msB.v1:")
sprops = properties(s, msbimgs)
5-element Vector{SubString{String}}:
"urn:cite2:hmt:msB.v1.sequence:|Page sequence|Number|"
"urn:cite2:hmt:msB.v1.urn:|URN|Cite2Urn|"
"urn:cite2:hmt:msB.v1.rv:|Recto or Verso|String|recto,verso"
"urn:cite2:hmt:msB.v1.label:|Label|String|"
"urn:cite2:hmt:msB.v1.image:|TBS image|Cite2Urn|"
We get the same result if we read a Vector of Blocks.
bprops = properties(blks, msbimgs)
sprops == bprops
true
We can also read from files or URL sources.
using CitableBase: FileReader
fprops = properties(f, msbimgs, FileReader)
using CitableBase: UrlReader
uprops = properties(u, msbimgs, UrlReader)
fprops == uprops == sprops
true
Find data lines of a citable collection
Now let's find the data for the same collection (that is, the relevant content of all citedata blocks in the CEX source).
We can use a pure string of CEX data.
sdata = collectiondata(s, msbimgs)
length(sdata)
683
Or again, we can read from files, URLs, or lists of Blocks.
bdata = collectiondata(blks, msbimgs)
fdata = collectiondata(f, msbimgs, FileReader)
udata = collectiondata(u, msbimgs, UrlReader)
length(sdata) == length(bdata) == length(fdata) == length(udata)
true
Find collections implementing a datamodel
Like everything else in the CITE architecture, data models are identified by URN. The Homer Multitext project defines a data model for the structure of a manuscript (or codex).
model = Cite2Urn("urn:cite2:hmt:datamodels.v1:codexmodel")
urn:cite2:hmt:datamodels.v1:codexmodel
We can find Cite2Urns for all collections implementing this model.
implementations(s, model)
7-element Vector{Cite2Urn}:
urn:cite2:citebl:burney86pages.v1:
urn:cite2:hmt:e3pages.v1:
urn:cite2:hmt:e4pages.v1:
urn:cite2:citelaur:laur32pages.v1:
urn:cite2:hmt:u4pages.v1:
urn:cite2:hmt:msA.v1:
urn:cite2:hmt:msB.v1:
The results are the same no matter what kind of source we read from.
bmodelurns = implementations(blks, model)
fmodelurns = implementations(f, model, FileReader)
umodelurns = implementations(u, model, UrlReader)
bmodelurns == fmodelurns == umodelurns
true
Find collection data for a datamodel
We could manually collect URNs for collections implementing a data model, then find data for each collection, but with the data_for_model function we can collect all data for all collections implementing a model in a single step.
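For comparison, the manual two-step approach mentioned above can be sketched generically. This is a hedged illustration, not the package's implementation: `gather_manually` is a hypothetical helper, and `toysrc`, `toyimpls` and `toydata` are stand-ins invented so the block runs without any CEX data.

```julia
# Generic two-step gathering: `findimpls(src, model)` should return collection
# identifiers, and `finddata(src, urn)` should return the data lines for one collection.
function gather_manually(findimpls, finddata, src, model)
    [line for urn in findimpls(src, model) for line in finddata(src, urn)]
end

# Stand-in source and lookup functions (hypothetical, for a self-contained run).
toysrc = Dict(
    :models => Dict("model1" => ["collA", "collB"]),
    :data => Dict("collA" => ["1|a"], "collB" => ["1|b", "2|b"]),
)
toyimpls(src, model) = get(src[:models], model, String[])
toydata(src, urn) = get(src[:data], urn, String[])

# Data lines for every collection implementing "model1", in collection order.
gathered = gather_manually(toyimpls, toydata, toysrc, "model1")
```

With the package loaded and `s` and `model` defined as on this page, the corresponding call would be `gather_manually(implementations, collectiondata, s, model)`. Whether that result matches the output of `data_for_model` line for line is not verified here.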
s_implementingdata = data_for_model(s, model)
4254-element Vector{SubString{String}}:
"2|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 1 recto"
"3|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 1 verso"
"4|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 2 recto"
"5|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 2 verso"
"6|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 3 recto"
"7|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 3 verso"
"8|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 4 recto"
"9|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 4 verso"
"10|urn:cite2:citebl:burney86img" ⋯ 77 bytes ⋯ "brary, Burney 86, folio 5 recto"
"11|urn:cite2:citebl:burney86img" ⋯ 77 bytes ⋯ "brary, Burney 86, folio 5 verso"
⋮
"671|urn:cite2:hmt:msB.v1:336r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_335v_336r"
"672|urn:cite2:hmt:msB.v1:336v|v" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_336v_337r"
"673|urn:cite2:hmt:msB.v1:337r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_336v_337r"
"674|urn:cite2:hmt:msB.v1:337v|v" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_337v_338r"
"675|urn:cite2:hmt:msB.v1:338r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_337v_338r"
"676|urn:cite2:hmt:msB.v1:338v|v" ⋯ 81 bytes ⋯ "vbbifolio.pending:338v_backflyr"
"677|urn:cite2:hmt:msB.v1:backfl" ⋯ 89 bytes ⋯ "vbbifolio.pending:338v_backflyr"
"678|urn:cite2:hmt:msB.v1:backfl" ⋯ 94 bytes ⋯ "olio.pending:backflyv_backcover"
"679|urn:cite2:hmt:msB.v1:backco" ⋯ 92 bytes ⋯ "olio.pending:backflyv_backcover"
Of course this works with any CEX source.
b_implementingdata = data_for_model(blks, model)
f_implementingdata = data_for_model(f, model, FileReader)
u_implementingdata = data_for_model(u, model, UrlReader)
s_implementingdata == b_implementingdata == f_implementingdata == u_implementingdata
true
Find a human-readable label for a collection
The cataloglabel function finds the description property of the catalog entry for a given collection. (If the collection is not cataloged, it generates a generic label.)
msb = Cite2Urn("urn:cite2:hmt:msB.v1:")
s_label = cataloglabel(s, msb)
"Venetus B manuscript"
Or from any other source:
b_label = cataloglabel(blocks(s), msb)
f_label = cataloglabel(f, msb, FileReader)
u_label = cataloglabel(u, msb, UrlReader)
s_label == b_label == f_label == u_label
true
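The fallback behavior described for cataloglabel (a generic label when a collection is not cataloged) follows a common lookup-with-default pattern. The sketch below illustrates that pattern only. `toycatalog` and `label_or_generic` are hypothetical names, the second URN is made up, and the wording of the generic label is invented here rather than taken from the package.

```julia
# Hypothetical catalog mapping collection URN strings to descriptions.
toycatalog = Dict("urn:cite2:hmt:msB.v1:" => "Venetus B manuscript")

# Return the cataloged description, or a generic label if there is no entry.
# The wording of the generic label is invented for this sketch.
function label_or_generic(catalog::AbstractDict, urn::AbstractString)
    get(catalog, urn, "Citable collection $(urn)")
end

cataloged = label_or_generic(toycatalog, "urn:cite2:hmt:msB.v1:")
# A made-up URN with no catalog entry falls back to the generic label.
uncataloged = label_or_generic(toycatalog, "urn:cite2:hmt:unknown.v1:")
```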