Utilities for working with citable collections
In the CEX format, citedata blocks contain tables of data. Metadata about the structure of these data can be added in citeproperties blocks. Semantic models for those structures can be specified in datamodels blocks. The CexUtils submodule simplifies coordinating these three facets of a collection: data, structural metadata, semantic model.
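To make the three block types concrete, here is a minimal sketch in plain Julia. It assumes the CEX convention that each block starts with a `#!label` heading line. The URNs and values are invented for illustration, and `split_cex_blocks` is a hypothetical helper, not a function of CiteEXchange or CexUtils.

```julia
# Hypothetical miniature CEX source with one block of each kind.
# All URNs and values are invented for illustration.
cex = """
#!citeproperties
Property|Label|Type|Authority list
urn:cite2:demo:pages.v1.urn:|URN|Cite2Urn|
urn:cite2:demo:pages.v1.rv:|Recto or Verso|String|recto,verso

#!citedata
urn|rv
urn:cite2:demo:pages.v1:1r|recto
urn:cite2:demo:pages.v1:1v|verso

#!datamodels
Collection|Model|Label|Description
urn:cite2:demo:pages.v1:|urn:cite2:demo:datamodels.v1:codexmodel|Codex|Pages of a codex
"""

# Group non-empty lines under the most recent `#!label` heading.
# Hypothetical helper: the real parser is `blocks` in CiteEXchange.
function split_cex_blocks(src::AbstractString)
    result = Pair{String, Vector{String}}[]
    for ln in split(src, '\n')
        stripped = strip(ln)
        isempty(stripped) && continue
        if startswith(stripped, "#!")
            # Start a new block named by the text after "#!".
            push!(result, String(stripped[3:end]) => String[])
        elseif !isempty(result)
            # Attach the line to the block currently being read.
            push!(last(result).second, String(stripped))
        end
    end
    result
end

demo_blocks = split_cex_blocks(cex)
block_labels = first.(demo_blocks)
```

Here `block_labels` lists the block headings in source order, and each block keeps its own lines, which is the grouping the utility functions on this page rely on.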
Data used in these examples
The utility functions documented here are designed to work identically with multiple kinds of CEX sources: pure CEX strings, blocks of CEX data parsed into the CiteEXchange module's Block structure, files with CEX data, or CEX content located at a URL.
The examples in the following pages are illustrated using the hmt-2022k.cex published release of the Homer Multitext project, a complex CEX source with roughly 18 MB of plain-text data. A copy of that file is in the test/data directory of this repository.
# `root` is assumed to hold the path to the root directory of this repository.
f = joinpath(root, "test", "data", "hmt-2022k.cex")
s = read(f, String)
u = "https://raw.githubusercontent.com/cite-architecture/CitableObject.jl/main/test/data/hmt-2022k.cex"
using CiteEXchange
blks = blocks(s)
# Length in characters:
length(s)
17992785
Find properties of a citable collection
urn:cite2:hmt:msB.v1: identifies a collection of images of a particular manuscript. Let's use the CEX string to find structural metadata – the properties of that collection. (Concretely, that means finding the relevant content of all citeproperties blocks in the CEX source.)
using CitableObject
using CitableObject.CexUtils
msbimgs = Cite2Urn("urn:cite2:hmt:msB.v1:")
sprops = properties(s, msbimgs)
5-element Vector{SubString{String}}:
"urn:cite2:hmt:msB.v1.sequence:|Page sequence|Number|"
"urn:cite2:hmt:msB.v1.urn:|URN|Cite2Urn|"
"urn:cite2:hmt:msB.v1.rv:|Recto or Verso|String|recto,verso"
"urn:cite2:hmt:msB.v1.label:|Label|String|"
"urn:cite2:hmt:msB.v1.image:|TBS image|Cite2Urn|"
We get the same result if we read a Vector of Blocks.
bprops = properties(blks, msbimgs)
sprops == bprops
true
We can also read from files or URL sources.
using CitableBase: FileReader
fprops = properties(f, msbimgs, FileReader)
using CitableBase: UrlReader
uprops = properties(u, msbimgs, UrlReader)
fprops == uprops == sprops
true
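Each line returned by properties is a pipe-delimited record. As a minimal sketch, the code below splits one such line into named fields, reading the four columns as property URN, label, type, and a comma-separated list of permitted values. That reading of the columns is an assumption based on the output shown above, and `parse_property` is a hypothetical helper, not part of CexUtils.

```julia
# One line as returned by `properties` in the example above.
propline = "urn:cite2:hmt:msB.v1.rv:|Recto or Verso|String|recto,verso"

# Hypothetical helper: split a property definition into named fields.
function parse_property(line::AbstractString)
    cols = split(line, '|')  # keeps a trailing empty column
    length(cols) == 4 ||
        throw(ArgumentError("expected 4 columns, got $(length(cols))"))
    (
        urn = String(cols[1]),
        label = String(cols[2]),
        datatype = String(cols[3]),
        # An empty fourth column is read as "no list of permitted values".
        authority = isempty(cols[4]) ? String[] : String.(split(cols[4], ',')),
    )
end

rv = parse_property(propline)
```

With this sketch, `rv.authority` holds the two permitted values `recto` and `verso`, while a line with an empty last column yields an empty list.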
Find data lines of a citable collection
Now let's find the data for the same collection (that is, the relevant content of all citedata blocks in the CEX source).
We can use a pure string of CEX data.
sdata = collectiondata(s, msbimgs)
length(sdata)
683
Or again, we can read from files, URLs, or lists of Blocks.
bdata = collectiondata(blks, msbimgs)
fdata = collectiondata(f, msbimgs, FileReader)
udata = collectiondata(u, msbimgs, UrlReader)
length(sdata) == length(bdata) == length(fdata) == length(udata)
true
Find collections implementing a datamodel
Like everything else in the CITE architecture, data models are identified by URN. The Homer Multitext project defines a data model for the structure of a manuscript (or codex).
model = Cite2Urn("urn:cite2:hmt:datamodels.v1:codexmodel")
urn:cite2:hmt:datamodels.v1:codexmodel
We can find Cite2Urns for all collections implementing this model.
implementations(s, model)
7-element Vector{Cite2Urn}:
urn:cite2:citebl:burney86pages.v1:
urn:cite2:hmt:e3pages.v1:
urn:cite2:hmt:e4pages.v1:
urn:cite2:citelaur:laur32pages.v1:
urn:cite2:hmt:u4pages.v1:
urn:cite2:hmt:msA.v1:
urn:cite2:hmt:msB.v1:
The results are the same no matter what kind of source we read from.
bmodelurns = implementations(blks, model)
fmodelurns = implementations(f, model, FileReader)
umodelurns = implementations(u, model, UrlReader)
bmodelurns == fmodelurns == umodelurns
true
Find collection data for a datamodel
We could manually collect URNs for collections implementing a data model, then find data for each collection, but with the data_for_model function we can collect all data for all collections implementing a model in a single step.
s_implementingdata = data_for_model(s, model)
4254-element Vector{SubString{String}}:
"2|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 1 recto"
"3|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 1 verso"
"4|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 2 recto"
"5|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 2 verso"
"6|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 3 recto"
"7|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 3 verso"
"8|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 4 recto"
"9|urn:cite2:citebl:burney86imgs" ⋯ 76 bytes ⋯ "brary, Burney 86, folio 4 verso"
"10|urn:cite2:citebl:burney86img" ⋯ 77 bytes ⋯ "brary, Burney 86, folio 5 recto"
"11|urn:cite2:citebl:burney86img" ⋯ 77 bytes ⋯ "brary, Burney 86, folio 5 verso"
⋮
"671|urn:cite2:hmt:msB.v1:336r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_335v_336r"
"672|urn:cite2:hmt:msB.v1:336v|v" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_336v_337r"
"673|urn:cite2:hmt:msB.v1:337r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_336v_337r"
"674|urn:cite2:hmt:msB.v1:337v|v" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_337v_338r"
"675|urn:cite2:hmt:msB.v1:338r|r" ⋯ 80 bytes ⋯ ":vbbifolio.pending:vb_337v_338r"
"676|urn:cite2:hmt:msB.v1:338v|v" ⋯ 81 bytes ⋯ "vbbifolio.pending:338v_backflyr"
"677|urn:cite2:hmt:msB.v1:backfl" ⋯ 89 bytes ⋯ "vbbifolio.pending:338v_backflyr"
"678|urn:cite2:hmt:msB.v1:backfl" ⋯ 94 bytes ⋯ "olio.pending:backflyv_backcover"
"679|urn:cite2:hmt:msB.v1:backco" ⋯ 92 bytes ⋯ "olio.pending:backflyv_backcover"
Of course this works with any CEX source.
b_implementingdata = data_for_model(blks, model)
f_implementingdata = data_for_model(f, model, FileReader)
u_implementingdata = data_for_model(u, model, UrlReader)
s_implementingdata == b_implementingdata == f_implementingdata == u_implementingdata
true
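The two manual steps that data_for_model replaces can be sketched with toy data in plain Julia. Everything here is hypothetical: the URNs and rows are invented, and the `toy_*` functions only mimic the roles of implementations, collectiondata and data_for_model without parsing any CEX. Whether the real function assembles its result in exactly this way is not something this sketch claims.

```julia
# Invented stand-ins for what a parsed CEX source records.
codex = "urn:cite2:demo:datamodels.v1:codexmodel"

# (collection URN, data model URN) pairs, as a datamodels block would relate them.
modelpairs = [
    ("urn:cite2:demo:pagesA.v1:", codex),
    ("urn:cite2:demo:images.v1:", "urn:cite2:demo:datamodels.v1:imagemodel"),
    ("urn:cite2:demo:pagesB.v1:", codex),
]

# (collection URN, data row) pairs, as citedata blocks would supply them.
datarows = [
    ("urn:cite2:demo:pagesA.v1:", "1|urn:cite2:demo:pagesA.v1:1r|recto"),
    ("urn:cite2:demo:pagesA.v1:", "2|urn:cite2:demo:pagesA.v1:1v|verso"),
    ("urn:cite2:demo:images.v1:", "urn:cite2:demo:images.v1:img1|Image 1"),
    ("urn:cite2:demo:pagesB.v1:", "1|urn:cite2:demo:pagesB.v1:1r|recto"),
]

# Step 1: collections implementing a model.
toy_implementations(model) = [c for (c, m) in modelpairs if m == model]

# Step 2: data rows belonging to one collection.
toy_collectiondata(coll) = [row for (c, row) in datarows if c == coll]

# Both steps combined: rows of every collection implementing the model.
function toy_data_for_model(model)
    perCollection = [toy_collectiondata(c) for c in toy_implementations(model)]
    reduce(vcat, perCollection; init = String[])
end

codexrows = toy_data_for_model(codex)
```

In this toy setup `codexrows` contains the three rows of the two page collections and skips the image collection, which implements a different model.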
Find a human-readable label for a collection
The cataloglabel function finds the description property of the catalog entry for a given collection. (If the collection is not cataloged, it generates a generic label.)
msb = Cite2Urn("urn:cite2:hmt:msB.v1:")
s_label = cataloglabel(s, msb)
"Venetus B manuscript"
Or from any other source:
b_label = cataloglabel(blocks(s), msb)
f_label = cataloglabel(f, msb, FileReader)
u_label = cataloglabel(u, msb, UrlReader)
s_label == b_label == f_label == u_label
true