Citable collections

Summary

The task: We want to create a type for working with a collection of the citable books we developed on the previous page. We should be able to filter the collection by applying URN logic to the identifiers for our books. We should be able to write our collection to plain-text format and re-instantiate it from the plain-text representation. And we should be able to apply any Julia functions for working with iterable content to our book list.

The implementation:

  • define a new type for a collection of citable books, the ReadingList type
  • identify it as a citable collection (the CitableCollectionTrait)
  • implement filtering the collection using URN logic (the UrnComparisonTrait)
  • implement round-trip serialization (the CexTrait)
  • make the collection available to all Julia functions working with iterable content (Iterators)

Defining the ReadingList

Our model for a reading list is simple: it's just a Vector of citable publications. We'll annotate our vector as containing subtypes of the abstract CitablePublication we previously defined, even though in this example we'll only use our one concrete implementation, the CitableBook. As with our other custom types, we'll override Base.show.

struct ReadingList
    publications::Vector{<: CitablePublication}
end
import Base: show
function show(io::IO, readingList::ReadingList)
    print(io, "ReadingList with ", length(readingList.publications), " items")
end
show (generic function with 288 methods)

Let's see an example.

books = [distantbook, enumerationsbook, wrongbook, qibook]
rl = ReadingList(books)
ReadingList with 4 items
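
The field annotation Vector{<: CitablePublication} deserves a closer look. The following is a minimal sketch using stand-in types (DemoPublication, DemoBook and DemoList are hypothetical names, not the types defined in this tutorial). It shows why the `<:` form is needed: Julia's type parameters are invariant, so a vector of a concrete subtype is not a Vector of the abstract type, but it does match the `<:` pattern.

```julia
# Stand-in types for illustration only; not the tutorial's definitions.
abstract type DemoPublication end

struct DemoBook <: DemoPublication
    title::String
end

struct DemoList
    publications::Vector{<: DemoPublication}
end

# The array literal is inferred as Vector{DemoBook}.
demobooks = [DemoBook("First title"), DemoBook("Second title")]
demolist = DemoList(demobooks)

# Type parameters are invariant, so this subtype test is false ...
invariant = Vector{DemoBook} <: Vector{DemoPublication}
# ... while the `<:` form used in the field annotation accepts the vector.
matches = Vector{DemoBook} <: Vector{<: DemoPublication}
```

Because the vector already matches the annotation, the constructor stores it as-is rather than copying it.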

The publications field is just a normal Julia Vector.

rl.publications[4]
Christopher W. Forstall and Walter J. Scheirer, *Quantitative Intertextuality: Analyzing the Markers of Information Reuse* (urn:isbn10:3030234133)

What will distinguish a ReadingList from a plain Vector is that it will support a series of CITE traits.

Implementing the CitableCollectionTrait

We first want to identify our new type as fulfilling the requirements of a citable collection with the CitableCollectionTrait. We'll repeat the pattern:

  1. define a singleton type for the trait value.
  2. override the function identifying the trait value for our new type. Here the function is named citablecollectiontrait, and we'll define it to return the concrete value CitableReadingList for the type ReadingList.
struct CitableReadingList <: CitableCollectionTrait end

import CitableBase: citablecollectiontrait
function citablecollectiontrait(::Type{ReadingList})
    CitableReadingList()
end
citablecollectiontrait (generic function with 2 methods)
citablecollectiontrait(typeof(rl))
Main.CitableReadingList()

Use the citablecollection function to test if a specific object is a citable collection.

citablecollection(rl)
true
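
For readers new to this trait pattern, here is a self-contained sketch of how a boolean test like citablecollection could be derived from a trait function. All names (ToyCollectionTrait, toytrait, toycollection, ToyShelf) are hypothetical stand-ins, and CitableBase's actual definitions may differ.

```julia
# Hypothetical stand-ins; CitableBase's real definitions may differ.
abstract type ToyCollectionTrait end
struct ToyNotCollection <: ToyCollectionTrait end   # default trait value
struct ToyIsCollection <: ToyCollectionTrait end    # opt-in trait value

# Fallback: every type is "not a collection" unless it opts in.
toytrait(::Type) = ToyNotCollection()

# Boolean test on an instance, derived from the trait value of its type.
toycollection(x) = !(toytrait(typeof(x)) isa ToyNotCollection)

struct ToyShelf
    items::Vector{String}
end

# Opting in takes a single, more specific method definition.
toytrait(::Type{ToyShelf}) = ToyIsCollection()

shelfresult = toycollection(ToyShelf(["a", "b"]))   # true
intresult = toycollection(42)                        # false
```

The design choice here is that the trait is attached to the type, not to individual instances, so the check costs nothing more than a method dispatch.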

Like citable objects, citable collections should report the type of URN they use for citation.

import CitableBase: urntype
function urntype(readingList::ReadingList)
    Isbn10Urn
end
urntype (generic function with 4 methods)
urntype(rl)
Main.Isbn10Urn

The promise we now need to fulfill is that our collection will implement three further traits for URN comparison, serialization and iteration.

Implementing the UrnComparisonTrait

We have previously implemented the UrnComparisonTrait for an identifier type (the Isbn10Urn) and for a citable object type (the CitableBook). In both of those cases, we compared two objects of the same type, and returned a boolean result of comparing them on URN logic.

For our citable collection, we will implement the same suite of functions, but with a different signature and result type. This time, our first parameter will be a URN which we will use to filter the collection given in the second parameter. The result will be a (possibly empty) list of content in our citable collection – in this example, a list of CitableBooks.

We mark our ReadingList type as urn-comparable exactly as we did for Isbn10Urns and CitableBooks.

struct ReadingListComparable <: UrnComparisonTrait end
function urncomparisontrait(::Type{ReadingList})
    ReadingListComparable()
end
urncomparisontrait (generic function with 6 methods)
urncomparable(rl)
true

Implementing the required functions urnequals, urncontains and urnsimilar

To implement the required functions, we'll just lean on the work we've already done: we'll use the boolean version of those functions to filter our collections.

function urnequals(urn::Isbn10Urn, reading::ReadingList)
    filter(item -> urnequals(item.urn, urn), reading.publications)
end

function urncontains(urn::Isbn10Urn, reading::ReadingList)
    filter(item -> urncontains(item.urn, urn), reading.publications)
end

function urnsimilar(urn::Isbn10Urn, reading::ReadingList)
    filter(item -> urnsimilar(item.urn, urn), reading.publications)
end
urnsimilar (generic function with 6 methods)

If your collection does not allow duplicate identifiers, urnequals should return a list of 0 or 1 item.

urnequals(distanthorizons, rl)
1-element Vector{Main.CitableBook}:
 Ted Underwood, *Distant Horizons: Digital Evidence and Literary Change* (urn:isbn10:022661283X)

Three of the books in our list are published in the English-language zone, and therefore will satisfy urnsimilar when compared to Distant Horizons.

urnsimilar(distanthorizons, rl)
3-element Vector{Main.CitableBook}:
 Ted Underwood, *Distant Horizons: Digital Evidence and Literary Change* (urn:isbn10:022661283X)
 Andrew Piper, *Enumerations: Data and Literary Study* (urn:isbn10:022656875X)
 Andrew Piper, *Can We Be Wrong? The Problem of Textual Evidence in a Time of Data* (urn:isbn10:1108922036)

But only two are published in the same ISBN area code as Distant Horizons:

urncontains(distanthorizons, rl)
2-element Vector{Main.CitableBook}:
 Ted Underwood, *Distant Horizons: Digital Evidence and Literary Change* (urn:isbn10:022661283X)
 Andrew Piper, *Enumerations: Data and Literary Study* (urn:isbn10:022656875X)

Implementing the CexTrait

As we did with citable objects, we want to ensure that we can round-trip an entire collection to and from delimited-text format. We'll make our new ReadingList type implement CexTrait in the same way as CitableBook.

struct ReadingListCex <: CexTrait end
function cextrait(::Type{ReadingList})
    ReadingListCex()
end
cextrait (generic function with 5 methods)
cexserializable(rl)
true

Implementing the required functions cex and fromcex

We will serialize our collection with a header line identifying it as a citecollection block, followed by one line for each book in our list. We can format the books' data by mapping each book to an invocation of the cex function that we previously wrote for CitableBooks.

function cex(reading::ReadingList; delimiter = "|")
    header = "#!citecollection\n"
    strings = map(ref -> cex(ref, delimiter=delimiter), reading.publications)
    header * join(strings, "\n")
end
cex (generic function with 5 methods)
cexoutput = cex(rl)
println(cexoutput)
#!citecollection
urn:isbn10:022661283X|Distant Horizons: Digital Evidence and Literary Change|Ted Underwood
urn:isbn10:022656875X|Enumerations: Data and Literary Study|Andrew Piper
urn:isbn10:1108922036|Can We Be Wrong? The Problem of Textual Evidence in a Time of Data|Andrew Piper
urn:isbn10:3030234133|Quantitative Intertextuality: Analyzing the Markers of Information Reuse|Christopher W. Forstall and Walter J. Scheirer

Recall from our experience implementing CEX serialization for CitableBooks that we will need to expose three mandatory parameters for fromcex: the trait value, the CEX data and the Julia type we want to instantiate.

function fromcex(trait::ReadingListCex, cexsrc::AbstractString, T;
    delimiter = "|", configuration = nothing, strict = true)

    # Ignore empty lines in the source.
    lines = split(cexsrc, "\n")
    datalines = filter(ln -> !isempty(ln), lines)
    books = CitableBook[]
    inblock = false
    for ln in datalines
        if ln == "#!citecollection"
            # Every line after the header is treated as book data.
            inblock = true
        elseif inblock
            bk = fromcex(ln, CitableBook)
            push!(books, bk)
        end
    end
    ReadingList(books)
end
fromcex (generic function with 9 methods)
Warning

To keep this example brief and avoid introducing other packages, our implementation of fromcex naively assumes cexsrc will contain a single CEX block introduced by the #!citecollection heading. This would break on real-world CEX data sources: in a real application, we would instead use the CiteEXchange package to parse and extract appropriate blocks. See the documentation of CiteEXchange, or look at how a package like CitableCorpus uses CiteEXchange in its implementation of fromcex for different data types.
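
To make the limitation concrete, here is a rough sketch of what extracting a single labelled block from a multi-block source could look like. The helper toyblocklines is hypothetical and is not part of CitableBase or CiteEXchange; the sample source is made-up data, and the sketch assumes only that every block begins with a line starting with #!.

```julia
# Hypothetical helper; not the CiteEXchange API.
function toyblocklines(cexsrc::AbstractString, label::AbstractString)
    collected = String[]
    inblock = false
    for ln in split(cexsrc, "\n")
        stripped = strip(ln)
        if startswith(stripped, "#!")
            # Any header line either opens the block we want or closes it.
            inblock = stripped == "#!" * label
        elseif inblock && !isempty(stripped)
            push!(collected, String(stripped))
        end
    end
    collected
end

# Made-up source with three blocks; only the middle one holds book data.
src = """
#!cexversion
3.0

#!citecollection
urn:isbn10:022661283X|Distant Horizons: Digital Evidence and Literary Change|Ted Underwood

#!ctsdata
urn:cts:demo:group.work:1|Some unrelated text
"""

cexlines = toyblocklines(src, "citecollection")
```

Unlike the naive loop, this version stops collecting when it reaches the next block header, so the unrelated ctsdata line is never handed to the book parser.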

Once again, we can now invoke fromcex with just the parameters for the CEX data and desired Julia type to create, and CitableBase will find our implementation.

fromcex(cexoutput, ReadingList)
ReadingList with 4 items
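
The round-trip property itself can be checked in a few lines. The sketch below uses a stand-in record type and hypothetical functions (CexDemoBook, democex, demofromcex) so that it runs without CitableBase. It assumes, as the examples on this page do, that no field contains the delimiter.

```julia
# Stand-in record type; not the tutorial's CitableBook.
struct CexDemoBook
    urn::String
    title::String
    authors::String
end

# Serialize one record as a delimited line.
democex(bk::CexDemoBook; delimiter = "|") =
    join([bk.urn, bk.title, bk.authors], delimiter)

# Serialize a list: one header line, then one line per record.
function democex(books::Vector{CexDemoBook}; delimiter = "|")
    lines = map(bk -> democex(bk; delimiter = delimiter), books)
    "#!citecollection\n" * join(lines, "\n")
end

# Parse the serialized form back into records.
# Assumes no field contains the delimiter.
function demofromcex(src::AbstractString; delimiter = "|")
    books = CexDemoBook[]
    for ln in split(src, "\n")
        (isempty(ln) || startswith(ln, "#!")) && continue
        parts = split(ln, delimiter)
        push!(books, CexDemoBook(parts[1], parts[2], parts[3]))
    end
    books
end

original = [
    CexDemoBook("urn:isbn10:022661283X", "Distant Horizons: Digital Evidence and Literary Change", "Ted Underwood"),
    CexDemoBook("urn:isbn10:022656875X", "Enumerations: Data and Literary Study", "Andrew Piper"),
]
restored = demofromcex(democex(original))
roundtrips = democex(restored) == democex(original)   # true
```

A check like this is a cheap regression test for any serialization pair: serialize, parse, serialize again, and compare the two strings.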

Free bonus!

CitableBase optionally allows you to include a third parameter to the fromcex function naming the type of reader to apply to the first string parameter. Valid values are StringReader, FileReader or UrlReader. The previous example relied on the default value of StringReader. The following examples use the file RL/test/data/dataset.cex in this repository; its contents are the output of cex(rl) above.

fname = joinpath(root, "RL", "test", "data", "dataset.cex")
fileRL = fromcex(fname, ReadingList, FileReader)
ReadingList with 4 items
url = "https://raw.githubusercontent.com/cite-architecture/CitableBase.jl/dev/RL/test/data/dataset.cex"
urlRL = fromcex(url, ReadingList, UrlReader)
ReadingList with 4 items
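
One way to picture the reader parameter is as a type used purely for dispatch: each reader decides how the first string argument is turned into text. The sketch below is an assumption-laden illustration with hypothetical names (ToyReader, ToyStringReader, ToyFileReader, toyread, toyload), not CitableBase's implementation, and it leaves out URL reading to stay offline.

```julia
# Hypothetical reader types used only for dispatch.
abstract type ToyReader end
struct ToyStringReader <: ToyReader end
struct ToyFileReader <: ToyReader end

# Each reader interprets the same string argument differently.
toyread(src::AbstractString, ::Type{ToyStringReader}) = String(src)     # src is the data
toyread(src::AbstractString, ::Type{ToyFileReader}) = read(src, String) # src is a path

# The reader defaults to treating the string as the data itself.
function toyload(src::AbstractString, reader::Type{<:ToyReader} = ToyStringReader)
    split(toyread(src, reader), "\n"; keepempty = false)
end

# Write sample data to a temporary file, then load it both ways.
sample = "a|b\nc|d\n"
path, io = mktemp()
write(io, sample)
close(io)

fromstring = toyload(sample)
fromfile = toyload(path, ToyFileReader)
```

Both calls yield the same two lines; only the way the source string is interpreted changes.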

Implementing required and optional functions from Base.Iterators

The Iterators module in Julia Base was one of the first traits or interfaces in Julia. It allows you to apply the same functions to many types of iterable collections. We need to import the Base.iterate function, and implement two versions of it for our new type: one with a single parameter for the collection, and one with a second parameter maintaining some kind of state information. Both of them have the same return type: either nothing, or a Tuple pairing one item in the collection with state information.

Since our reading list is keeping books in a Vector internally, we can use the state parameter to pass along an index into the Vector. In the single-parameter version of iterate, we'll return the first item in the list, and set the "state" value to 2. In the two-parameter version, we'll return the item indexed by the state count, and bump the count up by one.

import Base: iterate

function iterate(rlist::ReadingList)
    isempty(rlist.publications) ? nothing : (rlist.publications[1], 2)
end

function iterate(rlist::ReadingList, state)
    state > length(rlist.publications) ? nothing : (rlist.publications[state], state + 1)
end
iterate (generic function with 254 methods)
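
It may help to see how Julia uses these two methods. A for loop is roughly equivalent to calling iterate repeatedly until it returns nothing. The sketch below spells that out with a stand-in type (DemoShelf and manualloop are hypothetical names) whose iterate methods follow the same index-as-state approach as our ReadingList.

```julia
# Stand-in collection wrapping a Vector, like the ReadingList.
struct DemoShelf
    items::Vector{String}
end

# Same index-as-state approach as the ReadingList methods.
Base.iterate(s::DemoShelf) = isempty(s.items) ? nothing : (s.items[1], 2)
Base.iterate(s::DemoShelf, state) =
    state > length(s.items) ? nothing : (s.items[state], state + 1)

# Roughly what `for item in itr ... end` does behind the scenes.
function manualloop(itr)
    seen = String[]
    next = iterate(itr)
    while next !== nothing
        item, state = next
        push!(seen, item)
        next = iterate(itr, state)
    end
    seen
end

shelf = DemoShelf(["first", "second", "third"])
viamanual = manualloop(shelf)

# The ordinary for loop visits the same items in the same order.
viaforloop = String[]
for item in shelf
    push!(viaforloop, item)
end
```

The state value is opaque to the caller: Julia only passes it back to iterate, so the collection is free to use an index, a cursor, or anything else.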

It is also useful (and trivial) to implement the optional methods for the length and element type of the collection.

import Base: length
function length(readingList::ReadingList)
    length(readingList.publications)
end


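import Base: eltype
# Defined on the type, as the iteration interface expects;
# eltype(rl) falls back to eltype(typeof(rl)).
function eltype(::Type{ReadingList})
    CitablePublication
end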
eltype (generic function with 80 methods)
length(rl)
4
eltype(rl)
Main.CitablePublication

Now our ReadingList type is usable with all the richness of the Julia interface for iterators. Just a few examples:

  • for loops
for item in rl
    println(item)
end
Ted Underwood, *Distant Horizons: Digital Evidence and Literary Change* (urn:isbn10:022661283X)
Andrew Piper, *Enumerations: Data and Literary Study* (urn:isbn10:022656875X)
Andrew Piper, *Can We Be Wrong? The Problem of Textual Evidence in a Time of Data* (urn:isbn10:1108922036)
Christopher W. Forstall and Walter J. Scheirer, *Quantitative Intertextuality: Analyzing the Markers of Information Reuse* (urn:isbn10:3030234133)
  • checking for presence of an item
distantbook in rl
true
  • collect contents without having to know anything about the internal structure of the type
collect(rl)
4-element Vector{Main.CitablePublication}:
 Ted Underwood, *Distant Horizons: Digital Evidence and Literary Change* (urn:isbn10:022661283X)
 Andrew Piper, *Enumerations: Data and Literary Study* (urn:isbn10:022656875X)
 Andrew Piper, *Can We Be Wrong? The Problem of Textual Evidence in a Time of Data* (urn:isbn10:1108922036)
 Christopher W. Forstall and Walter J. Scheirer, *Quantitative Intertextuality: Analyzing the Markers of Information Reuse* (urn:isbn10:3030234133)
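
A few more generic functions, sketched on a stand-in type so the example runs without the CITE packages. DemoPlaylist is a hypothetical name; it implements the same three methods (iterate, length and eltype) and nothing else.

```julia
# Stand-in iterable type implementing only iterate, length and eltype.
struct DemoPlaylist
    tracks::Vector{String}
end

# A default value for `state` defines both iterate methods at once.
Base.iterate(p::DemoPlaylist, state = 1) =
    state > length(p.tracks) ? nothing : (p.tracks[state], state + 1)
Base.length(p::DemoPlaylist) = length(p.tracks)
Base.eltype(::Type{DemoPlaylist}) = String

pl = DemoPlaylist(["intro", "verse", "chorus"])

firsttrack = first(pl)                              # "intro"
fiveletters = count(t -> length(t) == 5, pl)        # 2
shouted = map(uppercase, pl)                        # ["INTRO", "VERSE", "CHORUS"]
longest = maximum(length, pl)                       # 6
# Lazy filtering comes from Base.Iterators; collect materializes the result.
longtracks = collect(Iterators.filter(t -> length(t) > 5, pl))   # ["chorus"]
numbered = collect(enumerate(pl))                   # pairs of (index, track)
```

None of these functions knows anything about the tracks field. They rely only on the iteration methods defined for the type.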

More free stuff!

The slidingwindow function does what its name suggests: it creates a Vector of Vectors by sliding a window along a collection.

titles = map(bk -> bk.title, rl)
slidingwindow(titles)
3-element Vector{SubArray{String, 1, Vector{String}, Tuple{UnitRange{Int64}}, true}}:
 ["Distant Horizons: Digital Evidence and Literary Change", "Enumerations: Data and Literary Study"]
 ["Enumerations: Data and Literary Study", "Can We Be Wrong? The Problem of Textual Evidence in a Time of Data"]
 ["Can We Be Wrong? The Problem of Textual Evidence in a Time of Data", "Quantitative Intertextuality: Analyzing the Markers of Information Reuse"]

It can also work directly on a citable collection.

slidingwindow(rl)
3-element Vector{SubArray{Main.CitablePublication, 1, Vector{Main.CitablePublication}, Tuple{UnitRange{Int64}}, true}}:
 [Ted Underwood, *Distant Horizons: Digital Evidence and Literary Change* (urn:isbn10:022661283X), Andrew Piper, *Enumerations: Data and Literary Study* (urn:isbn10:022656875X)]
 [Andrew Piper, *Enumerations: Data and Literary Study* (urn:isbn10:022656875X), Andrew Piper, *Can We Be Wrong? The Problem of Textual Evidence in a Time of Data* (urn:isbn10:1108922036)]
 [Andrew Piper, *Can We Be Wrong? The Problem of Textual Evidence in a Time of Data* (urn:isbn10:1108922036), Christopher W. Forstall and Walter J. Scheirer, *Quantitative Intertextuality: Analyzing the Markers of Information Reuse* (urn:isbn10:3030234133)]

The partitionvect function partitions a Vector into a series of Vectors of a given size. In contrast to slidingwindow, the elements in the new Vectors do not overlap.

partitionvect can work on any generic Vector.

v = collect(1:10)
partitionvect(v)
5-element Vector{SubArray{Int64, 1, Vector{Int64}, Tuple{UnitRange{Int64}}, true}}:
 [1, 2]
 [3, 4]
 [5, 6]
 [7, 8]
 [9, 10]

It also works on any citable collection.

partitionvect(rl)
2-element Vector{SubArray{Main.CitablePublication, 1, Vector{Main.CitablePublication}, Tuple{UnitRange{Int64}}, true}}:
 [Ted Underwood, *Distant Horizons: Digital Evidence and Literary Change* (urn:isbn10:022661283X), Andrew Piper, *Enumerations: Data and Literary Study* (urn:isbn10:022656875X)]
 [Andrew Piper, *Can We Be Wrong? The Problem of Textual Evidence in a Time of Data* (urn:isbn10:1108922036), Christopher W. Forstall and Walter J. Scheirer, *Quantitative Intertextuality: Analyzing the Markers of Information Reuse* (urn:isbn10:3030234133)]
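
The SubArray element types in the output above suggest that both functions return views into the underlying Vector. The following is a hedged sketch of how such helpers could be written: toywindows and toypartitions are hypothetical reimplementations, not CitableBase's code. The default size of 2 is inferred from the examples above, and the handling of a short final chunk is an assumption that may differ from partitionvect.

```julia
# Hypothetical reimplementations; CitableBase's functions may differ.

# Overlapping windows of size n, each a view into v.
toywindows(v::AbstractVector; n = 2) =
    [view(v, i:(i + n - 1)) for i in 1:(length(v) - n + 1)]

# Non-overlapping chunks of size n; this sketch keeps a short final chunk.
toypartitions(v::AbstractVector; n = 2) =
    [view(v, i:min(i + n - 1, length(v))) for i in 1:n:length(v)]

v = collect(1:5)
windows = toywindows(v)        # [1, 2], [2, 3], [3, 4], [4, 5]
chunks = toypartitions(v)      # [1, 2], [3, 4], [5]

# Base offers a lazy equivalent for the non-overlapping case.
basechunks = collect(Iterators.partition(v, 2))
```

To apply helpers like these to an arbitrary iterable rather than a Vector, one could call collect on it first.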

Recap: citable collections

On this page, we wrapped a citable collection type, the ReadingList, around a Vector of CitableBooks. We made the type identifiable as a citable collection. We implemented filtering of the collection on URN logic with the UrnComparisonTrait, and serialization with the CexTrait. You can test for these traits with boolean functions.

citablecollection(rl)
true
urncomparable(rl)
true
cexserializable(rl)
true

In addition, we made the ReadingList implement Julia's Iterators behavior.