API Documentation

Note

If you're a newcomer to Legolas.jl, please familiarize yourself with the tour before diving into this documentation.

Legolas Schemas

Legolas.SchemaVersion - Type
Legolas.SchemaVersion{name,version}

A type representing a particular version of a Legolas schema. The relevant name (a Symbol) and version (an Integer) are surfaced as type parameters, allowing them to be utilized for dispatch.

For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.

The constructor SchemaVersion{name,version}() will throw an ArgumentError if version is negative.

See also: Legolas.@schema

Legolas.@schema - Macro
@schema "name" Prefix

Declare a Legolas schema with the given name. Types generated by subsequent @version declarations for this schema will be prefixed with Prefix.

For more details and examples, please see Legolas.jl/examples/tour.jl.

Legolas.@version - Macro
@version RecordType begin
    declared_field_expression_1
    declared_field_expression_2
    ⋮
end

@version RecordType > ParentRecordType begin
    declared_field_expression_1
    declared_field_expression_2
    ⋮
end

Given a prior @schema declaration of the form:

@schema "example.name" Name

...the nth version of example.name can be declared in the same module via a @version declaration of the form:

@version NameV$(n) begin
    declared_field_expression_1
    declared_field_expression_2
    ⋮
end

...which generates type definitions for the NameV$(n) type (a Legolas.AbstractRecord subtype) and NameV$(n)SchemaVersion type (an alias of typeof(SchemaVersion("example.name", n))), as well as the necessary definitions to overload relevant Legolas methods with specialized behaviors in accordance with the declared fields.

If the declared schema version has a parent, it should be specified via the optional > ParentRecordType clause. ParentRecordType should refer directly to an existing Legolas-generated record type.

Each declared_field_expression declares a field of the schema version, and is an expression of the form field::F = rhs where:

  • field is the corresponding field's name
  • ::F denotes the field's type constraint (if elided, defaults to ::Any).
  • rhs is the expression which produces field::F (if elided, defaults to field).

Accounting for all of the aforementioned allowed elisions, valid declared_field_expressions include:

  • field::F = rhs
  • field::F (interpreted as field::F = field)
  • field = rhs (interpreted as field::Any = rhs)
  • field (interpreted as field::Any = field)

F is generally a type literal, but may also be an expression of the form (<:T), in which case the declared schema version's generated record type will expose a type parameter (constrained to be a subtype of T) for the given field. For example:

julia> @schema "example.foo" Foo

julia> @version FooV1 begin
           x::Int
           y::(<:Real)
       end

julia> FooV1(x=1, y=2.0)
FooV1{Float64}: (x = 1, y = 2.0)

julia> FooV1{Float32}(x=1, y=2)
FooV1{Float32}: (x = 1, y = 2.0f0)

julia> FooV1(x=1, y="bad")
ERROR: TypeError: in FooV1, in _y_T, expected _y_T<:Real, got Type{String}

This macro will throw a Legolas.SchemaVersionDeclarationError if:

  • The provided RecordType does not follow the $(Prefix)V$(n) format, where Prefix was previously associated with a given schema by a prior @schema declaration.
  • There are no declared field expressions, duplicate fields are declared, or a given declared field expression is invalid.
  • (if a parent is specified) The @version declaration does not comply with its parent's @version declaration, or the parent hasn't yet been declared at all.

Note that this macro expects to be evaluated within top-level scope.

For more details and examples, please see Legolas.jl/examples/tour.jl and the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.

Legolas.@check - Macro
@check expr

Define a constraint for a schema version (e.g. @check x > 0) from a boolean expression. The expr should evaluate to true if the constraint is met or false if the constraint is violated. Multiple constraints may be defined for a schema version. Within a @version declaration, all @check constraints must appear after all of the schema version's field declarations.

For more details and examples, please see Legolas.jl/examples/tour.jl.

Legolas.is_valid_schema_name - Function
Legolas.is_valid_schema_name(x::AbstractString)

Return true if x is a valid schema name, return false otherwise.

Valid schema names are lowercase, alphanumeric, and may contain hyphens or periods.
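
As a rough illustration of the rule above, the following standalone sketch checks a string against the described character set. The helper name looks_like_valid_schema_name is hypothetical, and this is not Legolas's own implementation. How the real function treats an empty string is not stated here, so the sketch simply assumes it is rejected.

```julia
# Hypothetical stand-in for the rule described above; not Legolas's own code.
# Accepts only lowercase ASCII letters, digits, hyphens, and periods.
function looks_like_valid_schema_name(x::AbstractString)
    # Assumption: an empty name is rejected; the documentation does not say.
    isempty(x) && return false
    return all(c -> c in 'a':'z' || c in '0':'9' || c == '-' || c == '.', x)
end

looks_like_valid_schema_name("example.foo-bar2")  # true
looks_like_valid_schema_name("Example.Foo")       # false (uppercase letters)
looks_like_valid_schema_name("example foo")       # false (contains a space)
```

In real code, prefer calling Legolas.is_valid_schema_name directly so that the check stays in sync with the package.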

Legolas.parse_identifier - Function
Legolas.parse_identifier(id::AbstractString)

Given a valid schema version identifier id of the form:

$(names[1])@$(versions[1]) > $(names[2])@$(versions[2]) > ... > $(names[n])@$(versions[n])

return an n element Vector{SchemaVersion} whose ith element is SchemaVersion(names[i], versions[i]).

Throws an ArgumentError if the provided string is not a valid schema version identifier.

For details regarding valid schema version identifiers and their structure, see the "Schema-Related Concepts/Conventions" section of the Legolas.jl documentation.
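
To make the identifier layout concrete, here is a standalone sketch that splits such a string into its name/version segments. The helper split_identifier and the schema names in the usage lines are hypothetical; the sketch returns plain named tuples rather than SchemaVersion instances and may be more permissive than Legolas's actual parser.

```julia
# Hypothetical helper that mirrors the identifier layout described above.
# It returns (name, version) named tuples instead of SchemaVersion instances,
# and it is not Legolas's own parser.
function split_identifier(id::AbstractString)
    # Segments are separated by '>' with optional surrounding whitespace.
    segments = strip.(split(id, '>'))
    return map(segments) do segment
        parts = split(segment, '@')
        length(parts) == 2 || throw(ArgumentError("malformed segment: $segment"))
        version = tryparse(Int, parts[2])
        version === nothing && throw(ArgumentError("malformed version: $segment"))
        (name = String(parts[1]), version = version)
    end
end

result = split_identifier("example.child@2 > example.parent@1")
# result[1] is (name = "example.child", version = 2)
# result[2] is (name = "example.parent", version = 1)
```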

Legolas.identifier - Function
Legolas.identifier(::Legolas.SchemaVersion)

Return this Legolas.SchemaVersion's fully qualified schema version identifier. This string is serialized as the "legolas_schema_qualified" field value in the table metadata of tables written via Legolas.write.

Legolas.schema_provider - Function
Legolas.schema_provider(::SchemaVersion)

Returns a NamedTuple with keys name and version. The name is a Symbol corresponding to the package which defines the schema version, if known; otherwise nothing. Likewise the version is a VersionNumber or nothing.

Legolas.parent - Function
Legolas.parent(sv::Legolas.SchemaVersion)

Return the Legolas.SchemaVersion instance that corresponds to sv's declared parent.

Legolas.declared_fields - Function
Legolas.declared_fields(sv::Legolas.SchemaVersion)

Return a NamedTuple{...,Tuple{Vararg{DataType}}} whose fields take the form:

<name of field declared by `sv`> = <field's type>

If sv has a parent, the returned fields will include declared_fields(parent(sv)).

Legolas.declaration - Function
Legolas.declaration(sv::Legolas.SchemaVersion)

Return a Pair{String,Vector{Legolas.DeclaredFieldInfo}} of the form

schema_version_identifier::String => declared_field_infos::Vector{Legolas.DeclaredFieldInfo}

where DeclaredFieldInfo has the fields:

  • name::Symbol: the declared field's name
  • type::Union{Symbol,Expr}: the declared field's declared type constraint
  • parameterize::Bool: whether or not the declared field is exposed as a parameter
  • statement::Expr: the declared field's full assignment statement (as processed by @version, not necessarily as written)

Note that declaration is primarily intended to be used for interactive discovery purposes, and does not include the contents of declaration(parent(sv)).

Legolas.declared - Function
Legolas.declared(sv::Legolas.SchemaVersion{name,version})

Return true if the schema version name@version has been declared via @version in the current Julia session; return false otherwise.

Legolas.find_violation - Function
Legolas.find_violation(ts::Tables.Schema, sv::Legolas.SchemaVersion)

For each field f::F declared by sv:

  • Define A = Legolas.accepted_field_type(sv, F)
  • If f::T is present in ts, ensure that T <: A or else immediately return f::Symbol => T::DataType.
  • If f isn't present in ts, ensure that Missing <: A or else immediately return f::Symbol => missing::Missing.

Otherwise, return nothing.

To return all violations instead of just the first, use Legolas.find_violations.

See also: Legolas.validate, Legolas.complies_with, Legolas.find_violations.
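
The procedure above can be sketched in plain Julia as follows. This is a standalone illustration, not Legolas's implementation: first_violation is a hypothetical name, named tuples stand in for both the declared fields and the Tables.Schema, and the accepted keyword is an assumed stand-in for Legolas.accepted_field_type that defaults to the identity mapping.

```julia
# Standalone sketch of the procedure described above; not Legolas's code.
# `declared` maps each declared field name to its declared type F, and
# `present` plays the role of the Tables.Schema (field name => type T).
# Assumption: `accepted` stands in for Legolas.accepted_field_type.
function first_violation(present::NamedTuple, declared::NamedTuple; accepted = identity)
    for (f, F) in pairs(declared)
        A = accepted(F)
        if haskey(present, f)
            T = present[f]
            T <: A || return f => T          # present, but with a non-compliant type
        else
            Missing <: A || return f => missing  # absent, and the field is required
        end
    end
    return nothing
end

declared = (x = Int, y = Union{Missing,String})

first_violation((x = Int, y = String), declared)   # nothing
first_violation((x = Float64,), declared)          # :x => Float64
first_violation((y = String,), declared)           # :x => missing
```

Note that a missing y column is not a violation in this sketch, because Missing is a subtype of y's accepted type.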

Legolas.find_violations - Function
Legolas.find_violations(ts::Tables.Schema, sv::Legolas.SchemaVersion)

Return a Vector{Pair{Symbol,Union{Type,Missing}}} of all of ts's violations with respect to sv.

This function's notion of "violation" is defined by Legolas.find_violation, which immediately returns the first violation found; prefer to use that function instead of find_violations in situations where you only need to detect any violation instead of all violations.

See also: Legolas.validate, Legolas.complies_with, Legolas.find_violation.

Legolas.accepted_field_type - Function
Legolas.accepted_field_type(sv::Legolas.SchemaVersion, T::Type)

Return the "maximal supertype" of T that is accepted by sv when evaluating a field of type >:T for schematic compliance via Legolas.find_violation; see that function's docstring for an explanation of this function's use in context.

SchemaVersion authors may overload this function to broaden particular type constraints that determine schematic compliance for their SchemaVersion, without needing to broaden the type constraints employed by their SchemaVersion's record type.

Legolas itself defines the following default overloads:

accepted_field_type(::SchemaVersion, T::Type) = T
accepted_field_type(::SchemaVersion, ::Type{Any}) = Any
accepted_field_type(::SchemaVersion, ::Type{UUID}) = Union{UUID,UInt128}
accepted_field_type(::SchemaVersion, ::Type{Symbol}) = Union{Symbol,AbstractString}
accepted_field_type(::SchemaVersion, ::Type{String}) = AbstractString
accepted_field_type(sv::SchemaVersion, ::Type{<:Vector{T}}) where T = AbstractVector{<:(accepted_field_type(sv, T))}
accepted_field_type(::SchemaVersion, ::Type{Vector}) = AbstractVector
accepted_field_type(sv::SchemaVersion, ::Type{Union{T,Missing}}) where {T} = Union{accepted_field_type(sv, T),Missing}

Outside of these default overloads, this function should only be overloaded against specific SchemaVersions that are authored within the same module as the overload definition; to do otherwise constitutes type piracy and should be avoided.


Validating/Writing/Reading Legolas Tables

Legolas.extract_schema_version - Function
Legolas.extract_schema_version(table)

Attempt to extract Arrow metadata from table via Arrow.getmetadata(table).

If Arrow metadata is present and contains "legolas_schema_qualified" => s, return first(parse_identifier(s)).

Otherwise, return nothing.

Legolas.write - Function
Legolas.write(io_or_path, table, sv::SchemaVersion; validate::Bool=true, kwargs...)

Write table to io_or_path, inserting the appropriate legolas_schema_qualified field in the written out Arrow metadata.

If validate is true, Legolas.validate(Tables.schema(table), sv) will be invoked before the table is written out to io_or_path.

Any other provided kwargs are forwarded to an internal invocation of Arrow.write.

Note that io_or_path may be any type that supports Base.write(io_or_path, bytes::Vector{UInt8}).

Legolas.read - Function
Legolas.read(io_or_path; validate::Bool=true)

Read and return an Arrow.Table from io_or_path.

If validate is true, Legolas.read will attempt to extract a Legolas.SchemaVersion from the deserialized Arrow.Table's metadata and use Legolas.validate to verify that the table's Tables.Schema complies with the extracted Legolas.SchemaVersion before returning the table.

Note that io_or_path may be any type that supports Base.read(io_or_path)::Vector{UInt8}.

Legolas.tobuffer - Function
Legolas.tobuffer(args...; kwargs...)

A convenience function that constructs a fresh io::IOBuffer, calls Legolas.write(io, args...; kwargs...), and returns seekstart(io).

Analogous to the Arrow.tobuffer function.


Utilities

Legolas.lift - Function
lift(f, x)

Return f(x) unless x isa Union{Nothing,Missing}, in which case return missing.

This is particularly useful when handling values from Arrow.Table, whose null values may present as either missing or nothing depending on how the table itself was originally constructed.

See also: construct

lift(f)

Returns a curried function, x -> lift(f, x).
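
The two-argument and curried behaviors described above can be mimicked in a few lines of plain Julia. This is a standalone sketch, not Legolas's code; lift_sketch is a hypothetical name chosen to avoid clashing with Legolas.lift.

```julia
# Standalone sketch of the semantics described above; not Legolas's code.
# Apply f to ordinary values, but map both `nothing` and `missing` to `missing`.
lift_sketch(f, x) = f(x)
lift_sketch(f, ::Union{Nothing,Missing}) = missing

# Curried form: returns a one-argument function.
lift_sketch(f) = x -> lift_sketch(f, x)

lift_sketch(sqrt, 4.0)       # 2.0
lift_sketch(sqrt, nothing)   # missing
lift_sketch(sqrt, missing)   # missing

# The curried form is convenient with `map` over a column containing nulls.
map(lift_sketch(uppercase), ["a", missing, "c"])   # ["A", missing, "C"]
```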

Legolas.construct - Function
construct(T::Type, x)

Construct T(x) unless x is of type T, in which case return x itself. Useful in conjunction with the lift function for types that don't have a constructor accepting instances of themselves (e.g. T(::T)).

Examples

julia> using Legolas: construct

julia> construct(Float64, 1)
1.0

julia> Some(Some(1))
Some(Some(1))

julia> construct(Some, Some(1))
Some(1)

Use the curried form when using lift:

julia> using Legolas: lift, construct

julia> lift(Some, Some(1))
Some(Some(1))

julia> lift(construct(Some), Some(1))
Some(1)

Legolas.record_merge - Function
record_merge(record::AbstractRecord; fields_to_merge...)

Return a new AbstractRecord with the same schema version as record, whose fields are computed via Tables.rowmerge(record; fields_to_merge...). The returned record is constructed by passing these merged fields to the AbstractRecord constructor that matches the type of the input record.

Legolas.gather - Function
Legolas.gather(column_name, tables...; extract=((table, idxs) -> view(table, idxs, :)))

Gather rows from tables into a unified cross-table index along column_name. Returns a Dict whose keys are the unique values of column_name across tables, and whose values are tuples of the form:

(rows_matching_key_in_table_1, rows_matching_key_in_table_2, ...)

The provided extract function is used to extract rows from each table; it takes as input a table and a Vector{Int} of row indices, and returns the corresponding subtable. The default definition is sufficient for DataFrames tables.

Note that this function may internally call Tables.columns on each input table, so it may be slower and/or require more memory if any(!Tables.columnaccess, tables).

Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.

Legolas.locations - Function
locations(collections::Tuple)

Return a Dict whose keys are the set of all elements across all provided collections, and whose values are the indices that locate each corresponding element across all provided collections.

Specifically, locations(collections)[k][i] will return a Vector{Int} whose elements are the index locations of k in collections[i]. If !(k in collections[i]), this Vector{Int} will be empty.

For example:

julia> Legolas.locations((['a', 'b', 'c', 'f', 'b'],
                          ['d', 'c', 'e', 'b'],
                          ['f', 'a', 'f']))
Dict{Char, Tuple{Vector{Int64}, Vector{Int64}, Vector{Int64}}} with 6 entries:
  'f' => ([4], [], [1, 3])
  'a' => ([1], [], [2])
  'c' => ([3], [2], [])
  'd' => ([], [1], [])
  'e' => ([], [3], [])
  'b' => ([2, 5], [4], [])

This function is useful as a building block for higher-level tabular operations that require indexing/grouping along specific sets of elements.

Legolas.materialize - Function
Legolas.materialize(table)

Return a fully deserialized copy of table.

This function is useful when table has built-in deserialize-on-access or conversion-on-access behavior (like Arrow.Table) and you'd like to pay such access costs upfront before repeatedly accessing the table.

Note that we intend to eventually migrate this function from Legolas.jl to a more appropriate package.
