StableHashTraits

Use Case and Design Rationale

StableHashTraits is designed to be used in cases where there is an object you wish to serialize in a content-addressed cache. How and when objects pass the same input to a hashing algorithm is meant to be predictable and well defined, so that the user can reliably define methods of transformer to modify this behavior.

What gets hashed?

When you call stable_hash(x; version=4), StableHashTraits hashes both the value x and its type T. Rather than hashing the type T itself directly, in most cases instead StructTypes.StructType(T) is hashed, using StructTypes.jl. For example, since the "StructType" of Float64 and Float32 are both NumberType, when hashing Float64 and Float32 values, value and NumberType are hashed. This provides a simple trait-based system that doesn't need to rely on internal details. See below for more details.

You can customize how the value is hashed using StableHashTraits.transformer, and how its type is hashed using StableHashTraits.transform_type. If you need to customize either of these functions for a type that you don't own, you can use a @context to avoid type piracy.

`StructType.DataType`

StructType.DataType denotes a type that is some kind of "record"; i.e. its content is defined by the fields (getfield(f) for f in fieldnames(T)) of the type. Since it is the default, it is used to hash most types.

To hash the value, each field value (getfield(f) for f in fieldnames(T)) is hashed.

If StructType(T) <: StructTypes.UnorderedStruct (the default), the field values are first sorted by the lexographic order of the field names.

The type of a data type is hashed using string(nameof(T)), the fieldnames(T), (sorting them for UnorderedStruct), along with a hash of the type of each element of fieldtypes(T) according to their StructType.

No type parameters are hashed by default. To hash these you need to specialize on StableHashTraits.transform_type for your struct. Note that because fieldtypes(T) is hashed, you don't need to do this unless your type parameters are not used in the specification of your field types.

`StructType.ArrayType`

ArrayType is used when hashing a sequence of values.

To hash the value, each element of an array type is hashed using iterate. If the object isa AbstractArray, the size(x) of the object is also hashed.

If StableHashTraits.is_ordered returns false the elements are first sorted according to StableHashTraits.hash_sort_by.

To hash the type, the string "StructTypes.ArrayType" is hashed (meaning that the kind of array won't matter to the hash value), and the type of the elype is hashed, according to its StructType. If the type <: AbstractArray, the ndims(T) is hashed.

`StructTypes.DictType`

To hash the value, each key-value pair of a dict type is hashed, as returned by StructType.keyvaluepairs(x).

If StableHashTraits.is_ordered returns false (which is the default return value) the pairs are first sorted according their keys using StableHashTraits.hash_sort_by.

To hash the type, the string "StructTypes.DictType" is hashed (meaning that the kind of dictionary won't matter), and the type of the keytype and valtype is hashed, according to its StructType.

`AbstractRange`

AbstractRange constitutes an exception to the rule that we use StructType: for efficient hashing, ranges are treated as another first-class container object, separate from array types.

The value is hashed as (first(x), step(x), last(x)).

The type is hashed as "Base.AbstractRange" along with the type of the eltype, according to its StructType. Thus, the type of range doesn't matter (just that it is a range).

`StructTypes{Number/String/Bool}Type`

To hash the value, the result of Base.writeing the object is hashed.

To hash the type, the value of string("StructType.", nameof_string(StructType(T)))) is used (c.f. StableHashTraits.nameof_string for details). Note that this means the type of the value itself is not being hashed, rather a string related to its struct type.

`StructType.CustomStruct`

For any StructType.CustomStruct, the object is first StructType.lowered and the result is hashed according to the lowered StructType.

`missing` and `nothing`

There is no value hashed for missing or nothing; the type is hashed as the string "Base.Missing" and "Base.Nothing" respectively. Note in particular the string "Base.Missing" does not have the same hash as missing, since the former would have its struct type hashed.

`StructType.{Null/Singleton}Type`

Null and Singleton types are hashed solely according to their type (no value is hashed)

Their types is hashed by StableHashTraits.nameof_string This means the module of the type does not matter: the module of a type is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a type.

[!NOTE] If you wish to disambiguate functions or types that have the same name but that come from different modules you can overload StableHashTraits.transform_type for those functions. If you want to include the module name for a broad set of types, rather than explicitly specifying a module name for each type, you may want to consider calling StableHashTraits.module_nameof_string in the body of your transform_type method. This can avoid a number of footguns when including the module names: for example, module_nameof_string renames Core to Base to elide Base julia changes to the location of a functions between these two modules and it renames pluto workspace modules to prevent structs from having a different hash each time the notebook is run.

`Function`

Function values are hashed according to their their fields (getfield(f) for f in fieldnames(f)) as per StructType.UnorderedStruct; functions can have fields when they are curried (e.g. ==(2)), and so, for this reason, the fields are included in the hash by default.

The type of the function is hashed according to its StableHashTraits.nameof_string, therefore excluding its module. The exact module of a function is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a function.

`Type`

When hashing a type as a value (e.g. stable_hash(Int; version=4)) the value of `StableHashTraits.nameof_string is hashed. The exact module of a type is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a type.

Examples

All of the following hash examples follow directly from the definitions above, but may not be so obvious to the reader.

Most of the behaviors described below can be customized/changed by using your own hash StableHashTraits.@context, which can be passed as the second argument to stable_hash. StableHashTraits tries to defer to StructTypes for most defaults instead of making more opinionated choices.

The order of NamedTuple pairs does not matter, because NamedTuple has a struct type of UnorderedStruct:

stable_hash((;a=1,b=2); version=4) == stable_hash((;b=2,a=1); version=4)

Two structs with the same fields and name hash equivalently, because the default struct type is UnorderedStruct:

module A
    struct X
        bar::Int
        foo::Float64
    end
end

module B
    struct X
        foo::Float64
        bar::Int
    end
end

stable_hash(B.X(2, 1); version=4) == stable_hash(A.X(1, 2); version=4)

Different array types with the same content hash to the same value:

stable_hash(view([1,2,3], 1:2); version=4) == stable_hash([1,2]; version=4)

Byte-equivalent arrays of all NumberType values will hash to the same value:

stable_hash([0.0, 0.0]; version=4) == stable_hash([0, 0]; version=4)
stable_hash([0.0f0, 0.0f0]; version=4) != stable_hash([0, 0]; version=4) # not byte equivalent

Also, even though the bytes are the same, since the size is hashed, we have:

stable_hash([0.0f0, 0.0f0]; version=4) != stable_hash([0]; version=4)

If the eltype has a different StructType, no collision will occur:

stable_hash(Any[0.0, 0.0]; version=4) != stable_hash([0, 0]; version=4)

Even if the mathematical values are the same, if the bytes are not the same no collision will occur:

stable_hash([1.0, 2.0]; version=4) != stable_hash([1, 2]; version=4)

Two types with the same name but different type parameters will hash the same (unless you define a transform_type_value method for your type to include those type parameters in its return value):

struct MyType{T} end
stable_hash(MyType{:a}) == stable_hash(MyType{:b}) # true

Numerical changes will, of course, change the hash. One way this can catch you off guard are some differences in StaticArray outputs between julia versions:

julia> using StaticArrays, StableHashTraits;

julia> begin
        rotmatrix2d(a) = @SMatrix [cos(a) sin(a); -sin(a) cos(a)]
        rotate(a, p) = rotmatrix2d(a) * p
        rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
    end;

In julia 1.9.4:


julia> bytes2hex(stable_hash(rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006)); version=4))
"4ccdc172688dd2b5cd50ba81071a19217c3efe2e3b625e571542004c8f96c797"

julia> rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
2-element SVector{2, Float64} with indices SOneTo(2):
  7.419375817039376e-17
 -0.5953242152248626

In julia 1.6.7

julia> bytes2hex(stable_hash(rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006)); version=4))
"3b8d998f3106c05f8b74ee710267775d0d0ce0e6780c1256f4926d3b7dcddf9e"

julia> rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
2-element SVector{2, Float64} with indices SOneTo(2):
  5.551115123125783e-17
 -0.5953242152248626

Basic Customization

To customize hashing, you typically want to simply override a method of StableHashTraits.transformer. This should return a function wrapped in a StableHashTraits.Transformer object that will be applied to an object and its result is the actual value that gets hashed.

using StableHashTraits
using StableHashTraits: Transformer
using Dates

struct MyType
   data::Vector{UInt8}
   metadata::Dict{Symbol, Any}
end
# ignore `metadata`, `data` will be hashed using fallbacks for `AbstractArray` type
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(pick_fields(:data))
# NOTE: `pick_fields` is a helper function implemented by `StableHashTraits`
# it creates a named tuple with the given object fields; in the above code it is used
# in its curried form e.g. `pick_fields(:data)` is the same as `x -> pick_fields(x, :data)`
a = MyType(read("myfile.txt"), Dict{Symbol, Any}(:read => Dates.now()))
b = MyType(read("myfile.txt"), Dict{Symbol, Any}(:read => Dates.now()))
stable_hash(a; version=4) == stable_hash(b; version=4) # true

!!! note "Use pick_fields and omit_fields It is recommended you use pick_fields or omit_fields when you simply want to select some subset of fields to be hashed, as they allow for more optimized hashes than directly returning a named tuple of a field subset.

StableHashTraits.Transformer takes a second positional argument which is the StructType you wish to use on the transformed return value. By default StructType is applied to the result to determine this automatically, but in some cases it can be useful to modify this trait by passing a second argument.

Avoid StackOverflow using `TransformIdentity`

Make sure you don't return the object itself as an element of some collection. It can be tempting to do e.g. (mymetadata(x), x) as a return value for transformer's function. Instead you can use StableHashTraits.TransformIdentity to make sure this won't lead to an infinite regress: e.g. (mymetadata(x), TransformIdentity(x)). Using StableHashTraits.TransformIdentity will cause x's transformed result to be x itself, thereby avoiding the infinite regress.

transformer customizes how the content of your object is hashed. The hash of the type is customized separately using StableHashTraits.transform_type.

Using Contexts

Because not every package knows about either StableHashTraits or StructTypes, there may be types you don't own that you want to customize the hash of. In this scenario you should define a context object that you pass as the second argument to stable_hash and define a method of transformer that dispatches on this context object as its second argument.

A context is simply an arbitrary object that defines a method for StableHashTraits.parent_context. By default the context to stable_hash is HashVersion{version}(). Because of parent_context, contexts can be stacked, and a HashVersion should be at the bottom of the stack. There are fallback methods for transformer that look at the value implemented by the parent context. In this way you need only define methods for the types you want to customize.

For example, this customization makes the ordering of named tuple keys affect the hash value.

julia> begin
        struct NamedTuplesOrdered{T}
            parent::T
        end
        StableHashTraits.parent_context(x::NamedTuplesOrdered) = x.parent
        function transformer(::Type{<:NamedTuple}, ::NamedTuplesOrdered)
            Transformer(identity, StructTypes.OrderedStruct())
        end
        context = NamedTuplesOrdered(HashVersion{4}())
       end;

julia> stable_hash((; a=1:2, b=1:2), context) != stable_hash((; b=1:2, a=1:2), context)
true

Without this context, the keys are first sorted because StructType(NamedTuple) isa StructType.UnorderedStruct.

As a short hand you can use StableHashTraits.@context for creating simple contexts, like the one above. Like so:

julia> begin
        @context NamedTuplesOrdered
        function transformer(::Type{<:NamedTuple}, ::NamedTuplesOrdered)
            Transformer(identity, StructTypes.OrderedStruct())
        end
        context = NamedTuplesOrdered(HashVersion{4}())
       end;

julia> stable_hash((; a=1:2, b=1:2), context) != stable_hash((; b=1:2, a=1:2), context)
true

There are several useful, predefined contexts available in StableHashTraits that can be used to change how hashing works:

Optimizing Custom Transformers

By default, stable hash traits follows a safe, but slower code path for arbitrary functions passed to Transformer. However, in some cases it can use a faster code path, given that some assumptions about the types returned by the transforming function are maintained. The identity function and the functions returned by pick_fields and omit_fields use this faster code path by default.

In particular, a keyword argument to Transformer, hoist_type can be set to true to use this faster code path. Functions that implement StableHashTraits.hoist_type(::typeof(fn)) can return true to signal that they are safe when using this faster code path. This function is called to determine the default value of hoist_type when this function is passed as the first argument to Transformer.

The exact criteria for when this code path is unsafe are somewhat subtle, and will be described below, along with some examples. However, you can always safely use hoist_type=true either when the function always returns the same type (e.g. it transforms all inputs into String values) OR when the following three criteria are met:

The type that transformer dispatches on is concrete.
The type that transformer dispatches on contains no abstract types: that is, for any contained array type or dict types, their eltypes are concrete and for any contained data type, its fieldtypes are concrete.
The function you pass to Transformer is type stable

When set to true, hoist_type=true hashes the type of the pre-transformed object, prior to looping over the contents of the post-transformed object: its fields (for a data type) or the elements (for an array or dict type). Once the contents of the object are being looped over, hashing the type of each concrete-typed element or field is skipped. For example, when hashing an Array{Int} the Int will only be hashed once, not once for every element.

When hoist_type=false (the default for most functions) the type of the return value from transformer is hashed alongside the transformed value. This can be a lot slower, but ensures the that no unexpected hash collisions will occur.

More precisely, this hoisting is only valid when one of these two criteria are satisfied:

the pre-transformed type is sufficient to disambiguate the hash of the downstream object values absent their object types.
the post-transformed types do not change unless the caller inferred type of the input it depends on changes

The latter criteria is more stringent than type stability. A function input could have a caller-inferred type of Any, be type stable, and return either Char or an Int depending on the value of its input. Such a function would violate this second criteria.

Examples

When is the pre-transformed type sufficient to disambiguate hashed values? First, many type-unstable functions should be considered unable to meet this criteria. The only time they are certain to be safe is when the values disambiguate the hash regardless of the type.

For instance, the assmuptions of hoist_type=true would be violated by the function x -> x < 0 : Char(0) ? Int32(0) because the bits that are hashed downstream are identical, even though the type information that should be hashed with them are different: Char is a StringType and Ing32 is a NumberType.

In contrast, x -> x < 0 : Char(1) : Int32(2) is safe to use with hoist_type=true, because although the type changes, the byte sequence of the value never overlaps, regardless of the type returned. If you are confident the bit sequence will be unique in this way, you could safely use hoist_type=true even though the transformer is type unstable.

Beware! When a type unstable function will be unsafe for a given type depends on the context, because users can define their own type_transform that can lead to more type details being important to the hashed value. For instance, in the default context x -> x < 0 : Int32(0) ? UInt32(0) would be considered safe, since both Int32 and UInt32 have the same type for purposes of hashing (NumberType), but if the user were to write a custom transform_type in a HashExactNumberType context, now this function is no longer safe to use with hoist_type=true.

These examples hopefully help to clarify when type-unstable functions can lead to unexpected hash collisions with hoist_type=true. However type stable functions can also lead to invalid hashes with hoist_type=true. For example:

struct MyType
    a::Any
end
# ignore `metadata`, `data` will be hashed using fallbacks for `AbstractArray` type
# DO NOT DO IT THIS WAY; YOUR CODE WILL BE BUGGY!!!
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(x -> (;x.a); hoist_type=true)
stable_hash(MyType(missing)) == stable_hash(MyType(nothing)) ## OOPS, we wanted these to be different but this returns `true`

Setting the flag to hoist_type=true here causes the type of the missing and nothing to be hoisted, since the field a is a concrete type in the return value of Transformer's function. Since only the type of these two values is hashed, their hashes now collide. If the field of the pre-transformed type a was concrete, there wouldn't be any problem, because its type would be hashed, and that would include either the Missing or Nothing in the type hash.

For this reason, it is better to use pick_fields and omit_fields to select or remove fields from a type you want to transform.

struct MyType
    a::Any
end
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(x -> (;x.a); hoist_type=false)
stable_hash(MyType(missing)) != stable_hash(MyType(nothing)) # true

# NOTE: pick_fields sets `hoist_type=true` by default; it is set here explicitly to
# clearly illustrate
# what is happening.
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(pick_fields(:a); hoist_type=true)
stable_hash(MyType(missing)) != stable_hash(MyType(nothing)) # this is also `true` and is faster than the previous implementation

Bottom line: It is not sufficient for the function to be type stable to safely set hoist_type=true. If the return value of your transformer function over its known domain returns multiple distinct concrete types, you can run into problems.

Customizing Type Hashes

Types are hashed by hashing the return value of StableHashTraits.transform_type when hashing an object's type and the return value of StableHashTraits.transform_type_value when hashing a type as a value (e.g. stable_hash(Int)). The docs for these functions provide several examples of their usage.

In addition there is some structure of the type that is always hashed:

fieldtypes(T) of any StructType.DataType (e.g. StructType.Struct)
eltype(T) of any StructType.ArrayType or StructType.DictType or AbstractRange

These get added internally so as to ensure that the type-hoisting described above can rely on eltypes and fieldtypes storing all downstream children's concrete types.