StableHashTraits
Use Case and Design Rationale
StableHashTraits is designed to be used in cases where there is an object you wish to serialize in a content-addressed cache. How and when objects pass the same input to a hashing algorithm is meant to be predictable and well defined, so that the user can reliably define methods of transformer
to modify this behavior.
What gets hashed?
When you call stable_hash(x; version=4)
, StableHashTraits hashes both the value x
and its type T
. Rather than hashing the type T
itself directly, in most cases instead StructTypes.StructType(T)
is hashed, using StructTypes.jl. For example, since the "StructType" of Float64 and Float32 are both NumberType
, when hashing Float64 and Float32 values, value and NumberType
are hashed. This provides a simple trait-based system that doesn't need to rely on internal details. See below for more details.
You can customize how the value is hashed using StableHashTraits.transformer
, and how its type is hashed using StableHashTraits.transform_type
. If you need to customize either of these functions for a type that you don't own, you can use a @context to avoid type piracy.
StructType.DataType
StructType.DataType
denotes a type that is some kind of "record"; i.e. its content is defined by the fields (getfield(f) for f in fieldnames(T)
) of the type. Since it is the default, it is used to hash most types.
To hash the value, each field value (getfield(f) for f in fieldnames(T)
) is hashed.
If StructType(T) <: StructTypes.UnorderedStruct
(the default), the field values are first sorted by the lexographic order of the field names.
The type of a data type is hashed using string(nameof(T))
, the fieldnames(T)
, (sorting them for UnorderedStruct
), along with a hash of the type of each element of fieldtypes(T)
according to their StructType
.
No type parameters are hashed by default. To hash these you need to specialize on StableHashTraits.transform_type
for your struct. Note that because fieldtypes(T)
is hashed, you don't need to do this unless your type parameters are not used in the specification of your field types.
StructType.ArrayType
ArrayType
is used when hashing a sequence of values.
To hash the value, each element of an array type is hashed using iterate
. If the object isa AbstractArray
, the size(x)
of the object is also hashed.
If StableHashTraits.is_ordered
returns false
the elements are first sort
ed according to StableHashTraits.hash_sort_by
.
To hash the type, the string "StructTypes.ArrayType"
is hashed (meaning that the kind of array won't matter to the hash value), and the type of the elype
is hashed, according to its StructType
. If the type <: AbstractArray
, the ndims(T)
is hashed.
StructTypes.DictType
To hash the value, each key-value pair of a dict type is hashed, as returned by StructType.keyvaluepairs(x)
.
If StableHashTraits.is_ordered
returns false
(which is the default return value) the pairs are first sort
ed according their keys using StableHashTraits.hash_sort_by
.
To hash the type, the string "StructTypes.DictType"
is hashed (meaning that the kind of dictionary won't matter), and the type of the keytype
and valtype
is hashed, according to its StructType
.
AbstractRange
AbstractRange
constitutes an exception to the rule that we use StructType
: for efficient hashing, ranges are treated as another first-class container object, separate from array types.
The value is hashed as (first(x), step(x), last(x))
.
The type is hashed as "Base.AbstractRange"
along with the type of the eltype
, according to its StructType
. Thus, the type of range doesn't matter (just that it is a range).
StructTypes{Number/String/Bool}Type
To hash the value, the result of Base.write
ing the object is hashed.
To hash the type, the value of string("StructType.", nameof_string(StructType(T))))
is used (c.f. StableHashTraits.nameof_string
for details). Note that this means the type of the value itself is not being hashed, rather a string related to its struct type.
StructType.CustomStruct
For any StructType.CustomStruct
, the object is first StructType.lower
ed and the result is hashed according to the lowered StructType
.
missing
and nothing
There is no value hashed for missing
or nothing
; the type is hashed as the string "Base.Missing"
and "Base.Nothing"
respectively. Note in particular the string "Base.Missing"
does not have the same hash as missing
, since the former would have its struct type hashed.
StructType.{Null/Singleton}Type
Null and Singleton types are hashed solely according to their type (no value is hashed)
Their types is hashed by StableHashTraits.nameof_string
This means the module of the type does not matter: the module of a type is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a type.
[!NOTE] If you wish to disambiguate functions or types that have the same name but that come from different modules you can overload
StableHashTraits.transform_type
for those functions. If you want to include the module name for a broad set of types, rather than explicitly specifying a module name for each type, you may want to consider callingStableHashTraits.module_nameof_string
in the body of yourtransform_type
method. This can avoid a number of footguns when including the module names: for example,module_nameof_string
renamesCore
toBase
to elide Base julia changes to the location of a functions between these two modules and it renames pluto workspace modules to prevent structs from having a different hash each time the notebook is run.
Function
Function values are hashed according to their their fields (getfield(f) for f in fieldnames(f)
) as per StructType.UnorderedStruct
; functions can have fields when they are curried (e.g. ==(2)
), and so, for this reason, the fields are included in the hash by default.
The type of the function is hashed according to its StableHashTraits.nameof_string
, therefore excluding its module. The exact module of a function is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a function.
Type
When hashing a type as a value (e.g. stable_hash(Int; version=4)
) the value of `StableHashTraits.nameof_string is hashed. The exact module of a type is often considered an implementation detail, so it is left out to avoid unexpected hash changes from non-breaking releases that change the module of a type.
Examples
All of the following hash examples follow directly from the definitions above, but may not be so obvious to the reader.
Most of the behaviors described below can be customized/changed by using your own hash StableHashTraits.@context
, which can be passed as the second argument to stable_hash
. StableHashTraits tries to defer to StructTypes for most defaults instead of making more opinionated choices.
The order of NamedTuple pairs does not matter, because NamedTuple
has a struct type of UnorderedStruct
:
stable_hash((;a=1,b=2); version=4) == stable_hash((;b=2,a=1); version=4)
Two structs with the same fields and name hash equivalently, because the default struct type is UnorderedStruct
:
module A
struct X
bar::Int
foo::Float64
end
end
module B
struct X
foo::Float64
bar::Int
end
end
stable_hash(B.X(2, 1); version=4) == stable_hash(A.X(1, 2); version=4)
Different array types with the same content hash to the same value:
stable_hash(view([1,2,3], 1:2); version=4) == stable_hash([1,2]; version=4)
Byte-equivalent arrays of all NumberType
values will hash to the same value:
stable_hash([0.0, 0.0]; version=4) == stable_hash([0, 0]; version=4)
stable_hash([0.0f0, 0.0f0]; version=4) != stable_hash([0, 0]; version=4) # not byte equivalent
Also, even though the bytes are the same, since the size is hashed, we have:
stable_hash([0.0f0, 0.0f0]; version=4) != stable_hash([0]; version=4)
If the eltype has a different StructType
, no collision will occur:
stable_hash(Any[0.0, 0.0]; version=4) != stable_hash([0, 0]; version=4)
Even if the mathematical values are the same, if the bytes are not the same no collision will occur:
stable_hash([1.0, 2.0]; version=4) != stable_hash([1, 2]; version=4)
Two types with the same name but different type parameters will hash the same (unless you define a transform_type_value
method for your type to include those type parameters in its return value):
struct MyType{T} end
stable_hash(MyType{:a}) == stable_hash(MyType{:b}) # true
Numerical changes will, of course, change the hash. One way this can catch you off guard are some differences in StaticArray
outputs between julia versions:
julia> using StaticArrays, StableHashTraits;
julia> begin
rotmatrix2d(a) = @SMatrix [cos(a) sin(a); -sin(a) cos(a)]
rotate(a, p) = rotmatrix2d(a) * p
rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
end;
In julia 1.9.4:
julia> bytes2hex(stable_hash(rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006)); version=4))
"4ccdc172688dd2b5cd50ba81071a19217c3efe2e3b625e571542004c8f96c797"
julia> rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
2-element SVector{2, Float64} with indices SOneTo(2):
7.419375817039376e-17
-0.5953242152248626
In julia 1.6.7
julia> bytes2hex(stable_hash(rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006)); version=4))
"3b8d998f3106c05f8b74ee710267775d0d0ce0e6780c1256f4926d3b7dcddf9e"
julia> rotate((pi / 4), SVector{2}(0.42095778959006, -0.42095778959006))
2-element SVector{2, Float64} with indices SOneTo(2):
5.551115123125783e-17
-0.5953242152248626
Basic Customization
To customize hashing, you typically want to simply override a method of StableHashTraits.transformer
. This should return a function wrapped in a StableHashTraits.Transformer
object that will be applied to an object and its result is the actual value that gets hashed.
using StableHashTraits
using StableHashTraits: Transformer
using Dates
struct MyType
data::Vector{UInt8}
metadata::Dict{Symbol, Any}
end
# ignore `metadata`, `data` will be hashed using fallbacks for `AbstractArray` type
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(pick_fields(:data))
# NOTE: `pick_fields` is a helper function implemented by `StableHashTraits`
# it creates a named tuple with the given object fields; in the above code it is used
# in its curried form e.g. `pick_fields(:data)` is the same as `x -> pick_fields(x, :data)`
a = MyType(read("myfile.txt"), Dict{Symbol, Any}(:read => Dates.now()))
b = MyType(read("myfile.txt"), Dict{Symbol, Any}(:read => Dates.now()))
stable_hash(a; version=4) == stable_hash(b; version=4) # true
!!! note "Use pick_fields
and omit_fields
It is recommended you use pick_fields
or omit_fields
when you simply want to select some subset of fields to be hashed, as they allow for more optimized hashes than directly returning a named tuple of a field subset.
StableHashTraits.Transformer
takes a second positional argument which is the StructType
you wish to use on the transformed return value. By default StructType
is applied to the result to determine this automatically, but in some cases it can be useful to modify this trait by passing a second argument.
Make sure you don't return the object itself as an element of some collection. It can be tempting to do e.g. (mymetadata(x), x)
as a return value for transformer
's function. Instead you can use StableHashTraits.TransformIdentity
to make sure this won't lead to an infinite regress: e.g. (mymetadata(x), TransformIdentity(x))
. Using StableHashTraits.TransformIdentity
will cause x
's transformed result to be x
itself, thereby avoiding the infinite regress.
transformer
customizes how the content of your object is hashed. The hash of the type is customized separately using StableHashTraits.transform_type
.
Using Contexts
Because not every package knows about either StableHashTraits
or StructTypes
, there may be types you don't own that you want to customize the hash of. In this scenario you should define a context object that you pass as the second argument to stable_hash
and define a method of transformer
that dispatches on this context object as its second argument.
A context is simply an arbitrary object that defines a method for StableHashTraits.parent_context
. By default the context to stable_hash
is HashVersion{version}()
. Because of parent_context
, contexts can be stacked, and a HashVersion
should be at the bottom of the stack. There are fallback methods for transformer
that look at the value implemented by the parent context. In this way you need only define methods for the types you want to customize.
For example, this customization makes the ordering of named tuple keys affect the hash value.
julia> begin
struct NamedTuplesOrdered{T}
parent::T
end
StableHashTraits.parent_context(x::NamedTuplesOrdered) = x.parent
function transformer(::Type{<:NamedTuple}, ::NamedTuplesOrdered)
Transformer(identity, StructTypes.OrderedStruct())
end
context = NamedTuplesOrdered(HashVersion{4}())
end;
julia> stable_hash((; a=1:2, b=1:2), context) != stable_hash((; b=1:2, a=1:2), context)
true
Without this context, the keys are first sorted because StructType(NamedTuple) isa StructType.UnorderedStruct
.
As a short hand you can use StableHashTraits.@context
for creating simple contexts, like the one above. Like so:
julia> begin
@context NamedTuplesOrdered
function transformer(::Type{<:NamedTuple}, ::NamedTuplesOrdered)
Transformer(identity, StructTypes.OrderedStruct())
end
context = NamedTuplesOrdered(HashVersion{4}())
end;
julia> stable_hash((; a=1:2, b=1:2), context) != stable_hash((; b=1:2, a=1:2), context)
true
There are several useful, predefined contexts available in StableHashTraits
that can be used to change how hashing works:
Optimizing Custom Transformers
By default, stable hash traits follows a safe, but slower code path for arbitrary functions passed to Transformer
. However, in some cases it can use a faster code path, given that some assumptions about the types returned by the transforming function are maintained. The identity
function and the functions returned by pick_fields
and omit_fields
use this faster code path by default.
In particular, a keyword argument to Transformer
, hoist_type
can be set to true to use this faster code path. Functions that implement StableHashTraits.hoist_type(::typeof(fn))
can return true
to signal that they are safe when using this faster code path. This function is called to determine the default value of hoist_type
when this function is passed as the first argument to Transformer
.
The exact criteria for when this code path is unsafe are somewhat subtle, and will be described below, along with some examples. However, you can always safely use hoist_type=true
either when the function always returns the same type (e.g. it transforms all inputs into String
values) OR when the following three criteria are met:
- The type that
transformer
dispatches on is concrete. - The type that
transformer
dispatches on contains no abstract types: that is, for any contained array type or dict types, their eltypes are concrete and for any contained data type, itsfieldtypes
are concrete. - The function you pass to
Transformer
is type stable
When set to true, hoist_type=true
hashes the type of the pre-transformed object, prior to looping over the contents of the post-transformed object: its fields (for a data type) or the elements (for an array or dict type). Once the contents of the object are being looped over, hashing the type of each concrete-typed element or field is skipped. For example, when hashing an Array{Int}
the Int
will only be hashed once, not once for every element.
When hoist_type=false
(the default for most functions) the type of the return value from transformer
is hashed alongside the transformed value. This can be a lot slower, but ensures the that no unexpected hash collisions will occur.
More precisely, this hoisting is only valid when one of these two criteria are satisfied:
- the pre-transformed type is sufficient to disambiguate the hash of the downstream object values absent their object types.
- the post-transformed types do not change unless the caller inferred type of the input it depends on changes
The latter criteria is more stringent than type stability. A function input could have a caller-inferred type of Any
, be type stable, and return either Char
or an Int
depending on the value of its input. Such a function would violate this second criteria.
Examples
When is the pre-transformed type sufficient to disambiguate hashed values? First, many type-unstable functions should be considered unable to meet this criteria. The only time they are certain to be safe is when the values disambiguate the hash regardless of the type.
For instance, the assmuptions of hoist_type=true
would be violated by the function x -> x < 0 : Char(0) ? Int32(0)
because the bits that are hashed downstream are identical, even though the type information that should be hashed with them are different: Char
is a StringType
and Ing32
is a NumberType
.
In contrast, x -> x < 0 : Char(1) : Int32(2)
is safe to use with hoist_type=true
, because although the type changes, the byte sequence of the value never overlaps, regardless of the type returned. If you are confident the bit sequence will be unique in this way, you could safely use hoist_type=true
even though the transformer is type unstable.
Beware! When a type unstable function will be unsafe for a given type depends on the context, because users can define their own type_transform
that can lead to more type details being important to the hashed value. For instance, in the default context x -> x < 0 : Int32(0) ? UInt32(0)
would be considered safe, since both Int32
and UInt32
have the same type for purposes of hashing (NumberType
), but if the user were to write a custom transform_type
in a HashExactNumberType
context, now this function is no longer safe to use with hoist_type=true
.
These examples hopefully help to clarify when type-unstable functions can lead to unexpected hash collisions with hoist_type=true
. However type stable functions can also lead to invalid hashes with hoist_type=true
. For example:
struct MyType
a::Any
end
# ignore `metadata`, `data` will be hashed using fallbacks for `AbstractArray` type
# DO NOT DO IT THIS WAY; YOUR CODE WILL BE BUGGY!!!
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(x -> (;x.a); hoist_type=true)
stable_hash(MyType(missing)) == stable_hash(MyType(nothing)) ## OOPS, we wanted these to be different but this returns `true`
Setting the flag to hoist_type=true
here causes the type of the missing
and nothing
to be hoisted, since the field a
is a concrete type in the return value of Transformer
's function. Since only the type of these two values is hashed, their hashes now collide. If the field of the pre-transformed type a
was concrete, there wouldn't be any problem, because its type would be hashed, and that would include either the Missing
or Nothing
in the type hash.
For this reason, it is better to use pick_fields
and omit_fields
to select or remove fields from a type you want to transform.
struct MyType
a::Any
end
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(x -> (;x.a); hoist_type=false)
stable_hash(MyType(missing)) != stable_hash(MyType(nothing)) # true
# NOTE: pick_fields sets `hoist_type=true` by default; it is set here explicitly to
# clearly illustrate
# what is happening.
StableHashTraits.transformer(::Type{<:MyType}) = Transformer(pick_fields(:a); hoist_type=true)
stable_hash(MyType(missing)) != stable_hash(MyType(nothing)) # this is also `true` and is faster than the previous implementation
Bottom line: It is not sufficient for the function to be type stable to safely set hoist_type=true
. If the return value of your transformer function over its known domain returns multiple distinct concrete types, you can run into problems.
Customizing Type Hashes
Types are hashed by hashing the return value of StableHashTraits.transform_type
when hashing an object's type and the return value of StableHashTraits.transform_type_value
when hashing a type as a value (e.g. stable_hash(Int)
). The docs for these functions provide several examples of their usage.
In addition there is some structure of the type that is always hashed:
fieldtypes(T)
of anyStructType.DataType
(e.g. StructType.Struct)eltype(T)
of anyStructType.ArrayType
orStructType.DictType
orAbstractRange
These get added internally so as to ensure that the type-hoisting described above can rely on eltypes and fieldtypes storing all downstream children's concrete types.