API Documentation
Queries
KeywordSearch.Query — Type
Query(text::AbstractString; replacements=AUTOMATIC_REPLACEMENTS) <: AbstractQuery
A query to search for an exact match of a string, with one field:
text::String
The text is automatically processed by applying the replacements from replacements (which defaults to AUTOMATIC_REPLACEMENTS).
KeywordSearch.FuzzyQuery — Type
FuzzyQuery(text, dist, threshold; replacements=AUTOMATIC_REPLACEMENTS) <: AbstractQuery
A query to search for a fuzzy match of a string, with three fields:
text::String: the text to match
dist::D: the distance measure to use; defaults to DamerauLevenshtein()
threshold::T: the maximum threshold allowed for a match; defaults to 2.
The text is automatically processed by applying the replacements from replacements (which defaults to AUTOMATIC_REPLACEMENTS).
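To make the dist and threshold fields more concrete, here is a minimal, self-contained sketch of an edit distance that counts insertions, deletions, substitutions, and adjacent transpositions (the restricted, "optimal string alignment" variant of Damerau-Levenshtein). It is not the implementation KeywordSearch uses: osa_distance and within_threshold are hypothetical names introduced only for illustration, the sketch compares whole strings rather than searching inside a document, and the package's DamerauLevenshtein() may give different values on some inputs.

```julia
# Illustrative sketch only: restricted Damerau-Levenshtein (optimal string
# alignment) distance. `osa_distance` is a hypothetical helper and is not
# part of KeywordSearch.
function osa_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)      # work on characters, not bytes
    m, n = length(s), length(t)
    d = zeros(Int, m + 1, n + 1)       # d[i+1, j+1] = distance(s[1:i], t[1:j])
    d[:, 1] .= 0:m                     # delete every character of the prefix of s
    d[1, :] .= 0:n                     # insert every character of the prefix of t
    for i in 1:m, j in 1:n
        cost = s[i] == t[j] ? 0 : 1
        d[i + 1, j + 1] = min(d[i, j + 1] + 1,   # deletion
                              d[i + 1, j] + 1,   # insertion
                              d[i, j] + cost)    # substitution (or match)
        if i > 1 && j > 1 && s[i] == t[j - 1] && s[i - 1] == t[j]
            # adjacent transposition, e.g. "ab" -> "ba"
            d[i + 1, j + 1] = min(d[i + 1, j + 1], d[i - 1, j - 1] + 1)
        end
    end
    return d[m + 1, n + 1]
end

# Hypothetical helper: does the distance stay within the allowed threshold?
within_threshold(a, b; threshold = 2) = osa_distance(a, b) <= threshold

osa_distance("one", "One")            # 1 (comparison is case-sensitive)
osa_distance("macaque", "macacue")    # 1 (one substitution)
osa_distance("ab", "ba")              # 1 (one transposition)
within_threshold("macaque", "monkey") # false
```

With a threshold of 2, as in the default above, a candidate counts as a match only if at most two such edits separate it from the query text.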
KeywordSearch.NamedQuery — Type
NamedQuery(metadata::Union{String, NamedTuple}, query::AbstractQuery)
Creates a NamedQuery that stores a metadata field holding information about the query. When used with match, returns a NamedMatch, which carries the metadata of the NamedQuery as well as the metadata of the object which was matched.
Example
julia> document_1 = Document("One", (; document_name = "a"))
Document with text "One ". Metadata: (document_name = "a",)
julia> document_2 = Document("Two", (; document_name = "b"))
Document with text "Two ". Metadata: (document_name = "b",)
julia> corpus = Corpus([document_1, document_2], (; corpus_name = "Numbers"))
Corpus with 2 documents, each with metadata keys: (:document_name,)
Corpus metadata: (corpus_name = "Numbers",)
julia> query = NamedQuery(FuzzyQuery("one"), "find one")
NamedQuery
├─ (query_name = "find one",)
└─ FuzzyQuery("one", DamerauLevenshtein{Nothing}(nothing), 2)
julia> m = match(query, corpus)
NamedMatch
├─ (query_name = "find one", corpus_name = "Numbers", document_name = "a")
└─ QueryMatch with distance 1 at indices 1:3.
├─ FuzzyQuery("one", DamerauLevenshtein{Nothing}(nothing), 2)
└─ Document with text "One ". Metadata: (document_name = "a",)
julia> m.metadata
(query_name = "find one", corpus_name = "Numbers", document_name = "a")
Documents
KeywordSearch.Document — Type
Document(text::AbstractString, metadata::T; replacements=AUTOMATIC_REPLACEMENTS) where {T<:NamedTuple}
Represents a single string document. This object has two fields,
text::String
metadata::T
The text is automatically processed by applying the replacements from replacements, which defaults to AUTOMATIC_REPLACEMENTS, and adding a space to the start and end of the document (if one doesn't exist already).
Base.match — Method
Base.match(query::AbstractQuery, document::Document)
Looks for a match for query in document. Returns either nothing if no match is found, or a QueryMatch object.
KeywordSearch.match_all — Method
match_all(query::AbstractQuery, document::Document)
Looks for all matches for query in the document. Returns a Vector of QueryMatch objects corresponding to all of the matches found.
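The following is a short usage sketch of match and match_all on a single document, assuming the KeywordSearch package is installed. The document text and metadata are made up for illustration, and only calls and fields documented on this page are used.

```julia
using KeywordSearch

# Made-up document for illustration.
doc = Document("one fish, two fish", (; document_name = "fish"))

# `match` returns `nothing` when the query is not found, so check before
# accessing any fields of the result.
m = match(Query("bird"), doc)
if m === nothing
    println("no match for \"bird\"")
end

# `match_all` returns a Vector of QueryMatch objects; each one exposes the
# documented `indices` and `distance` fields.
for hit in match_all(Query("fish"), doc)
    println("matched at indices ", hit.indices, " with distance ", hit.distance)
end
```

Checking for nothing before touching the result avoids a field-access error on the no-match path.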
Corpuses
KeywordSearch.Corpus — Type
Corpus{T<:NamedTuple,D}
A corpus is a collection of Documents, along with some metadata. It has two fields,
documents::Vector{Document{D}}
metadata::T
Note each Document in a Corpus must have metadata of the same type.
Base.match — Method
Base.match(query::AbstractQuery, corpus::Corpus)
Looks for a match for query in any Document in corpus. Returns either nothing if no match is found in any Document, or a QueryMatch object.
KeywordSearch.match_all — Method
match_all(query::AbstractQuery, corpus::Corpus)
Looks for all matches for query from all documents in corpus. Returns a Vector of QueryMatch objects corresponding to all of the matches found, across all documents.
Matches
KeywordSearch.QueryMatch — Type
QueryMatch{Q<:AbstractQuery,Doc<:Document,D,I}
Represents a match for an AbstractQuery, with four fields:
query::Q: the query itself
document::Doc: the Document that was matched
distance::D: the distance of the match
indices::I: the indices of where in the document the match occurred.
KeywordSearch.NamedMatch — Type
NamedMatch{T,M<:QueryMatch}
This object has two fields,
match::M, which holds a QueryMatch object corresponding to the match, and
metadata::T, which holds a NamedTuple of metadata,
and is created by the method match(query::NamedQuery, obj).
NamedMatch satisfies the Tables.jl AbstractRow interface. This means that a vector of NamedMatch objects is a valid Tables.jl-compatible table.
Example
julia> document_1 = Document("one", (; document_name = "a"))
Document with text "one ". Metadata: (document_name = "a",)
julia> document_2 = Document("Two but there's also a one here.", (; document_name = "b"))
Document starting with "Two but there's…". Metadata: (document_name = "b",)
julia> query = NamedQuery(Query("one"), "find one")
NamedQuery
├─ (query_name = "find one",)
└─ Query("one")
julia> matches = match_all(query, Corpus([document_1, document_2], (;corpus_name="corpus")));
julia> using Tables
julia> Tables.istable(matches)
true
julia> Tables.schema(Tables.rowtable(matches))
Tables.Schema:
:document Document{NamedTuple{(:document_name,),Tuple{String}}}
:distance Int64
:indices UnitRange{Int64}
:query Query
:query_name String
:corpus_name String
:document_name String
Helper functions
KeywordSearch.explain — Function
explain([io=stdout], match; context=40)
Prints a human-readable explanation of the match and its context in the document in which it was found.
Example
julia> document = Document("The crabeating macacue ate a crab.")
Document starting with "The crabeating macacue…". Metadata: NamedTuple()
julia> query = augment(FuzzyQuery("crab-eating macaque"))
Or
├─ FuzzyQuery("crab eating macaque", DamerauLevenshtein{Nothing}(nothing), 2)
├─ FuzzyQuery("crabeating macaque", DamerauLevenshtein{Nothing}(nothing), 2)
├─ FuzzyQuery("crab eatingmacaque", DamerauLevenshtein{Nothing}(nothing), 2)
└─ FuzzyQuery("crabeatingmacaque", DamerauLevenshtein{Nothing}(nothing), 2)
julia> m = match(query, document)
QueryMatch with distance 1 at indices 5:22.
julia> explain(m)
The query "crabeating macaque" matched the text "The crabeating macacue ate a crab " with distance 1.
julia> explain(m; context=5) # tweak the amount of context printed
The query "crabeating macaque" matched the text "The crabeating macacue ate…" with distance 1.
julia> sprint(explain, m) # to get the explanation as a string
"The query \"crabeating macaque\" matched the text \"The crabeating macacue ate a crab \" with distance 1.\n"
julia> explain(match(Query("crab"), document)) # exact queries print slightly differently
The query "crab" exactly matched the text "The crabeating macacue ate a crab ".
julia> explain(match(NamedQuery(Query("crab"), "crab query"), document)) # `NamedQuery`s print the same as their underlying query
The query "crab" exactly matched the text "The crabeating macacue ate a crab ".
KeywordSearch.augment — Function
augment(term) -> Vector{String}
Given a term, returns a list of terms which should be treated as synonyms. Currently, the only supported augmentation replaces each space or hyphen with either a space or no space.
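The rule described above can be sketched in a few lines of plain Julia. This is a hypothetical re-implementation for illustration only (augment_sketch is not a KeywordSearch function, and the package's own code and the order of its results may differ): every space or hyphen in the term becomes either a space or nothing, and all combinations are returned.

```julia
# Illustrative sketch only; `augment_sketch` is a hypothetical name and not
# the package's implementation.
function augment_sketch(term::AbstractString)
    parts = split(term, [' ', '-'])    # break the term at spaces and hyphens
    results = [String(parts[1])]
    for part in parts[2:end]
        # each separator is rejoined either with a space or with nothing
        results = [r * sep * part for r in results for sep in (" ", "")]
    end
    return results
end

augment_sketch("arctic wolf")          # ["arctic wolf", "arcticwolf"]
augment_sketch("crab-eating macaque")  # 4 variants, one per combination
```

A term with k separators yields 2^k variants, so the list grows quickly for long phrases.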
Example
julia> KeywordSearch.augment("arctic wolf")
2-element Vector{String}:
"arctic wolf"
"arcticwolf"
KeywordSearch.word_boundary — Function
word_boundary(Q::AbstractQuery) -> AbstractQuery
Ensures that a word or phrase is not hyphenated or conjoined with the surrounding text.
Example
julia> using Test
julia> query = Query("word")
Query("word")
julia> @test match(query, Document("This matchesword ")) !== nothing
Test Passed
julia> @test match(word_boundary(query), Document("This matches word.")) !== nothing
Test Passed
julia> @test match(word_boundary(query), Document("This matches word ")) !== nothing
Test Passed
julia> @test match(word_boundary(query), Document("This matches word\nNext line")) !== nothing
Test Passed
julia> @test match(word_boundary(query), Document("This doesn't matchword ")) === nothing
Test Passed
Constants
KeywordSearch.AUTOMATIC_REPLACEMENTS — Constant
const AUTOMATIC_REPLACEMENTS::Vector{Pair{Union{Regex,String},String}}
A list of replacements to automatically perform when preprocessing a Document. For example, if KeywordSearch.AUTOMATIC_REPLACEMENTS == ["a" => "b"], then Document("abc").text == "bbc" instead of "abc".
By default, AUTOMATIC_REPLACEMENTS contains only one replacement:
julia> KeywordSearch.AUTOMATIC_REPLACEMENTS
1-element Vector{Pair{Union{Regex, String}, String}}:
r"[.!?><\-\v\f\s]+" => " "
which replaces certain punctuation characters, whitespace, and newlines with a space. This replacement is needed for word_boundary to work correctly, but you can remove it with empty!(KeywordSearch.AUTOMATIC_REPLACEMENTS) if you wish.
You can also add other preprocessing directives by push!ing further replacements into KeywordSearch.AUTOMATIC_REPLACEMENTS.
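To see what such replacement pairs do to a string, the sketch below applies them with Base's replace, without loading KeywordSearch. The regex is copied from the default shown above; apply_replacements is a hypothetical helper written for this illustration, the "colour" => "color" pair is a made-up extra directive, and the package's actual preprocessing may involve further steps (such as the space padding described for Document).

```julia
# The default replacement pair, copied verbatim from the listing above.
default_replacement = r"[.!?><\-\v\f\s]+" => " "

# Hypothetical helper: apply each replacement pair in order using Base.replace.
function apply_replacements(text::AbstractString, replacements)
    for r in replacements
        text = replace(text, r)
    end
    return text
end

apply_replacements("Hello, world! How-are you?", [default_replacement])
# "Hello, world How are you "

# A made-up extra directive, analogous to push!-ing another pair into the list.
custom = [default_replacement, "colour" => "color"]
apply_replacements("Which colour?\nAny colour.", custom)
# "Which color Any color "
```

Note that a whole run of matching characters, such as "?\n", collapses to a single space, while characters outside the class, such as the comma, are left untouched.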