API Documentation

Queries

KeywordSearch.Query - Type
Query(text::AbstractString; replacements=AUTOMATIC_REPLACEMENTS) <: AbstractQuery

A query to search for an exact match of a string, with one field:

  • text::String

The text is automatically processed by applying the replacements from replacements (which defaults to AUTOMATIC_REPLACEMENTS).
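
As a minimal sketch of how an exact query might be used: the document text and the metadata value below are made up for illustration, and only Query, Document, and match as documented on this page are assumed.

```julia
using KeywordSearch

# Hypothetical document text and metadata, chosen only for illustration.
document = Document("The crab-eating macaque ate a crab.", (; document_name = "example"))

# An exact query; its text is preprocessed with the default replacements.
query = Query("crab")

# `match` returns `nothing` when there is no match, otherwise a `QueryMatch`.
m = match(query, document)
if m === nothing
    println("no match")
else
    println("matched at indices ", m.indices)
end
```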

KeywordSearch.FuzzyQuery - Type
FuzzyQuery(text, dist, threshold; replacements=AUTOMATIC_REPLACEMENTS) <: AbstractQuery

A query to search for a fuzzy match of a string, with three fields:

  • text::String: the text to match
  • dist::D: the distance measure to use; defaults to DamerauLevenshtein()
  • threshold::T: the maximum threshold allowed for a match; defaults to 2.

The text is automatically processed by applying the replacements from replacements (which defaults to AUTOMATIC_REPLACEMENTS).
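
The following sketch shows the defaults next to an explicitly specified distance and threshold. It assumes that DamerauLevenshtein can be imported from StringDistances.jl; the document text is invented for illustration.

```julia
using KeywordSearch
# Assumption: the distance type is provided by StringDistances.jl.
using StringDistances: DamerauLevenshtein

# Hypothetical document containing a misspelling of "macaque".
document = Document("The crabeating macacue ate a crab.", (; document_name = "example"))

# Relies on the documented defaults: DamerauLevenshtein() and a threshold of 2.
default_query = FuzzyQuery("macaque")

# The same text with the distance spelled out and a stricter threshold.
strict_query = FuzzyQuery("macaque", DamerauLevenshtein(), 1)

# A fuzzy match reports its distance and where it was found.
m = match(strict_query, document)
if m !== nothing
    println("distance = ", m.distance, ", indices = ", m.indices)
end
```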

KeywordSearch.NamedQuery - Type
NamedQuery(query::AbstractQuery, metadata::Union{String, NamedTuple})

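Creates a NamedQuery, which wraps a query together with a metadata field holding information about that query. If the metadata is given as a String, it is stored under the key query_name, as in the example below. When used with match, a NamedQuery returns a NamedMatch, which carries the metadata of the NamedQuery as well as the metadata of the object that was matched.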

Example

julia> document_1 = Document("One", (; document_name = "a"))
Document with text "One ". Metadata: (document_name = "a",)

julia> document_2 = Document("Two", (; document_name = "b"))
Document with text "Two ". Metadata: (document_name = "b",)

julia> corpus = Corpus([document_1, document_2], (; corpus_name = "Numbers"))
Corpus with 2 documents, each with metadata keys: (:document_name,)
Corpus metadata: (corpus_name = "Numbers",)

julia> query = NamedQuery(FuzzyQuery("one"), "find one")
NamedQuery
├─ (query_name = "find one",)
└─ FuzzyQuery("one", DamerauLevenshtein{Nothing}(nothing), 2)

julia> m = match(query, corpus)
NamedMatch
├─ (query_name = "find one", corpus_name = "Numbers", document_name = "a")
└─ QueryMatch with distance 1 at indices 1:3.
   ├─ FuzzyQuery("one", DamerauLevenshtein{Nothing}(nothing), 2)
   └─ Document with text "One ". Metadata: (document_name = "a",)

julia> m.metadata
(query_name = "find one", corpus_name = "Numbers", document_name = "a")

Documents

KeywordSearch.Document - Type
Document(text::AbstractString, metadata::T; replacements=AUTOMATIC_REPLACEMENTS) where {T<:NamedTuple}

Represents a single string document. This object has two fields:

  • text::String
  • metadata::T

The text is automatically processed by applying the replacements from replacements (which defaults to AUTOMATIC_REPLACEMENTS) and by adding a space to the start and end of the document (if one doesn't exist already).

Base.match - Method
Base.match(query::AbstractQuery, document::Document)

Looks for a match for query in document. Returns either nothing if no match is found, or a QueryMatch object.

KeywordSearch.match_all - Method
match_all(query::AbstractQuery, document::Document)

Looks for all matches for query in the document. Returns a Vector of QueryMatch objects corresponding to all of the matches found.
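
The difference between the two return types could be handled as in the sketch below. The document text is hypothetical; the field names indices and distance are those listed for QueryMatch on this page.

```julia
using KeywordSearch

# Hypothetical document in which the query text occurs more than once.
document = Document("one crab, then one more crab", (; document_name = "example"))
query = Query("crab")

# `match` yields a single result or `nothing`, so check before using it.
first_match = match(query, document)
if first_match !== nothing
    println("first match at ", first_match.indices)
end

# `match_all` yields a vector, which is simply empty when nothing matches.
for m in match_all(query, document)
    println("match at ", m.indices, " with distance ", m.distance)
end
```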


Corpuses

KeywordSearch.Corpus - Type
Corpus{T<:NamedTuple,D}

A corpus is a collection of Documents, along with some metadata. It has two fields:

  • documents::Vector{Document{D}}
  • metadata::T

Note that each Document in a Corpus must have metadata of the same type.

Base.match - Method
Base.match(query::AbstractQuery, corpus::Corpus)

Looks for a match for query in any Document in corpus. Returns either nothing if no match is found in any Document, or a QueryMatch object.

KeywordSearch.match_all - Method
match_all(query::AbstractQuery, corpus::Corpus)

Looks for all matches for query from all documents in corpus. Returns a Vector of QueryMatch objects corresponding to all of the matches found, across all documents.
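
As a sketch of searching a whole corpus, the example below reports which document each match came from. It reuses the documents from the NamedMatch example on this page and assumes only the documented fields (documents, document, metadata, indices).

```julia
using KeywordSearch

# Every document in a corpus must carry metadata of the same type.
documents = [Document("one", (; document_name = "a")),
             Document("Two but there's also a one here.", (; document_name = "b"))]
corpus = Corpus(documents, (; corpus_name = "corpus"))

# Each QueryMatch keeps a reference to the Document it was found in,
# so the document metadata can be read back from the match.
for m in match_all(Query("one"), corpus)
    println(m.document.metadata.document_name, ": ", m.indices)
end
```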


Matches

KeywordSearch.QueryMatch - Type
QueryMatch{Q<:AbstractQuery,Doc<:Document,D,I}

Represents a match for an AbstractQuery, with four fields:

  • query::Q: the query itself
  • document::Doc: the Document which was matched to
  • distance::D: the distance of the match
  • indices::I: the indices of where in the document the match occurred.

KeywordSearch.NamedMatch - Type
NamedMatch{T,M<:QueryMatch}

Represents a match for a NamedQuery, with two fields:

  • match::M: the QueryMatch object corresponding to the match
  • metadata::T: a NamedTuple of metadata

A NamedMatch is created by the method match(query::NamedQuery, obj).

NamedMatch satisfies the Tables.jl AbstractRow interface. This means that a vector of NamedMatch objects is a valid Tables.jl-compatible table.

Example

julia> document_1 = Document("one", (; document_name = "a"))
Document with text "one ". Metadata: (document_name = "a",)

julia> document_2 = Document("Two but there's also a one here.", (; document_name = "b"))
Document starting with "Two but there's…". Metadata: (document_name = "b",)

julia> query = NamedQuery(Query("one"), "find one")
NamedQuery
├─ (query_name = "find one",)
└─ Query("one")

julia> matches = match_all(query, Corpus([document_1, document_2], (;corpus_name="corpus")));


julia> using Tables


julia> Tables.istable(matches)
true

julia> Tables.schema(Tables.rowtable(matches))
Tables.Schema:
 :document       Document{NamedTuple{(:document_name,),Tuple{String}}}
 :distance       Int64
 :indices        UnitRange{Int64}
 :query          Query
 :query_name     String
 :corpus_name    String
 :document_name  String

Helper functions

KeywordSearch.explain - Function
explain([io=stdout], match; context=40)

Prints a human-readable explanation of the match and its context in the document in which it was found.

Example

julia> document = Document("The crabeating macacue ate a crab.")
Document starting with "The crabeating macacue…". Metadata: NamedTuple()

julia> query = augment(FuzzyQuery("crab-eating macaque"))
Or
├─ FuzzyQuery("crab eating macaque", DamerauLevenshtein{Nothing}(nothing), 2)
├─ FuzzyQuery("crabeating macaque", DamerauLevenshtein{Nothing}(nothing), 2)
├─ FuzzyQuery("crab eatingmacaque", DamerauLevenshtein{Nothing}(nothing), 2)
└─ FuzzyQuery("crabeatingmacaque", DamerauLevenshtein{Nothing}(nothing), 2)

julia> m = match(query, document)
QueryMatch with distance 1 at indices 5:22.

julia> explain(m)
The query "crabeating macaque" matched the text "The crabeating macacue ate a crab  " with distance 1.

julia> explain(m; context=5) # tweak the amount of context printed
The query "crabeating macaque" matched the text "The crabeating macacue ate…" with distance 1.

julia> sprint(explain, m) # to get the explanation as a string
"The query \"crabeating macaque\" matched the text \"The crabeating macacue ate a crab  \" with distance 1.\n"

julia> explain(match(Query("crab"), document)) # exact queries print slightly differently
The query "crab" exactly matched the text "The crabeating macacue ate a crab  ".

julia> explain(match(NamedQuery(Query("crab"), "crab query"), document)) # `NamedQuery`s print the same as their underlying query
The query "crab" exactly matched the text "The crabeating macacue ate a crab  ".

KeywordSearch.augment - Function
augment(term) -> Vector{String}

Given a term, returns a list of terms which should be treated as synonyms. Currently only supports augmenting terms that contain spaces or hyphens: each space or hyphen is either replaced by a space or removed.

Example

julia> KeywordSearch.augment("arctic wolf")
2-element Vector{String}:
 "arctic wolf"
 "arcticwolf"
 
KeywordSearch.word_boundary - Function
word_boundary(Q::AbstractQuery) -> AbstractQuery

Ensures that a word or phrase is not hyphenated or conjoined with the surrounding text.

Example

julia> using Test

julia> query = Query("word")
Query("word")

julia> @test match(query, Document("This matchesword ")) !== nothing
Test Passed

julia> @test match(word_boundary(query), Document("This matches word.")) !== nothing
Test Passed

julia> @test match(word_boundary(query), Document("This matches word ")) !== nothing
Test Passed

julia> @test match(word_boundary(query), Document("This matches word\nNext line")) !== nothing
Test Passed

julia> @test match(word_boundary(query), Document("This doesn't matchword ")) === nothing
Test Passed

Constants

KeywordSearch.AUTOMATIC_REPLACEMENTS - Constant
const AUTOMATIC_REPLACEMENTS::Vector{Pair{Union{Regex,String},String}}

A list of replacements to automatically perform when preprocessing a Document. For example, if KeywordSearch.AUTOMATIC_REPLACEMENTS == ["a" => "b"], then Document("abc").text == "bbc" instead of "abc".

By default, AUTOMATIC_REPLACEMENTS contains only one replacement:

julia> KeywordSearch.AUTOMATIC_REPLACEMENTS
1-element Vector{Pair{Union{Regex, String}, String}}:
 r"[.!?><\-\v\f\s]+" => " "

which replaces each run of certain punctuation characters (. ! ? > < -) and whitespace, including newlines, with a single space. This replacement is needed for word_boundary to work correctly, but you can remove it with empty!(KeywordSearch.AUTOMATIC_REPLACEMENTS) if you wish.

You can also add other preprocessing directives by push!ing further replacements into KeywordSearch.AUTOMATIC_REPLACEMENTS.
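
To see what the default replacement does to a piece of text, it can be applied by hand with Base.replace. This sketch uses only Base Julia, with a made-up input string; it illustrates the regular expression itself, not how KeywordSearch applies it internally.

```julia
# The default replacement pair, copied from the listing above.
default_replacement = r"[.!?><\-\v\f\s]+" => " "

# Hypothetical raw text with a hyphen, repeated spaces, punctuation, and a newline.
raw = "crab-eating  macaque!\nNext line"

# Each run of matching characters collapses into a single space.
cleaned = replace(raw, default_replacement)
println(repr(cleaned))  # prints "crab eating macaque Next line"
```

A further replacement pushed into the list would take the same pattern => substitution form.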
