Quick tutorial
In this example, we will use sample text adapted from Aristotle's History of Animals, which is in the public domain (http://www.gutenberg.org/files/59058/59058-0.txt).
julia> using KeywordSearch, Random
julia> text_with_typos = Document("""
Some animals have fet, others have noone; of the former some have
two feet, as mankind and birdsonly; others have four, as the lizard
and the dog; others, as the scolopendra and bee, have many feet; but
all have their feet in pairs.
""")
Document starting with " Some animals have…". Metadata: NamedTuple()
julia> fuzzy_query = FuzzyQuery("birds only")
FuzzyQuery{DamerauLevenshtein{Nothing},Int64}("birds only", DamerauLevenshtein{Nothing}(nothing), 2)
julia> m = match(fuzzy_query, text_with_typos)
QueryMatch with distance 2 at indices 92:101.
julia> explain(m)
The query "birds only" matched the text "…former some have two feet, as mankind and birdsonly; others have four, as the lizard and the…" with distance 2.
Here, you'll notice an exact query does not match, since the words "birds" and "only" have been conjoined:
julia> exact_query = Query("birds only")
Query("birds only")
julia> match(exact_query, text_with_typos) # nothing, no exact match
KeywordSearch offers the augment function specifically to address mis-conjoined words:
julia> augmented_query = augment(exact_query)
Or
├─ Query("birds only")
└─ Query("birdsonly")
julia> m2 = match(augmented_query, text_with_typos) # now it matches
QueryMatch with distance 0 at indices 93:101.
julia> m2.query # which of the two queries in the `Or` matched?
Query("birdsonly")
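As a rough illustration of the idea behind handling mis-conjoined words, the sketch below joins each adjacent pair of words in a phrase. The helper name conjoined_variants is hypothetical, and this is not necessarily how augment is implemented in KeywordSearch; it only shows the transformation on plain strings.

```julia
# Hypothetical helper (not part of KeywordSearch): for a phrase of n words,
# return the n - 1 variants in which one pair of adjacent words is joined.
function conjoined_variants(phrase::AbstractString)
    words = String.(split(phrase))   # split on whitespace
    variants = String[]
    for i in 1:length(words)-1
        # Join word i with word i + 1 and keep the remaining words unchanged.
        joined = vcat(words[1:i-1], [words[i] * words[i+1]], words[i+2:end])
        push!(variants, join(joined, " "))
    end
    return variants
end

println(conjoined_variants("birds only"))
```

For the two-word phrase "birds only" there is a single variant, "birdsonly", which corresponds to the second branch of the Or shown above.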
Above, augment generated an Or query, but we can also build one ourselves:
julia> dog_or_cat = Query("dog") | Query("cat")
Or
├─ Query("dog")
└─ Query("cat")
julia> m3 = match(dog_or_cat, text_with_typos)
QueryMatch with distance 0 at indices 144:146.
julia> explain(m3)
The query "dog" exactly matched the text "…others have four, as the lizard and the dog; others, as the scolopendra and bee, have…".
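One way to picture an Or over exact keywords is to try each alternative in turn and report the first one that occurs in the text. The function first_alternative below is a hypothetical helper that works on plain strings with Base's findfirst; the order in which KeywordSearch itself tries the alternatives is an assumption here, not something the tutorial states.

```julia
# Hypothetical helper (not part of KeywordSearch): return the first
# alternative found in `text` together with its index range, or `nothing`.
function first_alternative(alternatives, text::AbstractString)
    for alt in alternatives
        r = findfirst(alt, text)          # UnitRange of the hit, or nothing
        r === nothing || return (query = alt, range = r)
    end
    return nothing
end

hit = first_alternative(["cat", "dog"], "the lizard and the dog")
println(hit)
```

As with m2.query earlier, the returned value records which alternative produced the match.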
Note also that FuzzyQuery by default uses the DamerauLevenshtein() distance from StringDistances.jl and searches for a match within a cutoff of 2, but you can pass it another distance or use another cutoff:
julia> fuzzy_query_2 = FuzzyQuery("brid nly", DamerauLevenshtein(), 4)
FuzzyQuery{DamerauLevenshtein{Nothing},Int64}("brid nly", DamerauLevenshtein{Nothing}(nothing), 4)
julia> m4 = match(fuzzy_query_2, text_with_typos)
QueryMatch with distance 4 at indices 93:100.
julia> explain(m4)
The query "brid nly" matched the text "…former some have two feet, as mankind and birdsonly; others have four, as the lizard and the…" with distance 4.
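To make the roles of the distance and the cutoff more concrete, here is a minimal self-contained sketch of fuzzy matching on plain strings. It makes two assumptions that are not taken from KeywordSearch: it uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, and it compares the query against every window of the text that has the same length as the query. The names osa_distance and best_window are hypothetical.

```julia
# Restricted Damerau-Levenshtein (optimal string alignment) distance:
# insertions, deletions, substitutions and adjacent transpositions cost 1.
function osa_distance(a::AbstractString, b::AbstractString)
    s, t = collect(a), collect(b)
    m, n = length(s), length(t)
    d = zeros(Int, m + 1, n + 1)
    d[:, 1] .= 0:m                       # delete every character of s
    d[1, :] .= 0:n                       # insert every character of t
    for i in 1:m, j in 1:n
        cost = s[i] == t[j] ? 0 : 1
        d[i+1, j+1] = min(d[i, j+1] + 1,     # deletion
                          d[i+1, j] + 1,     # insertion
                          d[i, j] + cost)    # substitution or match
        if i > 1 && j > 1 && s[i] == t[j-1] && s[i-1] == t[j]
            d[i+1, j+1] = min(d[i+1, j+1], d[i-1, j-1] + 1)  # transposition
        end
    end
    return d[m+1, n+1]
end

# Slide a window of the query's length over the text and keep the first
# window with the smallest distance; return `nothing` if it exceeds `cutoff`.
function best_window(query::AbstractString, text::AbstractString, cutoff::Int)
    q = length(query)
    chars = collect(text)
    best = (distance = typemax(Int), range = 0:-1)
    for start in 1:(length(chars) - q + 1)
        window = String(chars[start:start+q-1])
        dist = osa_distance(query, window)
        if dist < best.distance
            best = (distance = dist, range = start:start+q-1)
        end
    end
    return best.distance <= cutoff ? best : nothing
end

text = "as mankind and birdsonly; others have four"
println(best_window("birds only", text, 2))
```

Under these assumptions, "birds only" is at distance 1 from "birdsonly" but at distance 2 from the best same-length window of the text, so a cutoff of 2 accepts the match while a cutoff of 1 would reject it.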