Tables of matches

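Matches can be collected into a table by wrapping each query in a NamedQuery, which attaches metadata (here, a query name) to the query.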

julia> using KeywordSearch, Random, DataFrames

julia> queries = [NamedQuery(FuzzyQuery("dog"), "find dog"),
                  NamedQuery(FuzzyQuery("cat"), "find cat"),
                  NamedQuery(Query("koala") | FuzzyQuery("Opossum"), "find marsupial")]
3-element Array{NamedQuery{NamedTuple{(:query_name,),Tuple{String}},Q} where Q<:KeywordSearch.AbstractQuery,1}:
 NamedQuery
├─ (query_name = "find dog",)
└─ FuzzyQuery("dog", DamerauLevenshtein{Nothing}(nothing), 2)
 NamedQuery
├─ (query_name = "find cat",)
└─ FuzzyQuery("cat", DamerauLevenshtein{Nothing}(nothing), 2)
 NamedQuery
├─ (query_name = "find marsupial",)
└─ Or
   ├─ Query("koala")
   └─ FuzzyQuery("Opossum", DamerauLevenshtein{Nothing}(nothing), 2)

julia> words = ["dg", "cat", "koala", "opposum"]
4-element Array{String,1}:
 "dg"
 "cat"
 "koala"
 "opposum"

julia> corpus1 = Corpus([Document(randstring(rand(1:10)) * rand(words) * randstring(rand(1:10)),
                                  (; doc_index=j)) for j in 1:10], (; name="docs"))
Corpus with 10 documents, each with metadata keys: (:doc_index,)
Corpus metadata: (name = "docs",)

julia> corpus1.documents
10-element Array{Document{NamedTuple{(:doc_index,),Tuple{Int64}}},1}:
 Document starting with " kipopwkoalaLKtZQxG…". Metadata: (doc_index = 1,)
 Document with text " luEHfibwcatE ". Metadata: (doc_index = 2,)
 Document with text " KnQcatFRpTr ". Metadata: (doc_index = 3,)
 Document with text " dGydgv7R ". Metadata: (doc_index = 4,)
 Document starting with " CkoalaRyf5JyNhzI…". Metadata: (doc_index = 5,)
 Document with text " RuDcat6B0AsLsa ". Metadata: (doc_index = 6,)
 Document with text " ekoaladjYWa ". Metadata: (doc_index = 7,)
 Document starting with " 188opposumnMieYI…". Metadata: (doc_index = 8,)
 Document with text " Rl4cmdkkFdgh ". Metadata: (doc_index = 9,)
 Document starting with " C6GKdgUbUyQWqRus…". Metadata: (doc_index = 10,)

julia> corpus2 = Corpus([Document(randstring(rand(1:10)), (; doc_index=2 * j)) for j in 1:10],
                        (; name="other docs"))
Corpus with 10 documents, each with metadata keys: (:doc_index,)
Corpus metadata: (name = "other docs",)

julia> corpuses = [corpus1, corpus2]
2-element Array{Corpus{NamedTuple{(:name,),Tuple{String}},NamedTuple{(:doc_index,),Tuple{Int64}}},1}:
 Corpus with 10 documents, each with metadata keys: (:doc_index,)
 Corpus with 10 documents, each with metadata keys: (:doc_index,)

julia> matches = [match(named_query, corpus) for named_query in queries for corpus in corpuses];

julia> filter!(!isnothing, matches);

julia> DataFrame(matches)
5×7 DataFrame
 Row │ document                           distance  indices    query           ⋯
     │ Document…                          Int64     UnitRang…  Abstract…       ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Document starting with " kipopwk…         2  4:6        FuzzyQuery{Dame ⋯
   2 │ Document with text " BbXoEGDbS "…         2  4:6        FuzzyQuery{Dame
   3 │ Document starting with " kipopwk…         2  9:11       FuzzyQuery{Dame
   4 │ Document with text " ma ". Metad…         2  2:4        FuzzyQuery{Dame
   5 │ Document starting with " kipopwk…         0  8:12       Query("koala")  ⋯
                                                               4 columns omitted

We can also use Transducers.jl to parallelize the search, either across threads with tcollect or across processes with dcollect:

julia> using Transducers

julia> matches = tcollect(Filter(!isnothing)(MapSplat(match)(Iterators.product(queries, corpuses))));

julia> DataFrame(matches)
5×7 DataFrame
 Row │ document                           distance  indices    query           ⋯
     │ Document…                          Int64     UnitRang…  Abstract…       ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ Document starting with " kipopwk…         2  4:6        FuzzyQuery{Dame ⋯
   2 │ Document starting with " kipopwk…         2  9:11       FuzzyQuery{Dame
   3 │ Document starting with " kipopwk…         0  8:12       Query("koala")
   4 │ Document with text " BbXoEGDbS "…         2  4:6        FuzzyQuery{Dame
   5 │ Document with text " ma ". Metad…         2  2:4        FuzzyQuery{Dame ⋯
                                                               4 columns omitted
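
The two tables above contain the same matches in a different row order. A minimal sketch of why, assuming nothing beyond Base Julia and Transducers.jl: `toy_match` below is a hypothetical stand-in for `match` (it is not part of KeywordSearch) that returns `nothing` when there is no hit. The nested comprehension varies the second argument fastest, while `Iterators.product` varies the first argument fastest, so the filtered results agree as sets but not as sequences.

```julia
using Transducers

# Hypothetical stand-in for `match`: a tuple on a hit, `nothing` otherwise.
toy_match(needle, haystack) = occursin(needle, haystack) ? (needle, haystack) : nothing

needles = ["dog", "cat"]
haystacks = ["hotdog catalog", "dogma"]

# Serial version: the outer loop runs over needles, the inner loop over haystacks.
looped = [toy_match(n, h) for n in needles for h in haystacks]
filter!(!isnothing, looped)

# Threaded version: the same pipeline shape as in the text.
# `MapSplat` splats each (needle, haystack) pair into `toy_match`,
# and `Filter` drops the `nothing` results.
threaded = tcollect(Filter(!isnothing)(MapSplat(toy_match)(Iterators.product(needles, haystacks))))

# Same hits, possibly in a different order.
@show Set(looped) == Set(threaded)
```

Starting Julia with more than one thread (for example `julia --threads 4`) lets `tcollect` spread the work; with a single thread the same code still runs, just serially.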