Tables of matches
Matches can be collected into a table via NamedQuery objects.
julia> using KeywordSearch, Random, DataFrames
julia> queries = [NamedQuery(FuzzyQuery("dog"), "find dog"),
                  NamedQuery(FuzzyQuery("cat"), "find cat"),
                  NamedQuery(Query("koala") | FuzzyQuery("Opossum"), "find marsupial")]
3-element Array{NamedQuery{NamedTuple{(:query_name,),Tuple{String}},Q} where Q<:KeywordSearch.AbstractQuery,1}:
NamedQuery
├─ (query_name = "find dog",)
└─ FuzzyQuery("dog", DamerauLevenshtein{Nothing}(nothing), 2)
NamedQuery
├─ (query_name = "find cat",)
└─ FuzzyQuery("cat", DamerauLevenshtein{Nothing}(nothing), 2)
NamedQuery
├─ (query_name = "find marsupial",)
└─ Or
   ├─ Query("koala")
   └─ FuzzyQuery("Opossum", DamerauLevenshtein{Nothing}(nothing), 2)
julia> words = ["dg", "cat", "koala", "opposum"]
4-element Array{String,1}:
"dg"
"cat"
"koala"
"opposum"
julia> corpus1 = Corpus([Document(randstring(rand(1:10)) * rand(words) * randstring(rand(1:10)),
                                  (; doc_index=j)) for j in 1:10], (; name="docs"))
Corpus with 10 documents, each with metadata keys: (:doc_index,)
Corpus metadata: (name = "docs",)
julia> corpus1.documents
10-element Array{Document{NamedTuple{(:doc_index,),Tuple{Int64}}},1}:
Document starting with " kipopwkoalaLKtZQxG…". Metadata: (doc_index = 1,)
Document with text " luEHfibwcatE ". Metadata: (doc_index = 2,)
Document with text " KnQcatFRpTr ". Metadata: (doc_index = 3,)
Document with text " dGydgv7R ". Metadata: (doc_index = 4,)
Document starting with " CkoalaRyf5JyNhzI…". Metadata: (doc_index = 5,)
Document with text " RuDcat6B0AsLsa ". Metadata: (doc_index = 6,)
Document with text " ekoaladjYWa ". Metadata: (doc_index = 7,)
Document starting with " 188opposumnMieYI…". Metadata: (doc_index = 8,)
Document with text " Rl4cmdkkFdgh ". Metadata: (doc_index = 9,)
Document starting with " C6GKdgUbUyQWqRus…". Metadata: (doc_index = 10,)
julia> corpus2 = Corpus([Document(randstring(rand(1:10)), (; doc_index=2 * j)) for j in 1:10],
                        (; name="other docs"))
Corpus with 10 documents, each with metadata keys: (:doc_index,)
Corpus metadata: (name = "other docs",)
julia> corpuses = [corpus1, corpus2]
2-element Array{Corpus{NamedTuple{(:name,),Tuple{String}},NamedTuple{(:doc_index,),Tuple{Int64}}},1}:
Corpus with 10 documents, each with metadata keys: (:doc_index,)
Corpus with 10 documents, each with metadata keys: (:doc_index,)
julia> matches = [match(named_query, corpus) for named_query in queries for corpus in corpuses];
julia> filter!(!isnothing, matches);
julia> DataFrame(matches)
5×7 DataFrame
Row │ document distance indices query ⋯
│ Document… Int64 UnitRang… Abstract… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Document starting with " kipopwk… 2 4:6 FuzzyQuery{Dame ⋯
2 │ Document with text " BbXoEGDbS "… 2 4:6 FuzzyQuery{Dame
3 │ Document starting with " kipopwk… 2 9:11 FuzzyQuery{Dame
4 │ Document with text " ma ". Metad… 2 2:4 FuzzyQuery{Dame
5 │ Document starting with " kipopwk… 0 8:12 Query("koala") ⋯
4 columns omitted
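The comprehension followed by filter!(!isnothing, matches) suggests that match yields nothing for a (query, corpus) pair with no hit. As a minimal sketch of that collect-then-filter pattern, the block below uses Base's regex match, which also returns nothing when there is no match, as a stand-in for KeywordSearch queries. The texts and patterns are made up for illustration, and nothing here calls KeywordSearch itself.

```julia
# Made-up inputs; Regex patterns stand in for KeywordSearch queries.
texts = ["hotdog stand", "catalog", "koala park"]
patterns = [r"dog", r"cat"]

# One entry per (pattern, text) pair; an entry is `nothing` where nothing matched.
results = [match(p, t) for p in patterns for t in texts]

# Drop the misses in place, keeping only actual matches.
filter!(!isnothing, results)

# The substrings that were actually found.
hits = [m.match for m in results]
```

Six pairs are tried here, and only the two that matched remain after filtering.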
We can also make use of Transducers.jl to easily multithread or parallelize across cores via tcollect or dcollect:
julia> using Transducers
julia> matches = tcollect(Filter(!isnothing)(MapSplat(match)(Iterators.product(queries, corpuses))));
julia> DataFrame(matches)
5×7 DataFrame
Row │ document distance indices query ⋯
│ Document… Int64 UnitRang… Abstract… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Document starting with " kipopwk… 2 4:6 FuzzyQuery{Dame ⋯
2 │ Document starting with " kipopwk… 2 9:11 FuzzyQuery{Dame
3 │ Document starting with " kipopwk… 0 8:12 Query("koala")
4 │ Document with text " BbXoEGDbS "… 2 4:6 FuzzyQuery{Dame
5 │ Document with text " ma ". Metad… 2 2:4 FuzzyQuery{Dame ⋯
4 columns omitted
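The two tables above list the same rows in a different order. That is consistent with how the two forms enumerate (query, corpus) pairs: the nested comprehension varies the corpus fastest, while Iterators.product varies its first argument (here, the queries) fastest. The sketch below shows both orderings with plain strings as hypothetical stand-ins for the queries and corpuses. It uses only Base and does not call KeywordSearch or Transducers.

```julia
# Hypothetical stand-ins: plain strings instead of NamedQuery and Corpus objects.
query_names = ["dog", "cat", "marsupial"]
corpus_names = ["docs", "other docs"]

# Nested comprehension: the last `for` is the inner loop, so corpuses vary fastest.
nested_order = [(q, c) for q in query_names for c in corpus_names]

# Iterators.product: the first argument varies fastest, so queries vary fastest.
product_order = vec(collect(Iterators.product(query_names, corpus_names)))

# Both forms enumerate the same pairs, only in a different order.
same_pairs = sort(nested_order) == sort(product_order)
```

If a particular row order matters downstream, sorting the resulting table afterwards is one option.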