Tables of matches via NamedQuerys
julia> using KeywordSearch, Random, DataFrames
julia> queries = [NamedQuery(FuzzyQuery("dog"), "find dog"),
                  NamedQuery(FuzzyQuery("cat"), "find cat"),
                  NamedQuery(Query("koala") | FuzzyQuery("Opossum"), "find marsupial")]
3-element Array{NamedQuery{NamedTuple{(:query_name,),Tuple{String}},Q} where Q<:KeywordSearch.AbstractQuery,1}:
NamedQuery
├─ (query_name = "find dog",)
└─ FuzzyQuery("dog", DamerauLevenshtein{Nothing}(nothing), 2)
NamedQuery
├─ (query_name = "find cat",)
└─ FuzzyQuery("cat", DamerauLevenshtein{Nothing}(nothing), 2)
NamedQuery
├─ (query_name = "find marsupial",)
└─ Or
   ├─ Query("koala")
   └─ FuzzyQuery("Opossum", DamerauLevenshtein{Nothing}(nothing), 2)
julia> words = ["dg", "cat", "koala", "opposum"]
4-element Array{String,1}:
"dg"
"cat"
"koala"
"opposum"
julia> corpus1 = Corpus([Document(randstring(rand(1:10)) * rand(words) * randstring(rand(1:10)),
                                  (; doc_index=j)) for j in 1:10], (; name="docs"))
Corpus with 10 documents, each with metadata keys: (:doc_index,)
Corpus metadata: (name = "docs",)
julia> corpus1.documents
10-element Array{Document{NamedTuple{(:doc_index,),Tuple{Int64}}},1}:
Document with text " ZihvKfcati9K4m ". Metadata: (doc_index = 1,)
Document starting with " w15txPj6koalasCfRXqQ4BC…". Metadata: (doc_index = 2,)
Document starting with " 35shropposum8EK…". Metadata: (doc_index = 3,)
Document starting with " mR7Yopposum38sAEvjd5…". Metadata: (doc_index = 4,)
Document starting with " 0QpwRscatfwUKqq…". Metadata: (doc_index = 5,)
Document with text " SVkINDcatK ". Metadata: (doc_index = 6,)
Document starting with " 2lMhObbiekoala3EubnG…". Metadata: (doc_index = 7,)
Document starting with " wFjPuOHx9opposumbD28BS…". Metadata: (doc_index = 8,)
Document starting with " nGEbslMakoala29lS7yjdc…". Metadata: (doc_index = 9,)
Document with text " ckYAdg6eTI4kZ ". Metadata: (doc_index = 10,)
julia> corpus2 = Corpus([Document(randstring(rand(1:10)), (; doc_index=2 * j)) for j in 1:10],
                        (; name="other docs"))
Corpus with 10 documents, each with metadata keys: (:doc_index,)
Corpus metadata: (name = "other docs",)
julia> corpuses = [corpus1, corpus2]
2-element Array{Corpus{NamedTuple{(:name,),Tuple{String}},NamedTuple{(:doc_index,),Tuple{Int64}}},1}:
Corpus with 10 documents, each with metadata keys: (:doc_index,)
Corpus with 10 documents, each with metadata keys: (:doc_index,)
julia> matches = [match(named_query, corpus) for named_query in queries for corpus in corpuses];
julia> filter!(!isnothing, matches);
julia> DataFrame(matches)
4×7 DataFrame
Row │ document distance indices query ⋯
│ Document… Int64 UnitRang… Abstract… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Document starting with " w15txPj… 2 10:12 FuzzyQuery{Dame ⋯
2 │ Document with text " iosFNv6 ". … 2 2:4 FuzzyQuery{Dame
3 │ Document with text " ZihvKfcati9… 0 8:10 FuzzyQuery{Dame
4 │ Document starting with " w15txPj… 0 10:14 Query("koala")
4 columns omitted
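The filter!(!isnothing, matches) step is there because a query that finds nothing in a corpus yields nothing rather than a match object. The same pattern can be seen with Base's own match on regular expressions. The sketch below uses only Base Julia; the texts vector is made up for illustration and is not part of the corpora above.

```julia
# Illustrative data (hypothetical, not taken from corpus1 or corpus2).
texts = ["hot dog", "cat nap", "koala"]

# Base.match returns `nothing` when the pattern is not found.
results = [match(r"dog|cat", t) for t in texts]

# Drop the misses in place, keeping only the actual matches.
filter!(!isnothing, results)

# Extract the matched substrings from the remaining RegexMatch objects.
found = [m.match for m in results]
```

Filtering before building the table keeps nothing entries out of the rows handed to DataFrame.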
We can also use Transducers.jl to run the search across threads with tcollect or across processes with dcollect:
julia> using Transducers
julia> matches = tcollect(Filter(!isnothing)(MapSplat(match)(Iterators.product(queries, corpuses))));
julia> DataFrame(matches)
4×7 DataFrame
Row │ document distance indices query ⋯
│ Document… Int64 UnitRang… Abstract… ⋯
─────┼──────────────────────────────────────────────────────────────────────────
1 │ Document starting with " w15txPj… 2 10:12 FuzzyQuery{Dame ⋯
2 │ Document with text " ZihvKfcati9… 0 8:10 FuzzyQuery{Dame
3 │ Document starting with " w15txPj… 0 10:14 Query("koala")
4 │ Document with text " iosFNv6 ". … 2 2:4 FuzzyQuery{Dame
4 columns omitted
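The two tables contain the same four matches in a different row order. One plausible reason is the iteration order of the (query, corpus) pairs: in the nested comprehension the corpus varies fastest, while Iterators.product varies its first argument (the query) fastest. A minimal sketch with placeholder strings standing in for the queries and corpora (the names q1, c1, and so on are hypothetical):

```julia
# Placeholder labels standing in for the real queries and corpora.
queries = ["q1", "q2", "q3"]
corpuses = ["c1", "c2"]

# Nested comprehension: the last `for` clause (the corpus) varies fastest.
nested = [(q, c) for q in queries for c in corpuses]

# Iterators.product: the first argument (the query) varies fastest.
via_product = vec(collect(Iterators.product(queries, corpuses)))

println(nested[1:2])       # first two pairs share the same query
println(via_product[1:2])  # first two pairs share the same corpus
```

If a stable row order matters, sorting the resulting DataFrame on its metadata columns avoids depending on either iteration order.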