This repository contains the Scala code for generic grounding.
Grounding means that you have a text you want to ground, together with a list of candidate groundings and the goal is to return the best items from the candidate list (if any). Best can be application dependent, but it typically means the most similar (say, semantically). For example, if we want to ground "dog"
and the list of candidates is: ["Shiba Inu", "cat", "Brown Bear"]
, then we should (probably) ground it to "Shiba Inu"
(which is a dog breed).
The project contains 4
types of grounders:
The Exact Matcher grounder uses TermQuery
from Lucene for matching. It looks for exact (i.e., ===
) matches. For example, assuming we search for "dog"
, and we have (pseudocode):
Doc1({"field_a": "dog", "field_b": "dog cat"})
Doc2({"field_a": "dog cat", "field_b": "cat dog"})
Assuming we search on field_a
and field_b
, then Doc1
is a match (because of field_a
, dog
== dog
)
Doc2
, however, is not an exact match (dog
is not == with dog cat
, dog
is not == with cat dog
)
In other words, the field we are searching on should have exactly the text we are searching for, nothing less, nothing more.
To supplement the Exact Matcher and to relax the very strict requirements, we provide two types of Fuzzy Matchers
This grounder attempts to do fuzzy matching
with slops (i.e., gaps)
For example, assuming we search for "dog cat"
, and we have (pseudocode):
Doc1({"field_a": "dog dog", "field_b": "dog dog cat"})
Doc2({"field_a": "dog cat dog", "field_b": "cat dog"})
Assuming we search on field_a
and field_b
, then Doc1
is a match (because of field_b
),
and Doc2
is a match (because field_a
)
This grounder allows for slops (i.e. gaps in between the search phrase). For example:
- slop=0 means no gap (
dog cat
fordog cat
), - slop=1 means a gap of 1 item (
dog _ cat
fordog cat
) - slop=2 means a gap of 2 items (
dog _ _ cat
fordog cat
), etc
One might wonder if a slop=0
is similar to Exact Grounder
or not. In other words, one might wonder what happens in a scenario like this:
Doc1({"field_a": "dog dog"})
Doc2({"field_a": "dog dog cat"})
Doc3({"field_a": "cat dog dog"})
Doc4({"field_a": "cat dog dog cat"})
Doc5({"field_a": "cat dog cat dog cat"})
Where we try to ground the following text: dog dog
and we use a slop of 0 (i.e., slop=0
).
It is clear that Doc1
will be grounded, but the question is if (and if yes, which of) Doc2
, Doc3
, Doc4
, or Doc5
will be grounded.
The answer is that Doc2
, Doc3
, and Doc4
will be grounded, and Doc5
will not be grounded.
In other words, a slop=0
does not behave exactly like an Exact Grounder
.
This grounder attempts to do fuzzy matching
with edit distance.
For example, assuming we search for "dog"
, and we have:
Doc1({"field_a": "dog"})
Doc2({"field_a": "dogg"})
Doc3({"field_a": "doggg"})
Doc4({"field_a": "cat"})
Assuming we search on field_a
with max Edit = 1
, then Doc1
and Doc2
are a match.
Doc3
is not because the edit distance between dog
and doggg
is larger.
Doc4
is not because the edit distance between dog
and cat
is larger.
Assuming we search on field_a
with max Edit = 2
, then Doc1
, Doc2
, and Doc3
are a match.
Doc4
is not because the edit distance between dog
and doggg
is larger.
One might wonder what happens when max Edit = 0
. Assuming the following example (pseudocode):
Doc1({"field_a": "dog"})
Doc2({"field_a": "cat dog"})
Doc3({"field_a": "dog cat"})
Doc4({"field_a": "cat dog cat"})
And we try to ground dog
, all Doc1
, Doc2
, Doc3
, and Doc4
will be returned. Doc2
and Doc3
will have the same score, and score(Doc1)
> score(Doc2)
= score(Doc3)
> score(Doc4)
To go beyond simple lexical match, we provide a Neural Grounder mechanism which is capable (in theory) of more "semantic" matching.
For this, we use an ONNX model which we call from scala. Please make sure to use the same format as the one used during the training phase of the neural network.
The way this grounder works is to obtain a score such as: neural_network(text, grounding_candidate)
. If this score is greater than a threshold (say, 0.5
), then we return this as an yes (i.e., grounded). Otherwise, we do not return it.
We allow arbitrary combinations of the aforementioned grounders. Currently, we offer a sequential grounder which is just a list of grounders:
[
'Exact Matcher', # 1
'Fuzzy Slop Matcher (slops=[0,1,2,4])', # 2
'Fuzzy Edit Distance Matcher (max_edit = 0)', # 3
'Fuzzy Edit Distance Matcher (max_edit = 1)', # 4
'Fuzzy Edit Distance Matcher (max_edit = 2)', # 5
'Neural Matcher (threshold = 0.8)', # 6
'Neural Matcher (threshold = 0.6)', # 7
'Neural Matcher (threshold = 0.5)', # 8
]
As highlighted in the example above, we can have the same component multiple times, with different parameters.
The way the Sequential Grounder will work is to try to ground and successfully return k
candidates.
It will first try Exact Matcher
(component 1
). If this component already returns k
(or more) candidates, we do not proceed to the next component.
However, if this component only returns n
candidates, where n < k
, then we proceed to the next component Fuzzy Slop Matcher (slops=[0,1,2,4])
(component 2
) and
try to successfully return k-n
candidates. The same procedure continues until ee get to k
candidates or we used all the components.
Therefore, it is wise to order these components based on their precision, where the higher precision components come first.
The implementation of early stopping is done using Stream
in scala:
components.toStream.flatMap { grounder =>
grounder.ground(text, context, groundingTargets, k)
}.take(k)
Each grounder
returns a stream. We have a take(k)
at the end, which ensures that we only to the minimum amount of work necessary to populate the result with k
items.
This early stopping behavior is specific to Stream
. Please see the concepts of lazy collection
if you are interested in reading more.
We alllow dynamic configuration of the Sequential Grounder
, using reference.conf
. Please see an example below.
sieve {
# How many results to return (overridable)
k = 5
# Details of each component
# Contains details such as:
# - type (this is from a predefined list)
# - fieldNames (over which fields will it operate)
#
component1 {
name = "Exact Matcher"
type = "exact_matcher"
# What fields to search on
# Order is important; Left -> More important (will be searched first)
# fieldNames = ["name", "synonym1", "synonym2", "synonym3", "synonym4", "synonym5", "synonym6", "synonym7", "synonym8", "synonym9", "synonym10"]
fieldNames = [["name", "synonym1", "synonym2", "synonym3", "synonym4", "synonym5", "synonym6", "synonym7", "synonym8", "synonym9", "synonym10"], ]
}
component2 {
name = "Fuzzy Edit Distance Matcher"
type = "fuzzy_editdistance_matcher"
# fieldNames = ["name", "synonym1", "synonym2", "synonym3", "synonym4", "synonym5", "synonym6", "synonym7", "synonym8", "synonym9", "synonym10", "description"]
fieldNames = [["name", "synonym1", "synonym2", "synonym3", "synonym4", "synonym5", "synonym6", "synonym7", "synonym8", "synonym9", "synonym10"], ["description"]]
editDistance = 2
}
component3 {
name = "Fuzzy Slop Matcher"
type = "fuzzy_slop_matcher"
# fieldNames = ["name", "synonym1", "synonym2", "synonym3", "synonym4", "synonym5", "synonym6", "synonym7", "synonym8", "synonym9", "synonym10", "description"]
fieldNames = [["name", "synonym1", "synonym2", "synonym3", "synonym4", "synonym5", "synonym6", "synonym7", "synonym8", "synonym9", "synonym10"], ["description"]]
slops = [1, 2, 4, 8]
}
component4 {
name = "Neural Matcher"
type = "neural_matcher"
modelPath = "/home/rvacareanu/projects_7_2309/skema_python/results/240128/onnx/model.onnx" # Where to load the model from (overridable)
threshold = 0.5 # Ground based on score (overridable)
}
}
Please notice that we enumerate each component like so: component1
, component2
, etc. Please use consecutive numbers, as when we create the grounder we stop once component<i>
is not present.
An explanation of these fields is as follows:
name
-> The name of the grounder. We use it only to specify additional details once the grounded has ended;type
-> This field specifies which type of grounder is that component. It has to be one of the following:exact_matcher
fuzzy_slop_matcher
fuzzy_editdistance_matcher
neural_matcher
fieldNames
-> This specifies on which fields from the document to do the searching. Please notice that it is a list of lists[["name", "synonym1"], ["description"]]
. This is because["name", "synonym1"]
will have the same priority, which is higher than["description"]
. In other words, we allow to do early stopping on fields as well, and we consider that a match on"name"
, for example, is better than a match on"description"
. Also, please notice the fieldssynonym1
, etc. The original documents contain a field calledsynonym
which is a list. We index every synonym in a different field. We add1
,2
, etc to differentiate between multiple synonyms.
Each component has its own specific fields. For example, a fuzzy_editdistance_matcher
has editDistance
. A fuzzy_slop_matcher
has slops
, and a neural_matcher
has modelPath
and threshold
. Please notice that a neural_matcher
does not have fieldNames
since it operates only over the candidate text.
We index id
, name
, description
, and synonyms
. Since the document can have multiple synonyms
, we index them as synonyms1
, synonyms2
, etc.
While this technique for indexing the synonyms
can be used to artificially differentiate between different synonyms, we do not do this.
We do not have any way to assess which synonym is better, so we treat each synonym as equally good. If one wants to differentiate, please use fieldNames
like so [["name"], ["synonym1"], ["synonym2"], ["description"]]
(i.e., notice that instead of [["name", "synonym1", "synonym2"], ["description"]]
, we did [["name"], ["synonym1"], ["synonym2"], ["description"]]
, which means that we first try on name
, and if we successfully grounded k
, we stop. If not, we try synonym1
, etc).
The code for indexing is in org.clulab.scala_grounders.indexing.BuildIndex
, specifically the method called buildIndexFromPaths
.
We use a functional-like interface, which looks like this:
def ground(
text: String,
context: Option[String],
groundingTargets: Seq[DKG],
k: Int = 1
): Stream[GroundingResultDKG]
One peculiarity is that we do not provide an index. The initial implementation has been to index on the fly. While this can be ok for small-scale applications, if one repeteadly grounds over the same targets, it can become unnecessarily slow. To address this while maintaining the same simple interface, each grounder has a function called mkFast
, which receives an IndexSearcher
as parameter and returns a Grounder
. The idea is to allow each grounder to store the index locally. Please see the Fast*
variants (e.g., org.clulab.scala_grounders.grounding.FastExactGrounder
).
One can call mkFast
on an already Fast
grounder to change the underlying index.
For convenience, an overloaded version of mkFast
is provided, which takes a groundingTargets: Seq[DKG]
as parameter (instead of is: IndexSearcher
), but only for SequentialGrounder
. The underlying implementation is simple: just create an IndexSearcher
and call the original mkFast
.
We provide tests, where one can see how each grounder can be used in isolation together with some expected behavior.
Please see src/test/scala/org/clulab/scala_grounders
Either local sbt publishLocal
or on a repository.