Language for making graph queries from data
Ophion queries are an array of graph operations. Each element of this array is itself an array, with the first element being the name of the operation and the rest of the elements parameters to this operation.
Given this schema (this is the current BMEG schema):
Here is an example query that starts from Individual vertexes from BRCA and traverses to all associated Variants called from MUTECT, and returns every Individual/Variant pair:
[["where",{"disease_code":"BRCA"}],
["mark","individual"],
["from","sampleOf","Biosample"],
["from","callSetOf","CallSet"],
["where",{"method":"MUTECT"}],
["fromUnique","variantCall","Variant"],
["mark","variant"],
["select",["individual","variant"]]]
This is the actual json payload. You can deliver this through any http method you wish to use. Here is an example query through curl that gets all Biosamples for a given Individual:
curl \
-X POST \
-H "Content-Type: application/json" \
-d '[["mark","individual"],["from","sampleOf","Biosample"],["mark","biosample"],["select",["individual","biosample"]]]' \
http://10.96.11.144/query/Individual
If you notice, the starting vertex label is in the url, while the query is posted as json.
The operations currently supported are:
Traversal operations help you navigate the graph:
- from
- to
- fromEdge
- toEdge
- fromVertex
- toVertex
- fromUnique
- toUnique
Mark and select work together to keep track of state during traversals, whereas gather does multiple queries at once:
- mark
- select
- gather
Filter operations help you cull or craft the results:
- where
- dedup
- limit
- order
- offset
- match
- values
Aggregate operations assemble metrics from a set of results:
- count
- aggregate
- groupCount
Each of these operations is explained in more detail below.
The two main traversal operations are from
and to
. Because Ophion works with directed graphs, there is a distinction between traversing across an edge in either direction. Either requires two additional pieces of information, an edge label and the ultimate destination label.
One upshot of this is that in order to craft traversals, you need to know the schema of the graph you are querying. The raw schema is defined in a protograph file (for BMEG the protograph file is here). This will show you all vertex labels and all edges they have to other labels.
If you want a visual representation then you can use the BMEG website code to generate a schema in cytoscape like the example above.
Either way, once you have your schema you know a few key things:
- vertex labels
- edge labels
- edge directions
These are the things you need to know to successfully traverse the graph.
The to
operation travels across outgoing edges. It requires two pieces of information: the edge label, and the destination label:
["to", $edgeLabel, $destinationLabel]
As an example say we are on a Variant vertex and want to see what Gene the Variant is in. The query would look like this:
["to", "variantIn", "Gene"]
The from
operation is the same as to
but going in the other direction. To demonstrate this, let's reverse the previous query and find every Variant that is in a given Gene (assuming we are already on a Gene vertex):
["from", "variantIn", "Variant"]
Say you are going from Variants to Genes. Every Gene has many associated Variants, so if you go this direction you will end up with a lot of the same Genes in your traversal. Sometimes this is what you want, but sometimes you don't care where they came from, you just want to know which Genes were implicated in your given set of Variants.
The toUnique
and fromUnique
operations work exactly the same as their non-unique counterparts, but the outcome will merge all identical vertexes.
The *Edge and *Vertex traversals are for navigating to and from the edge records between vertexes. Here is an example of going step by step between Gene and Variant (starting on Gene):
[["fromEdge", "variantIn"],
....
(currently on the edge)
....
["fromVertex", "Variant"]]
This might seem like just a more verbose way to get from Genes to Variants, but sometimes interesting information is stored on edges that you could filter on. And edges themselves may contain the information you are looking for, in which case you don't even need to continue the traversal.
As you may have guessed, to
and from
are just shorthand for the sequence toEdge/toVertex
or fromEdge/fromVertex
.
As you traverse through the graph sometimes you want to keep track of where you are. Also, often you care about what things are associated to what other things you were visiting previously. This is what mark
and select
are for.
To save the current point of your traversal you can call mark
with a label, and at the end of the traversal you can select
any of the mark
s you made along the way:
[["mark","individual"],
["from","sampleOf","Biosample"],
["mark","biosample"],
["select",["individual","biosample"]]]
This traversal starts at Individual, marks it, then traverses to Biosample, marks that as well, then finally select
s the two marked elements. This will return objects with the keys from the select
, whose values are the gids of the marked elements.
....
{"individual": "individual:TCGA-XE-ANJJ", "biosample": "biosample:tcga:TCGA-XE-ANJJ:normal"}
{"individual": "individual:TCGA-XE-ANJJ", "biosample": "biosample:tcga:TCGA-XE-ANJJ:tumor"}
....
The gather
operation let's you perform several traversals in the same query. Each key in the gather
operation represents a traversal unto itself, and the results for those traversals will be returned under those keys:
["gather": {"a": [....], "b": [....], "z": [....]}]
Each element in the dots is full a traversal.
Often you are making a query to find specific things, not just all vertexes or edges of one type. Filters allow you to craft what elements your traversal is concerned with.
The where
operation is the way to select specific elements during a traversal. For instance, if I start a query from Individual it will happily return all Individuals, but what if I am only concerned with Individuals with breast cancer? This is where where
comes in:
["where": {"disease_code": "BRCA"}]
This will work with any property in the vertex:
["where": {"gender": "MALE"}]
and can even work with intersections (AND):
["where": {"disease_code": "BRCA", "gender": "MALE"}] // probably zero results
The where
syntax is based off of mongo conditionals, so anything they can do you can do. Here are a couple of other useful examples.
Say you have a list of values and want to see if any of them match a given field. You can use $in
:
["where": {"disease_code": {"$in": ["BRCA", "SKCM", "LUAD"]}}]
We can have multiple keys to do an AND, what about an OR?
["where": {"$or" [{"disease_code": "BRCA"}, {"gender": "MALE"}]}] // more results this time
That covers most of the use cases, but there are many more.
The dedup
operator will cull duplicate elements. This rarely needs to be called on its own since we have the toUnique
and fromUnique
operators which combine it with a traversal (when it is most efficient to do so), but is provided for the cases when it still makes sense to invoke directly.
dedup
can be called on its own with no arguments (in which case it deduplicates based on gid), or you can supply the field. For instance, if you ran
["dedup", "gender"]
on all Individuals you would be left with one record with MALE as gender and one with FEMALE.
There are a large number of records in the db, and often times you don't need to return all of them. Use limit
to limit the number of records returned:
["limit", 10]
This will return only the first 10. When combined with offset
and order
you can emulate pagination.
offset
will drop a given number of records and return the rest. This would get all of the records the limit
above did not return:
["offset", 10]
The default ordering in the database is when the document was inserted (timestamp). If you want them to come back in a different order you can use the order
operation:
["order", {"disease_code", 1}]
A 1
stands for ascending order. Use -1
for descending.
If you only need certain keys from your results, values
is the way to go:
["values", ["gender", "disease_code"]] // filter out a lot of keys
Once you have all of your results, often you want some kind of aggregate metric to get a sense of what you are working with.
This is the simplest aggregate. Just find out how many things you have. Extremely useful for all of its simplicity:
["count"] // returns a number!
The slightly more sophisticated older brother of count
, groupCount
returns the counts for various values found in a given field:
["groupCount", "gender"] // probably not just two
The aggregate
operation can perform several aggregations at once, one for each key provided. There are three types of aggregations that can be performed, two of which are specific to numerical or ordered values.
The non-numeric aggregation is called terms
, and will return the top n values for a given field.
["aggregate" {"disease_code": {"terms": 10}}]
This will return a series of results with two keys, one of the field provided, and the other a count
detailing the number of records with that value.
{"disease_code":
[{"disease_code":"BRCA","count":1097},
{"disease_code":null,"count":959},
{"disease_code":"UCEC","count":548},
{"disease_code":"KIRC","count":537},
{"disease_code":"HNSC","count":528},
{"disease_code":"LUAD","count":522},
{"disease_code":"THCA","count":507},
{"disease_code":"LUSC","count":504},
{"disease_code":"PRAD","count":500},
{"disease_code":"COAD","count":459}]}
If you provide multiple aggregations the results will be under the keys specified.
If you are working with numeric data, you can get some sense of the range of values by sampling the values at specific percentages of the whole set. These percentiles are with respect to number of records, not the value, so if you ask for the 5th percentile of a field you will get the value that if you divide up the records into 20 equal sorted parts and take the lowest part, is the highest value for that field in that part.
It works like this (say we are querying Genes):
["aggregate": {"start": {"percentile": [5, 10, 80, 90]}}]
This returns the values at the 5th, 10th, 80th and 90th percentiles in a similar fashion to the terms
aggregation:
{"start":
[{"percentile":5,"start":5856674},
{"percentile":10,"start":11845709},
{"percentile":80,"start":125747664},
{"percentile":90,"start":155018520}]}
This aggregate works much like the other ones, but instead of a "top n" or percentile approach, you provide an interval and the histogram
aggregation divides up the records into bins with a width of the interval given, and tells you how many records are in each bin.
["aggregate": {"end": {"histogram": 50000000}}]
This one returns counts as well:
{"end":
[{"end":5.0E7,"count":9311},
{"end":1.0E8,"count":5757},
{"end":1.5E8,"count":3973},
{"end":2.0E8,"count":1650},
{"end":2.5E8,"count":716}]}
In this case the results are telling us that higher ending positions in Genes are increasingly rare, which could or could not be interesting.