Designing the Evita Query Language for the GraphQL API
In evitaDB, as in many other databases, you get data by asking for exactly the data you want. GraphQL, however, has its own rules, so the query language needs a syntax tailored to it.
A set of these questions is called a query. Each query contains several questions or some sort of hints to filter, sort,
return or format the desired data. We call these constraints. There are four basic types of constraints: head, filter,
order and require. Head constraints specify metadata, such as which collection of entities the query will
search. Filter constraints filter entities by the defined conditions. Order constraints sort entities by their
entity properties (attributes, prices, etc.). Finally, require constraints define what the output data will contain:
will it be just entities? How rich will these entities be? Will there be some other data like a facet summary, parent
entities and so on?
Original Java Query Language
Initially, because evitaDB is embeddable, we created the query language only in Java using basic POJOs and static factory
methods for easier creation of individual constraints. These constraints can be nested inside constraint containers to
represent more complex queries.
This let us express a wide variety of conditions and their combinations. It also gave Java developers using the
embedded version of evitaDB, and us when testing evitaDB, type safety and code completion.
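To make the idea of factory methods and nested constraint containers concrete, here is a minimal sketch. It is written in Python rather than Java, and the class and function names (Constraint, filter_by, attribute_equals and so on) are hypothetical illustrations, not the actual evitaDB API.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Constraint:
    """A constraint: a name plus plain arguments and/or child constraints."""
    name: str
    args: Tuple[Any, ...] = ()

    def __str__(self) -> str:
        return f"{self.name}({', '.join(_render(a) for a in self.args)})"


def _render(value: Any) -> str:
    # Child constraints render recursively; plain values use their repr.
    return str(value) if isinstance(value, Constraint) else repr(value)


# Hypothetical factory functions, one per constraint.
def collection(entity_type: str) -> Constraint:
    return Constraint("collection", (entity_type,))


def filter_by(child: Constraint) -> Constraint:
    return Constraint("filterBy", (child,))


def and_(*children: Constraint) -> Constraint:
    return Constraint("and", children)


def attribute_equals(name: str, value: Any) -> Constraint:
    return Constraint("attributeEquals", (name, value))


def attribute_starts_with(name: str, prefix: str) -> Constraint:
    return Constraint("attributeStartsWith", (name, prefix))


# A head constraint and a filter container with two nested conditions.
query = (
    collection("Product"),
    filter_by(and_(
        attribute_equals("code", "iphone"),
        attribute_starts_with("name", "Apple"),
    )),
)
rendered = ", ".join(str(c) for c in query)
print(rendered)
```

The point of the sketch is only that containers such as and take other constraints as arguments, so arbitrarily deep queries can be composed from small calls.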
The Query Language for external APIs
Unfortunately, this Java approach cannot be used with external APIs like GraphQL, REST or gRPC. Therefore, we needed
another way of declaring queries. Our first attempt was to create a DSL with a parser that would copy the Java design of
constraints but could be parsed from any string. We achieved this by using the ANTLR4 library to define our DSL, from
which the lexer and the parser were generated. Thanks to this parser, we were able to parse a query from any string
coming from any API, although it lacked type safety and code completion (there was no time to build a custom plugin or
LSP for IDE syntax validation and autocompletion). We used it only to build the gRPC API because gRPC doesn’t allow
simple objects with generics and arbitrary parameters the way Java does. GraphQL and REST APIs, however, work with
JSON objects, so we wanted to come up with something slightly different that would fit the JSON format and could
potentially be backed by the GraphQL or REST schema (unlike the plain string query).
GraphQL API Query Language
When we were building the GraphQL API, we were amazed by the code completion and documentation available at your fingertips when
writing GraphQL queries, and by how much you can customize the form of the returned data. This gave us the idea to use our
internal evitaDB schemas for catalogs and entities to build a mechanism that would generate an entire GraphQL schema
from them. It would provide “automated” documentation and enable intelligent code completion that would
guide users on what and how to query and help them avoid mistakes in the query. We didn’t stop there: we also
wanted to generate a GraphQL schema for our query language itself, so that it benefits from the same code completion as
the rest of the GraphQL queries.
Our requirements
In order to support all features of our original query language, we wanted:
the GraphQL query language to be similar to the original Java query language
to be able to define classifiers for each constraint (to locate data)
to be able to define values for comparison
to be able to define child constraints inside parent constraints
to support both a single child constraint and an array of child constraints
to be able to combine all of the above
In addition, we wanted to use the power of the GraphQL schema and editors by meeting the following criteria:
see only those constraints that make sense for a particular entity collection
in nested constraints, see only constraints that make sense in that particular context
instead of rather generic constraints where a classifier and any comparable value are inserted as arguments, generate specific versions of those constraints based on entity schemas, to tell the client more precisely what data and data types they can query
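The last criterion can be sketched as a small generator that derives specific filter fields from an entity schema. This is only an illustration under assumptions: the schema is reduced to a map of attribute names to data type names, and the type-to-constraint table and the field naming are hypothetical, not what evitaDB actually generates.

```python
from typing import Dict, List

# Hypothetical table: which constraints make sense for which data type.
CONSTRAINTS_BY_TYPE: Dict[str, List[str]] = {
    "String": ["Equals", "StartsWith"],
    "Int": ["Equals", "GreaterThan"],
}


def filter_fields(attributes: Dict[str, str]) -> Dict[str, str]:
    """Return {field name: value type} for one entity collection."""
    fields: Dict[str, str] = {}
    for attr_name, attr_type in attributes.items():
        # Only constraints valid for this attribute's type are generated.
        for constraint in CONSTRAINTS_BY_TYPE.get(attr_type, []):
            fields[attr_name + constraint] = attr_type
    return fields


# Assumed attributes of a "Product" collection.
product_fields = filter_fields({"code": "String", "quantity": "Int"})
for name, value_type in product_fields.items():
    print(f"{name}: {value_type}")
```

Because every generated field carries both the attribute name and its value type, an editor working from such a schema can offer only fields that exist for the collection and can reject values of the wrong type.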
Inspiration
We started by researching what other database designers had come up with in this area. We found several
articles and documentation pages about attempts to create
such DSLs.
The following examples show approaches we didn’t like too much, because the editor cannot provide any code completion for
a client, who would also be annoyed by writing the query with all the special characters.
Mainly, we took inspiration from the EdgeDB approach, as they also generate a GraphQL schema for querying from
their internal database schema. What we didn’t like about several of these approaches was that the condition
definition is nested inside an object together with other conditions, with an implicit AND between them. This forces
developers to write lots of curly brackets to define simple constraints and complicates
stating multiple conditions for the same data in OR conditions (well, it doesn’t, but you would have to wrap them in
another object). Therefore, our idea was to combine the condition with the key to create a composite key
containing a data locator and a condition:
The value of this composite key would then simply be a comparable value (or nested child constraints) in some supported data
type (in this case, the data type of the attribute code in our entity schema). This would also allow multiple statements
with different conditions in the same parent object. But we weren’t the first ones to think of that.
The article about the GORQ language shows a similar
GraphQL syntax.
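The difference between the nested shape and the composite-key shape can be sketched with plain dictionaries standing in for JSON objects. The key names and the camelCase joining rule below are assumptions made for illustration only.

```python
from typing import Any, Dict


def to_composite_keys(nested: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Flatten {locator: {condition: value}} into {locatorCondition: value}."""
    flat: Dict[str, Any] = {}
    for locator, conditions in nested.items():
        for condition, value in conditions.items():
            # Join locator and condition into one camelCase key.
            flat[locator + condition[0].upper() + condition[1:]] = value
    return flat


# Nested shape: every locator needs its own object with conditions inside.
nested_filter = {"code": {"startsWith": "iph", "endsWith": "one"}}

# Composite shape: one key per (locator, condition) pair, no extra braces.
flat_filter = to_composite_keys(nested_filter)
print(flat_filter)
```

Both conditions on code live side by side in the same parent object, which is exactly what the composite key is meant to allow.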
Unfortunately, unlike EdgeDB, where there are only object types and their fields, and thus the query
language must handle only “generic” fields and object types, evitaDB contains more types of data in entities to
query. For example, each entity can have attributes, prices, references, hierarchy references, facets and so
on. Each type of data has its own constraints and you can even query inside some of them with other constraints. On
top of that, each entity can support only some of these types based on its data.
To easily distinguish between the types of entity data and to avoid duplicating conditions for multiple types of
data when writing a query, we came up with prefixes for constraints. Each prefix indicates the type of data a
constraint can operate on:
generic - generic constraint, usually some kind of wrapper like and, or or not
entity - handles properties directly accessible from the entity like the primary key
attribute - can operate on an entity’s attribute values
associatedData - can operate on an entity’s associated data values
price - can operate on entity prices
reference - can operate on entity references
hierarchy - can operate on an entity’s hierarchical data (the hierarchical data may even be referenced from other entities)
facet - can operate on facets referenced by an entity
On top of that, we decided that generic constraints would not use explicit prefixes, for better readability, and that some
constraints would not need any classifier. This led us to the following 3 formats of composite keys that we support and use:
A complete simple constraint would look like this:
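As a hedged sketch of how such keys could be taken apart again, the parser below handles three shapes that follow from the rules above: a bare constraint name (generic), a prefix plus constraint name (no classifier), and a prefix plus classifier plus constraint name. The camelCase joining, the constraint names and the example keys are assumptions for illustration, not the exact evitaDB syntax.

```python
from typing import NamedTuple, Optional

PREFIXES = ("entity", "attribute", "associatedData", "price",
            "reference", "hierarchy", "facet")
GENERIC_NAMES = ("and", "or", "not")
# Hypothetical constraint names, used only for this sketch.
CONSTRAINT_NAMES = ("equals", "startsWith", "between")


class ParsedKey(NamedTuple):
    property_type: str
    classifier: Optional[str]
    constraint: str


def _upper_first(text: str) -> str:
    return text[:1].upper() + text[1:]


def _lower_first(text: str) -> str:
    return text[:1].lower() + text[1:]


def parse_key(key: str) -> ParsedKey:
    # Format 1: a bare constraint name means a generic constraint.
    if key in GENERIC_NAMES:
        return ParsedKey("generic", None, key)
    for prefix in PREFIXES:
        rest = key[len(prefix):]
        if not key.startswith(prefix) or not rest[:1].isupper():
            continue
        # Try the longest constraint name first to avoid partial matches.
        for name in sorted(CONSTRAINT_NAMES, key=len, reverse=True):
            suffix = _upper_first(name)
            if rest.endswith(suffix):
                # Formats 2 and 3: whatever precedes the name is the classifier.
                classifier = rest[: len(rest) - len(suffix)]
                return ParsedKey(prefix, _lower_first(classifier) or None, name)
    raise ValueError(f"unrecognized constraint key: {key}")


for example in ("and", "priceBetween", "attributeCodeEquals"):
    print(example, "->", parse_key(example))
```

A real resolver would more likely look the key up in the generated schema than parse it, since classifiers may themselves end with a constraint name; the sketch only shows what information each key format carries.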
JSON limitations
This is all due to the fact that JSON objects are very limited and don’t allow for creating some sort of named
constructors or at the very least factory methods which would tell us which constraint we are dealing with when parsing the
query. That's why we and other developers decided to use JSON keys for this purpose, where the key contains the name (or
in our case multiple metadata) of a constraint and a value that contains only comparable values for that constraint.
But this introduced a few new problems, mainly with child constraints. In Java, we simply specify whether we want to support a
list of constraints or a single constraint through constructor parameters. In JSON, to do that, we first need to
wrap each child constraint in another JSON object to have access to the names of the child constraints. But then
a client can specify multiple constraints in that wrapper object, even though the parent constraint may accept
only one child constraint. We could just throw an error when the client does that, but that would be quite unintuitive
and would require the client to submit a query to find out whether it has the correct structure. Instead, we decided that each
such wrapper object would be translated into an implicit and constraint, i.e. an AND relation between its
inner constraints (throwing an error only in edge cases where this implicit and container doesn’t make sense). Such
an approach adds complexity to the query resolver, but on the other hand it solves nearly all of the problems
with child constraints. As a bonus, clients don’t have to use explicit and constraints if they are okay with the
default AND relation. This is useful in constraints such as filterBy, which takes only one child constraint:
because the child is wrapped inside an implicit and constraint, the client doesn’t have to write the and
constraint at all.
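The implicit and translation can be sketched as a small recursive resolver. Constraints are modeled as (name, value) tuples and JSON objects as dictionaries; the key names and the choice to reject an empty wrapper are assumptions of this sketch, not a description of the real resolver.

```python
from typing import Any, Dict, Tuple

Constraint = Tuple[str, Any]

SINGLE_CHILD = {"filterBy", "not"}  # value is one wrapper object
MULTI_CHILD = {"and", "or"}         # value is a list of wrapper objects


def resolve_wrapper(wrapper: Dict[str, Any]) -> Constraint:
    """Translate a wrapper object into an implicit and constraint."""
    if not wrapper:
        # Assumed edge case: an empty wrapper has nothing to combine.
        raise ValueError("a wrapper object must contain at least one constraint")
    return ("and", [resolve(key, value) for key, value in wrapper.items()])


def resolve(key: str, value: Any) -> Constraint:
    if key in SINGLE_CHILD:
        return (key, resolve_wrapper(value))
    if key in MULTI_CHILD:
        return (key, [resolve_wrapper(item) for item in value])
    # Leaf constraint: the value is a plain comparable value.
    return (key, value)


# filterBy accepts one child, yet the client may list two conditions.
implicit = resolve("filterBy", {
    "attributeCodeEquals": "iphone",
    "attributeNameStartsWith": "Apple",
})

# An explicit or whose items are themselves implicit and containers.
explicit_or = resolve("or", [
    {"attributeCodeEquals": "iphone", "attributeNameStartsWith": "Apple"},
    {"attributeCodeEquals": "pixel"},
])
print(implicit)
print(explicit_or)
```

Every wrapper object becomes exactly one constraint, so a parent that accepts a single child always receives a single child, no matter how many keys the client wrote.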
Examples of the final solution
A generated GraphQL schema of a query looks similar to this:
To illustrate how it can be used in practice, the next snippet shows the implicit and condition between equals and
startsWith constraints inside a filterBy constraint container:
Another, more complex example shows an or constraint container with inner implicit and containers:
Finally, an example with nested child constraints that, in this case, allow completely different constraints than
the parent filterBy container does (a different set of attributes is specified in the relation scope than in the entity
scope):
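The effect of context-dependent child constraints can be sketched as a check of which keys are allowed in which scope. In practice the generated GraphQL schema would enforce this; the scopes, attribute names and keys below (including referenceBrandHaving) are hypothetical and serve only to show the idea.

```python
from typing import Any, Dict, List, Set

# Hypothetical sets of keys allowed in each scope.
ALLOWED: Dict[str, Set[str]] = {
    "product": {"attributeCodeEquals", "attributeNameStartsWith",
                "referenceBrandHaving"},
    "product.brand": {"attributePriorityEquals"},
}
# Keys whose child object is validated against a different scope.
CHILD_SCOPE: Dict[str, str] = {"referenceBrandHaving": "product.brand"}


def invalid_keys(wrapper: Dict[str, Any], scope: str,
                 path: str = "filterBy") -> List[str]:
    """Return the paths of all keys that are not allowed in their scope."""
    errors: List[str] = []
    for key, value in wrapper.items():
        here = f"{path}.{key}"
        if key not in ALLOWED[scope]:
            errors.append(here)
        elif key in CHILD_SCOPE:
            # Nested constraints see the relation scope, not the entity scope.
            errors.extend(invalid_keys(value, CHILD_SCOPE[key], here))
    return errors


ok = {
    "attributeCodeEquals": "iphone",
    "referenceBrandHaving": {"attributePriorityEquals": 1},
}
wrong = {"referenceBrandHaving": {"attributeCodeEquals": "iphone"}}
print(invalid_keys(ok, "product"))
print(invalid_keys(wrong, "product"))
```

The same key, attributeCodeEquals, is valid directly under filterBy but rejected inside the reference container, because the relation in this sketch defines its own attributes.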
Conclusion
In the end, we chose this format in the hope that it would require fewer special characters and would read more like
English, which could greatly help with the intuitiveness of the language. The disadvantage is the verbosity of the
GraphQL query API (and we, of course, didn’t want
to bring back COBOL),
but we believe that most of the query will be auto-completed by an editor, so developers will need to write only a
few characters for each constraint. Another argument is that, with our approach, most complex queries fit onto one
screen without scrolling, because simple constraints usually take just one line versus a minimum of three lines
in the EdgeDB approach. We discussed this format with
several front-end and back-end developers, and they all seemed to agree that in our case, this approach could work much
better than the earlier mentioned ones. We applied this approach to the order and require constraints as well, and it
worked out quite nicely in comparison to the above-mentioned approaches.