Designing the Evita Query Language for the GraphQL API

In evitaDB, as in many other databases, you get data by describing which data you want. GraphQL, however, has its own syntax, so that description has to be expressed in a form that fits it.

A set of these questions is called a query. Each query contains several questions or hints that filter, sort, return, or format the desired data. We call these constraints. There are four basic types of constraints: head, filter, order, and require. Head constraints specify metadata, such as which collection of entities the query searches. Filter constraints filter entities by the defined conditions. Order constraints sort entities by their properties (attributes, prices, etc.). Finally, require constraints define what the output will contain: will it be just entities? How rich will these entities be? Will there be other data, such as a facet summary or parent entities?
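As a rough illustration of the four constraint types, the sketch below models a query as a small tree of constraints. It is written in Python for brevity. The `Constraint` class, the helper functions, and the constraint names (`collection`, `attributeEquals`, `attributeNatural`, `entityFetch`) are hypothetical and do not reproduce the actual evitaDB API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Constraint:
    """A named constraint with plain arguments and optional child constraints."""
    kind: str                                # "head", "filter", "order" or "require"
    name: str
    args: tuple[Any, ...] = ()
    children: tuple["Constraint", ...] = ()


# Hypothetical factory helpers, one per constraint.
def collection(entity_type: str) -> Constraint:
    return Constraint("head", "collection", (entity_type,))


def and_(*children: Constraint) -> Constraint:
    return Constraint("filter", "and", children=children)


def attribute_equals(attribute: str, value: Any) -> Constraint:
    return Constraint("filter", "attributeEquals", (attribute, value))


def attribute_natural(attribute: str, direction: str = "ASC") -> Constraint:
    return Constraint("order", "attributeNatural", (attribute, direction))


def entity_fetch() -> Constraint:
    return Constraint("require", "entityFetch")


def describe(query: list[Constraint]) -> dict[str, list[str]]:
    """Group the top-level constraints of a query by their type."""
    grouped: dict[str, list[str]] = {"head": [], "filter": [], "order": [], "require": []}
    for constraint in query:
        grouped[constraint.kind].append(constraint.name)
    return grouped


query = [
    collection("Product"),
    and_(attribute_equals("code", "iphone"), attribute_equals("visible", True)),
    attribute_natural("name"),
    entity_fetch(),
]
print(describe(query))
```

Each top-level constraint falls into exactly one of the four groups, and containers such as `and_` carry their children inside them.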

Original Java Query Language

Initially, because evitaDB is embeddable, we created the query language only in Java using basic POJOs and static factory methods for easier creation of individual constraints. These constraints can be nested inside constraint containers to represent more complex queries.

This let us express a wide variety of conditions and their combinations. It also allowed both us and Java developers using the embedded version of evitaDB to test evitaDB in a type-safe manner with code completion.

The Query Language for external APIs

Unfortunately, this Java approach cannot be used with external APIs like GraphQL, REST or gRPC, so we needed another way of declaring queries. Our first attempt was a DSL that copied the Java design of constraints but could be parsed from any string. We defined the DSL with the ANTLR4 library, which generated the lexer and the parser for us. Thanks to this parser, we were able to parse a query from any string coming from any API, although it lacked type safety and code completion (there was no time to build a custom plugin or LSP for IDE syntax validation and autocompletion). We used it only to build the gRPC API, because gRPC doesn’t allow simple objects with generics and arbitrary parameters like Java does. GraphQL and REST APIs, however, work with JSON objects, so we wanted to come up with something a little different that would fit JSON and could potentially be backed by the GraphQL or REST schema (unlike the plain string query).
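To give a feel for what parsing a query from a plain string involves, here is a minimal hand-written recursive-descent parser. It stands in for the ANTLR4-generated parser and is not derived from the real grammar. The string syntax it accepts (`name(arg, ...)` with single-quoted strings and integers) and the constraint names in the example are assumptions made for illustration.

```python
import re
from typing import Any

# One token: a name, a single-quoted string, an integer, or punctuation.
TOKEN = re.compile(
    r"\s*(?:(?P<name>[A-Za-z_]\w*)|(?P<string>'[^']*')|(?P<number>-?\d+)|(?P<punct>[(),]))"
)


def tokenize(text: str) -> list[tuple[str, str]]:
    """Split the query string into (kind, text) tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()
    return tokens


def parse(text: str) -> Any:
    """Parse a whole query string into nested {"name", "args"} dictionaries."""
    tokens = tokenize(text)
    node, pos = parse_value(tokens, 0)
    if pos != len(tokens):
        raise ValueError("trailing tokens after the query")
    return node


def parse_value(tokens: list[tuple[str, str]], pos: int) -> tuple[Any, int]:
    """Parse a literal or a nested constraint starting at tokens[pos]."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    kind, text = tokens[pos]
    if kind == "string":
        return text[1:-1], pos + 1          # strip the quotes
    if kind == "number":
        return int(text), pos + 1
    if kind == "name":
        return parse_constraint(tokens, pos)
    raise ValueError(f"unexpected token {text!r}")


def parse_constraint(tokens: list[tuple[str, str]], pos: int) -> tuple[Any, int]:
    """Parse name '(' [value (',' value)*] ')'."""
    name = tokens[pos][1]
    pos += 1
    if pos >= len(tokens) or tokens[pos] != ("punct", "("):
        raise ValueError(f"expected '(' after {name}")
    pos += 1
    args: list[Any] = []
    if pos < len(tokens) and tokens[pos] == ("punct", ")"):
        return {"name": name, "args": args}, pos + 1
    while True:
        arg, pos = parse_value(tokens, pos)
        args.append(arg)
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        if tokens[pos] == ("punct", ","):
            pos += 1
        elif tokens[pos] == ("punct", ")"):
            return {"name": name, "args": args}, pos + 1
        else:
            raise ValueError(f"unexpected token {tokens[pos][1]!r}")


print(parse("and(equals('code', 'iphone'), startsWith('url', '/iphone'))"))
```

The parser happily accepts any constraint name and any argument list, which is exactly the missing type safety described above: mistakes only surface after the string has been submitted.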

GraphQL API Query Language

When we were building the GraphQL API, we were amazed by the code completion and documentation available right at your fingertips when writing GraphQL queries, and by how much you can customize the form of the returned data. This gave us the idea to use our internal evitaDB schemas for catalogs and entities to build a mechanism that would generate an entire GraphQL schema from them. It would bring “automated” documentation and enable intelligent code completion that would guide users on what and how to query and help them avoid mistakes. But we didn’t stop there. We also wanted to generate a GraphQL schema for our query language itself, so that it could be used with the same code completion as the rest of the GraphQL queries.

Our requirements

In order to support all features of our original query language, we wanted:

the GraphQL Query Language to be similar to the original Java Query Language
to be able to define classifiers for each constraint (to locate data)
to be able to define values for comparison
to be able to define child constraints inside parent constraints
to support both a single constraint and an array of multiple constraints
to be able to combine all of the above

In addition, we wanted to use the power of the GraphQL schema and editors, so we added the following criteria:

see only those constraints that make sense for a particular entity collection
in nested constraints, see only constraints that make sense in that particular context
instead of rather generic constraints where a classifier and any comparable value are inserted as arguments, generate specific versions of those generic constraints based on entity schemas, to tell clients more precisely what data and data types they can query

Inspiration

We started by researching what other database designers have come up with in this area. We found several articles and pieces of documentation about attempts to create such DSLs.

The following examples are approaches we didn’t like much, because the editor cannot provide any code completion for them and a client would be annoyed by having to write queries full of special characters.

Then we came across other examples, which we quite liked for their expressiveness and potential for code completion.

Our approach

We mainly took inspiration from the EdgeDB approach, as they were also generating a GraphQL schema for querying from their internal database schema. What we didn’t like about several of these approaches was that the condition definition is nested inside an object together with other conditions, with an implicit AND between them. This forces developers to write lots of curly brackets to define simple constraints and complicates stating multiple conditions for the same data in OR conditions (it is possible, but you have to wrap them in another object). Therefore, our idea was to combine those conditions with the key to create a composite key containing a data locator and condition:

The value of this composite key would then simply be a comparable value (or nested child constraints) in some supported data type (in this case, the data type of the attribute code in our entity schema). This would also allow multiple statements with different conditions in the same parent object. But we weren’t the first ones to think of that. The article about the GORQ language shows a similar GraphQL syntax.
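To make the difference concrete, here is a small sketch that flattens a nested filter object (locator, then condition, then value) into composite keys. The nested input shape, the `attribute` prefix, and the camelCase joining are assumptions made for illustration, not necessarily the exact evitaDB syntax.

```python
from typing import Any


def capitalize_first(text: str) -> str:
    """Upper-case only the first character, leaving the rest untouched."""
    return text[:1].upper() + text[1:]


def flatten(nested: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Turn {locator: {condition: value}} into {attributeLocatorCondition: value}."""
    flat: dict[str, Any] = {}
    for locator, conditions in nested.items():
        for condition, value in conditions.items():
            key = "attribute" + capitalize_first(locator) + capitalize_first(condition)
            flat[key] = value
    return flat


nested_filter = {"code": {"equals": "iphone", "startsWith": "ip"}}
print(flatten(nested_filter))
# {'attributeCodeEquals': 'iphone', 'attributeCodeStartsWith': 'ip'}
```

Because each condition becomes its own key, two conditions on the same attribute sit side by side in one object, and each takes a single line.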

Unfortunately, unlike EdgeDB, where there are only object types and their fields, and thus the query language must handle only “generic” fields and object types, evitaDB contains more types of data in entities to query. For example, each entity can have attributes, prices, references, hierarchy references, facets and so on. Each type of data has its own constraints and you can even query inside some of them with other constraints. On top of that, each entity can support only some of these types based on its data.

To easily distinguish between the types of entity data, and to avoid duplicating conditions for multiple types of data when writing a query, we came up with prefixes for constraints. Each prefix indicates which type of data a constraint operates on:

generic - generic constraint, usually some kind of wrapper like and, or, or not
entity - handles properties directly accessible from the entity like the primary key
attribute - can operate on an entity’s attribute values
associatedData - can operate on an entity’s associated data values
price - can operate on entity prices
reference - can operate on entity references
hierarchy - can operate on an entity’s hierarchical data (the hierarchical data may be even referenced from other entities)
facet - can operate on facets referenced by an entity

On top of that, we decided that generic constraints would not use an explicit prefix, for better readability, and that some constraints would not need any classifier. This led us to the following 3 formats of composite keys that we support and use:

A complete simple constraint would look like this:
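As a hedged illustration, the sketch below builds keys in three shapes that follow from the rules above: condition only (generic constraints), prefix plus condition (no classifier), and prefix plus classifier plus condition. The camelCase joining, the `composite_key` helper, and the example names (`primaryKeyInSet`, `code`) are assumptions and may differ from the actual format.

```python
from typing import Optional


def capitalize_first(text: str) -> str:
    """Upper-case only the first character, leaving the rest untouched."""
    return text[:1].upper() + text[1:]


def composite_key(condition: str, property_type: Optional[str] = None,
                  classifier: Optional[str] = None) -> str:
    """Join the optional prefix, the optional classifier and the condition."""
    parts = [part for part in (property_type, classifier, condition) if part]
    return parts[0] + "".join(capitalize_first(part) for part in parts[1:])


print(composite_key("and"))                          # and
print(composite_key("primaryKeyInSet", "entity"))    # entityPrimaryKeyInSet
print(composite_key("equals", "attribute", "code"))  # attributeCodeEquals

# A complete simple constraint: composite key plus a comparable value.
constraint = {composite_key("equals", "attribute", "code"): "iphone"}
print(constraint)  # {'attributeCodeEquals': 'iphone'}
```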

JSON limitations

This is all due to the fact that JSON objects are very limited and don’t allow for named constructors, or at the very least factory methods, which would tell us which constraint we are dealing with when parsing the query. That’s why we and other developers decided to use JSON keys for this purpose: the key contains the name (or, in our case, multiple pieces of metadata) of a constraint, and the value contains only the comparable values for that constraint.

But this introduced a few new problems, mainly with child constraints. In Java, the constructor parameters specify whether a constraint accepts a list of constraints or a single one. In JSON, we first need to wrap each child constraint in another JSON object so that the name of the child constraint has a key to live in. The problem is that a client can then put multiple constraints into that wrapper object, even though the parent may accept only one child constraint. We could throw an error when that happens, but that would be unintuitive and would force the client to submit a query just to find out whether its structure is correct. Instead, we decided that each such wrapper object is translated into an implicit and constraint, with an AND relation between the inner constraints (an error is thrown only in edge cases where this wrapper AND container doesn’t make sense). This adds complexity to the query resolver, but it solves nearly all of the problems with child constraints. As a bonus, clients don’t have to write explicit and constraints if they are happy with the default AND relation. The filterBy constraint is a good example: it takes only one child constraint, but because that child is wrapped in an implicit and constraint, the client can list several conditions directly without using and at all.
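The translation of wrapper objects can be sketched as a small recursive resolver. This is not the actual evitaDB resolver. The output tree shape, the set of container names, collapsing a single-key wrapper to its one constraint, and rejecting an empty wrapper are all assumptions made for illustration.

```python
from typing import Any


def resolve(wrapper: dict[str, Any]) -> dict[str, Any]:
    """Translate a wrapper object into an explicit constraint tree.

    Several keys become an implicit `and`; a single key is returned as is.
    """
    resolved = [resolve_constraint(key, value) for key, value in wrapper.items()]
    if not resolved:
        # Assumed edge case: an empty wrapper has nothing to combine.
        raise ValueError("a wrapper object must contain at least one constraint")
    if len(resolved) == 1:
        return resolved[0]
    return {"name": "and", "children": resolved}


def resolve_constraint(key: str, value: Any) -> dict[str, Any]:
    """Resolve one key/value pair of a wrapper object."""
    if key in ("and", "or"):
        # These containers take a list of wrapper objects.
        return {"name": key, "children": [resolve(child) for child in value]}
    if key == "not":
        # `not` takes a single wrapper object.
        return {"name": "not", "children": [resolve(value)]}
    return {"name": key, "value": value}


# Two conditions side by side: combined by an implicit `and`.
filter_by = {"attributeCodeEquals": "iphone", "attributeUrlStartsWith": "/iphone"}
print(resolve(filter_by))

# An explicit `or` whose first child is itself an implicit `and`.
either = {"or": [
    {"attributeCodeEquals": "iphone", "attributeVisibleEquals": True},
    {"attributeCodeEquals": "ipad"},
]}
print(resolve(either))
```

The client never wrote `and` in either query, yet the resolved tree contains it wherever a wrapper object held more than one constraint.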

Examples of final solution

A generated GraphQL schema of a query looks similar to this:
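A minimal sketch of how such a schema fragment could be derived from an entity schema follows. The attribute list, the conditions available per data type, the `ProductFilterBy` type name, and the key format are hypothetical; only the general idea of generating one input field per attribute and condition comes from the text above.

```python
# Hypothetical entity schema: attribute name -> GraphQL scalar type.
PRODUCT_ATTRIBUTES = {"code": "String", "quantity": "Int"}

# Hypothetical conditions offered for each data type.
CONDITIONS_BY_TYPE = {
    "String": ("equals", "startsWith"),
    "Int": ("equals", "greaterThan"),
}


def capitalize_first(text: str) -> str:
    """Upper-case only the first character, leaving the rest untouched."""
    return text[:1].upper() + text[1:]


def filter_input_type(entity_type: str, attributes: dict[str, str]) -> str:
    """Render a GraphQL input type with one field per attribute and condition."""
    type_name = f"{entity_type}FilterBy"
    lines = [f"input {type_name} {{"]
    # Generic containers refer back to the same input type.
    lines.append(f"  and: [{type_name}!]")
    lines.append(f"  or: [{type_name}!]")
    lines.append(f"  not: {type_name}")
    for attribute, data_type in attributes.items():
        for condition in CONDITIONS_BY_TYPE[data_type]:
            key = "attribute" + capitalize_first(attribute) + capitalize_first(condition)
            lines.append(f"  {key}: {data_type}")
    lines.append("}")
    return "\n".join(lines)


print(filter_input_type("Product", PRODUCT_ATTRIBUTES))
```

Because only the combinations that exist in the entity schema are rendered, an editor can offer `attributeCodeStartsWith` but never `attributeQuantityStartsWith`.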

To illustrate how it can be used in practice, the next snippet shows the implicit and condition between equals and startsWith constraints inside a filterBy constraint container:

Another, more complex example shows an or constraint container with inner implicit and containers:

Finally, an example with nested child constraints that, in this case, allow completely different constraints than the parent filterBy container does (the relation scope and the entity scope define different sets of attributes):
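The idea of context-dependent constraints can be sketched as a runtime check. In the real API this is enforced by the generated GraphQL types rather than by code like this. The contexts, the allowed keys, and the `referenceBrandHaving` constraint name below are hypothetical.

```python
from typing import Any

# Hypothetical sets of constraint keys allowed in each context.
ALLOWED = {
    "Product": {"and", "or", "attributeCodeEquals", "referenceBrandHaving"},
    "Product.brand": {"and", "or", "attributeMarketEquals", "entityPrimaryKeyInSet"},
}

# Hypothetical constraints whose children live in a different context.
NESTED_CONTEXT = {"referenceBrandHaving": "Product.brand"}


def validate(wrapper: dict[str, Any], context: str) -> list[str]:
    """Return an error message for every key not allowed in its context."""
    errors: list[str] = []
    for key, value in wrapper.items():
        if key not in ALLOWED[context]:
            errors.append(f"{key} is not allowed in {context}")
        elif key in NESTED_CONTEXT:
            # Children are checked against the nested context.
            errors.extend(validate(value, NESTED_CONTEXT[key]))
        elif key in ("and", "or"):
            # Containers keep the current context for each child wrapper.
            for child in value:
                errors.extend(validate(child, context))
    return errors


valid = {
    "attributeCodeEquals": "iphone",
    "referenceBrandHaving": {"attributeMarketEquals": "EU"},
}
invalid = {"referenceBrandHaving": {"attributeCodeEquals": "iphone"}}

print(validate(valid, "Product"))    # []
print(validate(invalid, "Product"))  # ['attributeCodeEquals is not allowed in Product.brand']
```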

Conclusion

In the end, we chose this format in the hope that it would require fewer special characters and read more like English, which should make the language more intuitive. The disadvantage is the verbosity of the GraphQL query API (and we, of course, didn’t want to bring back COBOL), but we believe that an editor will auto-complete most of the query and developers will need to type only a few characters for each constraint. Another argument is that, with our approach, most complex queries fit onto one screen without scrolling, because a simple constraint usually takes just one line versus a minimum of three lines in the EdgeDB approach. We discussed this format with several front-end and back-end developers, and they all seemed to agree that, in our case, it could work much better than the approaches mentioned earlier. We applied the same approach to the order and require constraints as well, and it worked out quite nicely.

Author: Lukáš Hornych
Date updated: 12.1.2022