Presto: SQL over big data, NoSQL and relational DBMS

What about using SQL to retrieve data from NoSQL databases? Or to make joins between tables stored in relational databases and big data sources? You can achieve that using Presto, a distributed query execution engine, which supports query execution on a wide range data source types.

Presto overview

To connect to data sources, Presto uses Connectors. Currently, there are connectors available to more than 20 types of data sources, including: Accumulo, Cassandra, MongoDB, Kafka, Kudu, Hive, Elasticsearch, Redis, Oracle, MS SQL Server, PostgreSQL and MySQL.

Presto was originally developed by Facebook, but since September 2019 it is maintained by the PrestoDB Foundation, started by Facebook, Uber, Twitter, and Alibaba. Hence, it is an open source software, which is available to download here. It is usefull for data scientists and for distributed database integration, and may be used in several contexts, including in data warehousing, data analytics operations and OLAP (Online Analytical Processing) workloads.

Architecture Overview

Presto is a distributed system composed by a Coordinator and one or more Workers. Although it may not be the best configuration, you can use the Coordinator and a Worker in the same machine.

Presto architecture overview

The Coordinator is responsible to coordinate query execution. It parsers and analyzes users’ queries and creates global execution plans. Then, the Coordinator assigns processing tasks to the Workers.

Each Worker can connect to one or more data sources (like relational DBMS and NoSQL databases) through the use of Connectors. Then, Workers submmit transformed queries to original data sources using the corresponding Connector.

Data sources execute low level queries and return the results for the Workers. No data is stored at Presto.

How to add data sources to Presto?

You may use several data sources at the same time in Presto, including relational and no-relational ones.

The first step is to install the Coordinator and one or more Workers. Then, you just have to install the right Connector and configure it for your data source. Configuration information include usual connection data, like host, port, username and password. You can add new data sources at any time. And it is also possible to configure more than one source of the same type.

How to submmit queries?

Presto provides an abstract relational view of underlying data sources, even though the used sources do not store data in relational tables. So, you can submmit SQL queries to Presto and it transforms such queries into ones supported by the used data source.

You may use most of common SQL functions and structures, including Joins, Group Bys and Havings. Hence, you may use Presto to execute a join between data stored in Oracle, MongoDB, Hive and Cassandra, for instance.

Also, you can use Gui clients to query Presto or using one of the several client libraries available to connect your application to Presto. Currently available libraries include ones for C, Java, Node.js, PHP, Python, R and Ruby. There also exist ODBC and JDBC drivers for Presto.

Leave a Reply

Your email address will not be published. Required fields are marked *