Getting Started with Trino
This article is linked to the Advent Calendar 2022: Big Data Tools You’ve Heard Of But Never Tried.
Introduction
I had several candidates I wanted to explore for the Advent Calendar, including DataSketch and Iceberg, but I decided to try out Trino since it caught my interest.
What is Trino?
Trino originated with the members who developed Presto. Around 2018, they left Facebook and started a separate project, PrestoSQL, which was rebranded as Trino in 2020. You can read about the background of the rebrand in this blog post. Different companies have different approaches to community involvement, something I’ve experienced myself. This blog post provides insight into the Trino development team’s perspective on the relationship between a company and the community. Of course, this is only their side of the story, so if anyone knows more about the situation, I’d love to hear about it.
For a more comprehensive overview, I recommend this article: High-Performance Distributed SQL Engine “Trino” Quick Guide (Japanese). The claim that it’s faster than Hive is intriguing, though I haven’t personally used Hive to compare.
In this article, I’ll share my experience running Trino with Docker.
GitHub stats for trinodb/trino as of 2022/12/24:
| Watch | Fork | Star |
|---|---|---|
| 162 | 2.1k | 6.9k |
Getting Started with Trino on Docker
First, let’s start a Docker container and run the SQL CLI.

```shell
# First run: create and start a container named "trino"
docker run --name trino -d -p 8080:8080 trinodb/trino

# Subsequent runs: restart the existing container
docker start trino

# Open the Trino CLI inside the running container
docker exec -it trino trino
```
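Once the CLI is up, a quick sanity check is to list the available catalogs. In my environment the official image ships with a few built-in catalogs, though the exact list may vary by version:

```sql
-- Run inside the Trino CLI
SHOW CATALOGS;
-- Built-in catalogs such as system and tpch should appear here
```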
Running queries directly via docker exec makes loading data cumbersome, so I decided to use docker-compose. But first, let’s consider data storage.
From what I could see, Trino doesn’t have its own data storage functionality. The official documentation states that it serves a different purpose from typical RDBMSs like MySQL, which makes sense when you consider the storage aspect as well (see the Trino 403 documentation).
Trino provides connectors that let you execute SQL queries against external storage sources, including local files. It’s easiest to think of it as a query engine that can run SQL against external storage (see the Trino 392 documentation).
12/26 Update: Based on feedback on Twitter, I changed the volumes configuration in docker-compose to only override the properties files while keeping the files needed for startup intact. This makes it easier to use when you only need to change the catalog.
docker-compose.yml

```yaml
version: '3.7'
services:
  trino-eng:
    image: 'trinodb/trino:latest'
    hostname: trino-eng
    ports:
      - '8080:8080'
    volumes:
      - ./catalog:/etc/trino/catalog
    networks:
      - trino-network
  mysql:
    image: mysql:latest
    hostname: mysql
    environment:
      MYSQL_ROOT_PASSWORD: admin
      MYSQL_USER: admin
      MYSQL_PASSWORD: admin
      MYSQL_DATABASE: tiny
    ports:
      - '3306:3306'
    volumes:
      - ./data:/tmp/data
    networks:
      - trino-network
networks:
  trino-network:
    driver: bridge
```
To read data from a connected source, you need to create a properties file in the catalog directory with the database connection information; after starting the stack with `docker compose up -d`, Trino exposes one catalog per properties file, named after the file. Since I used MySQL for this example, I created mysql.properties.
mysql.properties

```properties
connector.name=mysql
connection-url=jdbc:mysql://mysql:3306
connection-user=root
connection-password=admin
```
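With the catalog in place, you can confirm from the Trino CLI that the MySQL database is visible. A minimal sanity check (the catalog name mysql comes from the properties file name):

```sql
-- Schemas in the MySQL catalog; "tiny" from docker-compose should appear
SHOW SCHEMAS FROM mysql;

-- Tables inside the tiny database
SHOW TABLES FROM mysql.tiny;
```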
I loaded data on the MySQL side beforehand: I downloaded the NOAA gsod2021 public dataset, imported it into the tiny database, and then ran count(*) from both engines.
I ran into a permissions issue when loading data into MySQL via Docker because I logged in as admin instead of root, which meant I didn’t have the privileges to import local files… I had mistakenly thought admin was the root user.
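For reference, here’s a minimal sketch of the import I ran as root. The file path and the assumption that a matching gsod2021 table already exists are mine, not from the dataset documentation:

```sql
-- LOCAL INFILE must be enabled on both server and client,
-- e.g. SET GLOBAL local_infile = 1; and mysql --local-infile=1
LOAD DATA LOCAL INFILE '/tmp/data/gsod2021.csv'
INTO TABLE tiny.gsod2021
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
IGNORE 1 LINES;
```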
```
trino> SELECT count(*) FROM mysql.tiny.gsod2021;
  _col0
---------
 4014939
(1 row)

Query 20221225_112631_00007_2mpa8, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
1.79 [1 rows, 0B] [0 rows/s, 0B/s]
```
```
mysql> select count(*) from gsod2021;
+----------+
| count(*) |
+----------+
|  4014939 |
+----------+
1 row in set (1.50 sec)
```
I ran each query three times and took the median response time. The difference was within the margin of error.
Next, I calculated the average temperature (Fahrenheit) across all observation stations for 2021.
Trino
```
trino> SELECT avg(temp) FROM mysql.tiny.gsod2021;
   _col0
-----------
 54.984913
(1 row)

Query 20221225_112649_00008_2mpa8, FINISHED, 1 node
Splits: 1 total, 1 done (100.00%)
27.16 [1 rows, 0B] [0 rows/s, 0B/s]
```
MySQL
```
mysql> select avg(temp) from gsod2021;
+--------------+
| avg(temp)    |
+--------------+
| 54.984911127 |
+--------------+
1 row in set (28.19 sec)
```
With just these two operations, the performance was roughly the same. Since I was running Trino with only a single worker, the processing wasn’t going to differ much from MySQL. Please note that this comparison doesn’t leverage Trino’s real strengths.
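To hint at what those strengths look like, here’s a sketch of a query that reads from two catalogs in a single statement, using the tpch catalog bundled with the official image (the tpch table names are standard; whether that catalog is enabled in your setup is an assumption):

```sql
-- One statement, two storage backends: MySQL and the TPC-H connector
SELECT
  (SELECT count(*) FROM mysql.tiny.gsod2021) AS gsod_rows,
  (SELECT count(*) FROM tpch.tiny.nation)    AS nation_rows;
```

Being able to combine or join sources like this, and to scale out workers, is where Trino differs from a single MySQL instance.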
Conclusion
I was only aware of Presto from when I worked with Hadoop professionally — I mostly just reviewed the results of my colleague’s evaluations. After transitioning from Hadoop to cloud-based platforms, it was about six years before I heard the name again. I had no idea that it had gone through various changes and been rebranded. But discovering that a tool I once knew is still evolving was genuinely exciting.
As stated in the official documentation, Trino isn’t a replacement for RDBMSs; it’s better described as a SQL engine that can analyze data across multiple sources, both cloud and on-premises. It’s worth considering for hybrid environments, and even in non-hybrid scenarios it may be worth evaluating, since tuning the number of worker nodes might yield performance improvements. This article didn’t explore the relationship between node count and performance, so I’d like to try that in the future, or see results from others who have tested it.
References

- We’re rebranding PrestoSQL as Trino (Trino blog)
- High-Performance Distributed SQL Engine “Trino” Quick Guide (Japanese)
- Installing Trino on Ubuntu Linux 20.04 LTS and Connecting to MySQL (Japanese)