AWS Redshift – An Overview
Amazon Redshift is a fast, fully managed, petabyte-scaled data warehouse solution that uses columnar storage to minimize
Input/Output (I/O), provide high data compression rates, and offer fast performance. As a typical Data
Warehouse, it is primarily designed for Online Analytic Processing (OLAP) and Business Intelligence (BI)
and not designed to be used as an Online Transaction Processing (OLTP) tool. It supports ANSI-SQL and is
a massively parallel processing database.
Redshift architecture (See Figure 01) consists of a tightly coupled EC2 Compute nodes cluster. The Redshift
cluster is built on a single availability zone in order to negate any network latency issue between availability
zones. Having all nodes in close proximity will reduce network latency and will improve performance.
Users can create one or more clusters with each cluster having multiple databases. Most of the time there
is one Redshift cluster and additional clusters can be added for resilience purposes. Any cluster can have
two types of nodes, namely a leader node and compute nodes.
The leader node facilitates the communication between the BI client and the compute nodes. Each leader node
has a SQL end point. It coordinates the parallel query execution. When a request comes to the leader node,
it parses the query and generates an execution plan and a compiled code to be executed in the compute nodes.
The compute nodes process the incoming requests in parallel. Each compute node has a dedicated CPU, memory
and a storage. Each compute node can scale out/in and scale up/down (resizing the Redshift cluster). Each
compute node consists of slices. The slices are portions of memory and disk. The data is loaded to the
slices in parallel. It has a “shared nothing” architecture. All compute nodes are independent of each other
and there is no contention between nodes.
Redshift can decide automatically how the data distributes between slices. Also a user can specify one column
as the distribution key. When a query is executed, the query optimizer on the leader node redistributes
the data on the compute nodes as needed in order to perform any joins and aggregations.
With time, as you load more and more data and apply DML commands, the performance can deteriorate. A US client
from the healthcare sector wanted to apply best practices on to their Redshift cluster in order to speed up
and improve its performance.
How Auxenta Helped
Leveraging Auxenta’s Redshift expertise, the team proposed a set of best practices and optimization strategies
for the client’s Redshift cluster installation.
The given Redshift cluster was analyzed based on the following key indicators.
Primary and foreign keys
Queries with alerts
Queries affected by work load management (WLM) configurations
Tables being used in queries with maximum impact on query performance
Percentage of queries being affected by tables
Tables scanned in select join queries
Select queries in peak CPU usage
Tables using peak CPU usage
Queue resources hourly
Queue resources hourly with CPU usage
Query patterns per user/group
WLM configurations for Redshift
Benefits to the client
Identify the causes and performance drawbacks in the Redshift cluster
Guidelines for improvement
Gaining a good knowledge about Redshift optimization knowledge