Redshift space utilization query by schema

11/16/2023

This is the third article in the ‘Data Lake Querying in AWS’ blog series, in which we introduce different technologies to query data lakes in AWS. In the first article of the series, we discussed how to optimise data lakes by using proper file formats (Apache Parquet) and other optimisation mechanisms (partitioning). We also introduced the concept of the data lakehouse and gave an example of how to convert raw data (most data landing in data lakes is in a raw format such as CSV) into partitioned Parquet files with Athena and Glue in AWS.

In that example, we used a dataset from the popular TPC-H benchmark, and generated three versions of the TPC-H dataset:

Raw (CSV): 100 GB – the largest tables are lineitem with 76 GB and orders with 16 GB, split into 80 files.
Parquets without partitions: 31.5 GB – the largest tables are lineitem with 21 GB and orders with 4.5 GB, also split into 80 files.
Partitioned Parquets: 32.5 GB – the largest tables, which are partitioned, are lineitem with 21.5 GB and orders with 5 GB, with one partition per day; each partition has one file and there are around 2,000 partitions per table. The rest of the tables are left unpartitioned.

In our second article, we introduced Athena and its serverless querying capabilities. As we commented, Athena is great for relatively simple ad hoc queries in S3 data lakes even when data is large, but there are situations (complex queries, heavy usage of reporting tools, concurrency) in which it is important to consider alternative approaches, such as data warehousing technologies.

In this article, we are going to learn about Amazon Redshift, an AWS data warehouse that, in some situations, might be better suited to your analytical workloads than Athena. We are not going to make a thorough comparison between Athena and Redshift, but if you are interested in the comparison of these two technologies and what situations are more suited to one or the other, you can find interesting articles online such as this one. Actually, the combination of S3, Athena and Redshift is what AWS proposes as a data lakehouse.

In this article, we will go through the basic concepts of Redshift and also discuss some technical aspects thanks to which the data stored in Redshift can be optimised for querying; then we will walk through a hands-on example to see how Redshift is used. We will also examine a feature in Redshift called Spectrum, which allows querying data in S3. In this example we will use the same dataset and queries used in our previous blogs, i.e. from TPC-H, and we will see how Spectrum performs, compared to using data stored in Redshift.

There are a few ways you can try to attack this. But first: trying to use a database engine for functions beyond querying the database is a waste of horsepower and the road to db lock-in. So I'm going to focus on ways to do this before the database.

The most complete way is to use a front-end system that clients connect to, and then this system in turn connects to the db. The one I've used in the past is pgbouncer-rr, which pools connections to the db but also allows for modifications to the SQL before it is sent on. This will do what you want, but you will need a computer to perform this work.

If you use the Redshift Data API, you could put a Lambda function in series which performs the SQL modifications you desire (but make sure you get your API permissions right). However, I expect it is unlikely that you are looking to move to an API access model.

Many SQL workbenches support variable substitution, so simple replacements in the SQL can be done by the bench. However, this is very dependent on which bench you use and on having all users' benches configured correctly.

Bottom line: if you want something to modify your SQL, do it before it goes to Redshift.
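To make the "modify the SQL before it reaches Redshift" idea a bit more concrete, here is a minimal sketch of a rewrite step in Python. The function name and its (username, query) arguments follow what I understand pgbouncer-rr's rewrite hook to look like, so treat that as an assumption and check the project README before relying on it. The ${name} placeholder syntax, the SUBSTITUTIONS table and the schema names are all made up for illustration.

```python
import re

# Hypothetical substitution table: placeholder name -> replacement text.
# These names are illustrative only and do not come from any real setup.
SUBSTITUTIONS = {
    "reporting_schema": "analytics_prod",
}

# Matches placeholders written as ${name} (an assumed convention).
_PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def rewrite_query(username, query):
    """Return the query with known ${name} placeholders replaced.

    Unknown placeholders are left untouched, so the database reports a
    syntax error instead of silently running altered SQL.
    The username argument is unused here; it is kept so that per-user
    rules could be added later.
    """
    def replace(match):
        name = match.group(1)
        return SUBSTITUTIONS.get(name, match.group(0))

    return _PLACEHOLDER.sub(replace, query)


if __name__ == "__main__":
    print(rewrite_query("alice", "SELECT count(*) FROM ${reporting_schema}.orders"))
```

The same pure function could just as well sit inside a Lambda in front of the Data API, since it only takes a string and returns a string. Keeping the rewrite logic free of any connection handling also makes it easy to unit test before it goes anywhere near production traffic.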

