This is the third article in the ‘Data Lake Querying in AWS’ blog series, in which we introduce different technologies to query data lakes in AWS. In the first article of the series, we discussed how to optimise data lakes by using proper file formats (Apache Parquet) and other optimisation mechanisms (partitioning). In our second article, we introduced Athena and its serverless querying capabilities. We also introduced the concept of the data lakehouse, and gave an example of how to convert raw data (most data landing in data lakes arrives in a raw format such as CSV) into partitioned Parquet files with Athena and Glue in AWS.

In that example, we used a dataset from the popular TPC-H benchmark and generated three versions of it:

- Raw (CSV): 100 GB – the largest tables are lineitem with 76 GB and orders with 16 GB, each split into 80 files.
- Parquet without partitions: 31.5 GB – the largest tables are lineitem with 21 GB and orders with 4.5 GB, also split into 80 files.
- Partitioned Parquet: 32.5 GB – the largest tables, lineitem with 21.5 GB and orders with 5 GB, are partitioned with one partition per day; each partition holds one file, and there are around 2,000 partitions per table. The rest of the tables are left unpartitioned.
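As a reminder of the conversion step, raw CSV data registered in the Glue catalog can be rewritten as partitioned Parquet with an Athena CTAS (CREATE TABLE AS SELECT) statement. The sketch below uses illustrative database, table, and S3 bucket names (`tpch.lineitem_raw`, `s3://example-bucket/...`) rather than the exact ones from our earlier article:

```sql
-- Hypothetical database, table, and bucket names; adjust to your environment.
-- Note: in an Athena CTAS, the partition column(s) must come last in the SELECT list.
CREATE TABLE tpch.lineitem_part
WITH (
    format            = 'PARQUET',
    external_location = 's3://example-bucket/tpch/lineitem_part/',
    partitioned_by    = ARRAY['l_shipdate']
) AS
SELECT l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity,
       l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
       l_commitdate, l_receiptdate, l_shipinstruct, l_shipmode, l_comment,
       l_shipdate   -- partition column goes last
FROM tpch.lineitem_raw;
```

Bear in mind that a single Athena CTAS query can write at most 100 partitions, so producing roughly 2,000 daily partitions requires follow-up `INSERT INTO` statements that each load a further batch of partitions.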