Posts

How to effectively update a table over a terabyte in size with Spark (perf tuning tips)

Partitioning: Ensure that your Delta Lake table is partitioned appropriately. Partitioning can significantly speed up queries by restricting the amount of data that needs to be scanned.
Caching: If you have enough memory available, you can cache the DataFrame in memory before performing any transformations or updates. This can be done using the .cache() method.
Bucketing: If applicable, consider bucketing your table. Bucketing can improve performance when you need to sample or aggregate data.
Tuning: Tune the number of shuffle partitions to match your cluster configuration. You can set this value using spark.conf.set("spark.sql.shuffle.partitions", num_partitions).
Optimize UDF: Make sure the UDF (User-Defined Function) generate_sha1_hash_key is efficient. You might also explore built-in Spark functions for hash calculations.
Cluster Configuration: Ensure that your Databricks cluster is correctly sized to handle a large dataset like this. You may need a cluster with...
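
As a rough illustration of the shuffle-partition and UDF tips above, here is a minimal sketch. The helper names (suggest_shuffle_partitions, apply_shuffle_partitions, sha1_key_reference, add_hash_key), the 128 MiB target partition size, the "|" separator, and the hash_key column name are all assumptions made for this example. The post does not show what generate_sha1_hash_key actually does, so the built-in version below assumes it hashes a separator-joined list of columns and may need adjusting to match existing keys.

```python
import hashlib
import math

MIB = 1024 * 1024


def suggest_shuffle_partitions(shuffle_bytes, total_cores, target_partition_bytes=128 * MIB):
    """Suggest a value for spark.sql.shuffle.partitions.

    Aims for partitions near target_partition_bytes, then rounds up to a
    multiple of total_cores so each wave of tasks keeps every core busy.
    The 128 MiB default is a rule-of-thumb assumption, not a Spark constant.
    """
    by_size = math.ceil(shuffle_bytes / target_partition_bytes)
    waves = max(1, math.ceil(by_size / total_cores))
    return waves * total_cores


def apply_shuffle_partitions(spark, num_partitions):
    """Set the shuffle partition count on an existing SparkSession."""
    spark.conf.set("spark.sql.shuffle.partitions", str(num_partitions))


def sha1_key_reference(values, sep="|"):
    """Plain-Python reference: SHA-1 hex digest of the separator-joined values."""
    return hashlib.sha1(sep.join(values).encode("utf-8")).hexdigest()


def add_hash_key(df, key_columns, sep="|", out_col="hash_key"):
    """Add a SHA-1 key column with Spark built-ins instead of a Python UDF.

    Note: concat_ws skips NULL inputs, so (a, NULL) and (a) hash the same.
    """
    from pyspark.sql import functions as F  # lazy import: helpers above need no Spark

    cols = [F.col(c).cast("string") for c in key_columns]
    return df.withColumn(out_col, F.sha1(F.concat_ws(sep, *cols)))
```

For example, with roughly 1 TiB of shuffled data on a 64-core cluster, suggest_shuffle_partitions(1024**4, 64) gives 8192. Built-in functions such as sha1 and concat_ws run inside the JVM, which avoids the per-row serialization cost of a Python UDF. Before switching, it is worth comparing a few rows against the existing UDF output to confirm the keys match.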

Teradata to Azure Synapse Migration - How to Convert Teradata DDL to Azure Synapse DDL

Azure Synapse Pathway simplifies your transition to a cutting-edge data warehouse platform by automating the translation of your current data warehouse code. This tool is not only user-friendly and intuitive but also free to use. By automating code translation, it expedites your migration to Azure Synapse Analytics. Key benefits of Azure Synapse Pathway include:
1. Automated Data Warehouse Migration: Effortlessly migrate your data warehouse to Azure Synapse Analytics with automated code translation.
2. Substantial Cost Savings on Migration: Reduce migration costs significantly, ensuring an efficient and cost-effective transition.
3. Accelerated Migration Timelines: Move from months to mere minutes for the migration process, saving time and resources.
4. DDL Conversion from Teradata
https://learn.microsoft.com/en-us/sql/tools/synapse-pathway/azure-synapse-pathway-overview?view=azure-sqldw-latest

Good Article about Apache Flink & AWS KDA

https://www.capitalone.com/tech/cloud/aws-apache-flink/