Soumil N Shah

Software Developer Based in New York City

Bachelor's in Electronic Engineering

Master's in Electrical Engineering

Master's in Computer Engineering


I am Soumil Nitin Shah, a Data Engineer Team Lead proficient in AWS, PySpark, and building data platforms. I played a pivotal role in crafting a scalable data ingestion framework that handles over 2 TB of data monthly, and I specialize in innovative solutions like the "LakeBoost" framework. Committed to knowledge sharing, my YouTube channel has grown past 42,000 subscribers, reflecting my dedication to educating and contributing to the tech community.

Learn More About Software Engineering and AWS on My YouTube Channel

Subscribe

People Recommendations

YouTube Subscribers

42,000+

Subscribers

LinkedIn Followers

6,700+

Followers

GitHub Followers

1,000+

Followers

Total Blogs Written

200+

Blogs

Total Videos on YouTube

1,600+

Videos

Monthly YouTube Views

80.3K+

Views

Monthly YouTube Watch Time

3.4K

Hours

Monthly YouTube Impressions

745.9K

Impressions





Popular Articles
How to Use S3 Object Tags for Iceberg Tables Created by EMR Serverless to Move Expired Snapshots into Glacier or Delete Them via Lifecycle Policy

The article explains how to use S3 object tags with Apache Iceberg tables created by Amazon EMR Serverless. By tagging the files that belong to expired Iceberg snapshots, an S3 lifecycle policy can transition them to Glacier or delete them outright, reducing storage costs without manual cleanup. Tagging also makes it easier to track data files across stages of processing, improving governance of large analytics pipelines.
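
As a rough sketch of the idea (the bucket name, object key, and tag key below are hypothetical), boto3 can tag the files of an expired snapshot and attach a lifecycle rule keyed to that tag:

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-iceberg-bucket"  # hypothetical bucket

    # Tag a data file that belongs to an expired Iceberg snapshot.
    s3.put_object_tagging(
        Bucket=BUCKET,
        Key="warehouse/db/orders/data/expired-file.parquet",  # hypothetical key
        Tagging={"TagSet": [{"Key": "iceberg-expired", "Value": "true"}]},
    )

    # Lifecycle rule: transition (or expire) any object carrying that tag.
    s3.put_bucket_lifecycle_configuration(
        Bucket=BUCKET,
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expired-snapshots-to-glacier",
                "Filter": {"Tag": {"Key": "iceberg-expired", "Value": "true"}},
                "Status": "Enabled",
                "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
            }]
        },
    )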


Read More

November 27, 2024

Learn How to Run Spark Streaming Hudi Jobs on New EMR Serverless 7.5.0

The article discusses leveraging Amazon EMR Serverless to run Apache Spark Streaming jobs for real-time data processing, focusing on integration with Apache Hudi. It highlights the simplicity and scalability of EMR Serverless for handling continuous data streams such as clickstreams, IoT data, or social media analytics. The platform eliminates the need for infrastructure management while supporting robust configurations for Spark applications. Key features include auto-scaling, cost-efficiency, and compatibility with a wide range of data sources and sinks.
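
For context, submitting such a job to EMR Serverless boils down to a single API call; the application ID, role ARN, and S3 paths below are hypothetical placeholders:

    import boto3

    emr = boto3.client("emr-serverless")

    # Submit a Spark Streaming + Hudi job to an existing EMR Serverless application.
    response = emr.start_job_run(
        applicationId="00abc123def456",  # hypothetical application ID
        executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
        jobDriver={
            "sparkSubmit": {
                "entryPoint": "s3://my-bucket/jobs/hudi_streaming_job.py",
                "sparkSubmitParameters": (
                    "--conf spark.serializer=org.apache.spark.serializer.KryoSerializer "
                    "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar"
                ),
            }
        },
    )
    print(response["jobRunId"])  # poll get_job_run(...) with this ID for status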


Read More

November 24, 2024

Sync Tables in All Three Formats (Hudi | Delta | Iceberg) with XTable and AWS Lambda: Automate, Schedule, or Trigger On-Demand

The article shows how to keep tables synchronized across three popular data formats, Hudi, Delta Lake, and Iceberg, using Apache XTable running in AWS Lambda. It walks through automating the sync on a schedule or triggering it on demand, and covers the challenges of maintaining consistency and performance when the same data is exposed through multiple storage formats. This approach lets organizations leverage the strengths of each format while maintaining data integrity and governance.


Read More

November 22, 2024

Federated Queries with Trino: Joining Data Across Multiple MySQL and PostgreSQL Databases (and Vice Versa) | Hands-On Labs for Beginners

The article explores how to perform federated queries in Trino, joining data across multiple MySQL and PostgreSQL databases. It explains how to configure Trino to connect to the different database instances and execute cross-database joins, enabling seamless querying of distributed data. The author provides practical examples and best practices for optimizing query performance and ensuring efficient data integration. This approach lets users apply Trino's powerful querying capabilities to data spread across multiple databases without moving or replicating it.
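
As a minimal illustration with the trino Python client (host, catalog, and table names are hypothetical), a single query can join tables living in two different databases:

    import trino  # pip install trino

    conn = trino.dbapi.connect(host="localhost", port=8080, user="admin")
    cur = conn.cursor()

    # One federated query joining a MySQL table with a PostgreSQL table.
    cur.execute("""
        SELECT o.order_id, o.total, c.email
        FROM mysql.shop.orders AS o
        JOIN postgresql.public.customers AS c
          ON o.customer_id = c.id
    """)
    for row in cur.fetchall():
        print(row)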


Read More

November 21, 2024

Building a Medallion Architecture with EMR Serverless and Apache Iceberg: An Incremental Data Processing Guide with Hands-On Code

The article explores how to implement the Medallion Architecture using EMR Serverless and Apache Iceberg. It outlines the three key layers (Bronze, Silver, and Gold) that structure the flow of raw, processed, and aggregated data, respectively, in modern data lakes. The author emphasizes how EMR Serverless simplifies infrastructure management while Apache Iceberg enables incremental processing between the layers, with hands-on code throughout. This approach enables organizations to build a robust, scalable pipeline that supports clean, high-quality analytics for business intelligence.
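
A condensed PySpark sketch of the three layers (the catalog, database, and path names are hypothetical, and the Spark session is assumed to be configured with an Iceberg catalog named "glue"):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("medallion-iceberg").getOrCreate()

    # Bronze: land raw events as-is.
    raw = spark.read.json("s3://my-bucket/raw/events/")
    raw.writeTo("glue.bronze.events").createOrReplace()

    # Silver: clean and deduplicate.
    silver = (
        spark.table("glue.bronze.events")
        .dropDuplicates(["event_id"])
        .filter(F.col("event_ts").isNotNull())
    )
    silver.writeTo("glue.silver.events").createOrReplace()

    # Gold: business-level aggregate for analytics.
    gold = silver.groupBy("event_type").agg(F.count("*").alias("event_count"))
    gold.writeTo("glue.gold.event_counts").createOrReplace()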


Read More

November 17, 2024

How to Use Publish-Audit-Merge Workflow in Apache Iceberg: A Beginner’s Guide

The article provides a beginner-friendly guide to the publish-audit-merge workflow in Apache Iceberg. It explains how to stage and audit data changes before they are merged into production tables, using Iceberg's snapshot-based version control to keep every change traceable. The author walks through merge operations that update data efficiently while preserving consistency. This workflow strengthens data governance, enabling users to handle large-scale datasets while maintaining high data integrity and transparency.


Read More

November 3, 2024

Moving Large Tables from Snowflake to S3 Using the COPY INTO Command and Hudi Bootstrapping to Build Data Lakes | Hands-On Labs

The article explains how to move large tables from Snowflake to S3 using the COPY INTO command, and then use Hudi bootstrapping to build a data lake over the exported files. It provides a step-by-step guide to optimizing the process, ensuring minimal disruption to operations during migration. The author highlights key considerations such as partitioning, parallel processing, and managing data consistency for efficient transfers. This approach offers a scalable path for migrating large datasets out of Snowflake while enabling easier integration with other cloud-based analytics tools.
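
The unload itself is one SQL statement; a hedged sketch using the Snowflake Python connector (the connection parameters, table, and storage integration names are hypothetical):

    import snowflake.connector  # pip install snowflake-connector-python

    conn = snowflake.connector.connect(
        account="my_account", user="my_user", password="...",
        warehouse="COMPUTE_WH", database="ANALYTICS", schema="PUBLIC",
    )
    cur = conn.cursor()

    # Unload a large table to S3 as Parquet; Hudi bootstrapping can then
    # build a table over these files without rewriting them.
    cur.execute("""
        COPY INTO 's3://my-bucket/snowflake-export/orders/'
        FROM ANALYTICS.PUBLIC.ORDERS
        STORAGE_INTEGRATION = MY_S3_INT
        FILE_FORMAT = (TYPE = PARQUET)
        HEADER = TRUE
    """)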


Read More

October 26, 2024

Getting Started with Apache Polaris Locally Using Docker Compose and Registering Your Iceberg Tables | Hands-On Labs for Beginners

The article provides a step-by-step guide to getting started with Apache Polaris locally using Docker Compose. It walks through setting up a local environment to run Polaris, an open-source catalog for Apache Iceberg tables, and registering Iceberg tables with it. The author covers the essential setup, including configuring Docker and deploying the Polaris components, so users can quickly spin up a local instance for development and testing. This guide is ideal for developers who want to explore Apache Polaris without a full cloud-based infrastructure.


Read More

October 20, 2024

Learn How to Use ClickHouse Materialized Views to Move Data from Kafka Topics into ClickHouse Tables in Real Time: A Beginner's Guide with Hands-On Labs

The article explains how to use ClickHouse materialized views to move data from Kafka topics into ClickHouse tables in real time. It demonstrates how materialized views automate processing by capturing and storing query results as new rows arrive from Kafka. The author provides practical examples for setting up the views to streamline ETL workflows and improve query performance. This approach lets users optimize their pipelines and scale ClickHouse for real-time analytics.
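
The core pattern is three DDL statements: a Kafka engine table that consumes the topic, a MergeTree target, and a materialized view wiring them together. A sketch via the clickhouse-connect client (host, broker, topic, and schema are hypothetical):

    import clickhouse_connect  # pip install clickhouse-connect

    client = clickhouse_connect.get_client(host="localhost")

    # Kafka engine table: reads the topic as it is consumed.
    client.command("""
        CREATE TABLE queue_events (user_id UInt64, action String, ts DateTime)
        ENGINE = Kafka
        SETTINGS kafka_broker_list = 'kafka:9092',
                 kafka_topic_list = 'events',
                 kafka_group_name = 'ch-consumer',
                 kafka_format = 'JSONEachRow'
    """)

    # Durable MergeTree table that will hold the data.
    client.command("""
        CREATE TABLE events (user_id UInt64, action String, ts DateTime)
        ENGINE = MergeTree ORDER BY (ts, user_id)
    """)

    # Materialized view: moves each consumed batch into the target table.
    client.command("""
        CREATE MATERIALIZED VIEW events_mv TO events AS
        SELECT user_id, action, ts FROM queue_events
    """)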


Read More

September 21, 2024

How to Use External Python Packages in a PySpark Job on EMR Serverless: A Beginner’s Guide

The article explains how to use external Python packages in a PySpark job on EMR Serverless. It outlines the steps to package and install Python dependencies, allowing users to extend the functionality of their PySpark jobs with additional libraries. The author highlights the integration process with Amazon S3 for storing custom Python packages, ensuring smooth execution of the job in a serverless environment. This approach provides flexibility and scalability for running data processing tasks while leveraging external Python tools in a cost-effective manner.
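
The commonly documented pattern is to pack a virtualenv into an archive, upload it to S3, and point the job at it; a hedged sketch (the application ID, role, and S3 paths are hypothetical):

    import boto3

    emr = boto3.client("emr-serverless")

    # A virtualenv packaged with venv-pack and uploaded to S3 is shipped to the
    # driver and executors via spark.archives; PYSPARK_PYTHON points inside it.
    params = " ".join([
        "--conf spark.archives=s3://my-bucket/envs/pyspark_deps.tar.gz#environment",
        "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python",
        "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python",
        "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
    ])

    emr.start_job_run(
        applicationId="00abc123def456",
        executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
        jobDriver={"sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/job_with_deps.py",
            "sparkSubmitParameters": params,
        }},
    )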


Read More

September 7, 2024

Master Apache Hudi Streamer: 15+ Hands-On Labs, Exercise Materials, and Videos - The Go-To Guide for Companies, Data Leaders, Engineers, and Developers

In this blog post, titled "Master Apache Hudi Streamer with 15 Hands-On Labs and Exercise Files," the author provides an in-depth guide to mastering Apache Hudi Streamer. The article includes 15 comprehensive hands-on labs and exercise files, designed to help readers gain practical experience and a deep understanding of Apache Hudi Streamer. It's an essential resource for data professionals looking to enhance their skills in data ingestion and management using Hudi.


Read More

July 20, 2024

Fast GeoSearch on Data Lakes: Learn How to Efficiently Build a Geo Search Using Apache Hudi for Lightning-Fast Data Retrieval Based on Geohashing

In this blog post, titled "Fast GeoSearch on Data Lakes: Learn How to Efficiently Build a Geo Search Using Apache Hudi for Lightning-Fast Data Retrieval Based on Geohashing," the author delves into the intricacies of geospatial data analysis. The article explores how to leverage Apache Hudi to build efficient geo search capabilities. It provides step-by-step methods for implementing geohashing and record-level indexing to achieve rapid data retrieval in data lakes, making it an invaluable resource for data engineers and analysts aiming to enhance their geospatial query performance.


Read More

July 19, 2024

Implementing Keyword Search in Hudi: Building Inverted Indexes with Record Level Index, Metadata Indexing, and Point Lookups | Text Search on Data Lakes

In my latest article, I delve into implementing efficient keyword search capabilities using Apache Hudi. By building inverted indexes and leveraging record-level indexing, I demonstrate how to optimize data lake queries for scalability and performance. Join me as I explore techniques for integrating these powerful search functionalities into your data ecosystem, empowering faster insights and improved data accessibility.


Read More

July 14, 2024

How to Use OpenAI Vector Embeddings and Store Large Vectors in Apache Hudi for Cost-Effective Data Storage with MinIO, Empowering AI Applications

Discover how to leverage OpenAI's vector embeddings to efficiently store and query large vectors in Apache Hudi. In my latest piece, I explore the integration of vector storage with Hudi's data lake architecture, highlighting its impact on scalability and real-time querying. Join me as I delve into practical strategies for incorporating advanced vector handling into your data workflows, enabling enhanced analytics and data-driven decision-making.


Read More

July 10, 2024

4 Different Ways to Fetch Apache Hudi Commit Time in Python and PySpark

Explore four distinct approaches to retrieve Apache Hudi commit times using Python. This LinkedIn article provides insights into leveraging Hudi's APIs, querying metadata tables, and utilizing Spark SQL to efficiently fetch commit times, offering practical examples and code snippets for seamless implementation and integration in data workflows.
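
One of the simplest of those approaches (the table path below is hypothetical): every row in a Hudi table carries a _hoodie_commit_time metadata column, so the latest commit can be read straight from the data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hudi-commits").getOrCreate()
    table_path = "s3://my-bucket/hudi/orders/"  # hypothetical table path

    # Every Hudi record carries the _hoodie_commit_time metadata column.
    latest = (
        spark.read.format("hudi").load(table_path)
        .selectExpr("max(_hoodie_commit_time) AS latest_commit")
        .first()["latest_commit"]
    )
    print(latest)

    # Alternatively, completed commits appear as timeline files under
    # <table_path>/.hoodie/ on storage, listable with boto3 or the Hadoop FS API.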


Read More

June 20, 2024

Multiple Spark Writers with Apache Hudi

Discover effective strategies for implementing multiple Spark writers with Apache Hudi. This LinkedIn post explores techniques to manage concurrent writes, optimize performance, and maintain data consistency across distributed environments using Hudi's capabilities, supported by practical examples and deployment considerations.


Read More

June 4, 2024

Ingesting Data from Apache Pulsar Using Hudi Delta Streamer: A Step-by-Step Guide

Explore effective methods for ingesting data from Apache Pulsar using Hudi DeltaStreamer. This LinkedIn post delves into configuring DeltaStreamer for seamless data ingestion, optimizing performance, and ensuring reliable integration between Apache Pulsar and Apache Hudi, supported by practical examples and deployment tips.


Read More

May 30, 2024

Mastering DeltaStreamer: Building Slowly Changing Dimensions with Incremental Record Fetching and SQL-Based Transformer

Discover advanced techniques for mastering DeltaStreamer and constructing slowly changing dimensions (SCDs) in your data pipelines. This LinkedIn article offers insights into optimizing DeltaStreamer configurations, handling dimension updates efficiently, and ensuring data consistency through practical examples and best practices.


Read More

May 22, 2024

Incremental ETL from Hudi Tables Using DeltaStreamer and Broadcast Joins for Faster Data Processing

The article provides a guide to performing incremental ETL from Apache Hudi tables using DeltaStreamer, and shows how broadcast joins can speed up the transformation step. It outlines the configuration and setup of the pipeline, including key parameters and options for efficient data extraction and transformation. Detailed examples and best practices help users manage incremental data ingestion and processing effectively.


Read More

May 20, 2024

Serving Data from Hudi Tables via Microservices Using Lambdas: Scaling for Thousands of Requests

The article explains how to serve data from Apache Hudi tables through microservices built on AWS Lambda, scaling to thousands of requests. It covers the setup process, including the necessary configurations and dependencies, and demonstrates how to expose RESTful APIs over Hudi data. Practical examples and step-by-step instructions help users integrate Hudi with microservices for real-time data serving.


Read More

May 12, 2024

Real-time Universal DataLakeHouse: Harnessing Debezium, Kafka, DeltaStreamer, Hive Metastore, MinIO, and Trino | Data Freshness Under 5 Minutes

Explore the concept of a real-time universal DataLakehouse powered by Debezium. This LinkedIn article discusses leveraging Debezium for change data capture (CDC), integrating it with Apache Hudi to maintain a unified DataLakehouse architecture. The post covers setup, configuration, and practical use cases to achieve seamless data synchronization and processing, highlighting the benefits of combining these technologies for modern data infrastructure.


Read More

April 7, 2024

Hands-On Guide: Reading Data from Hudi Tables Incrementally, Joining with Delta Tables using HudiStreamer and SQL-Based Transformer

The article provides a hands-on guide to reading data from Apache Hudi tables and joining it with Delta tables using Spark. It details the process of setting up the environment, configuring the necessary dependencies, and executing the queries for data integration. Practical examples and step-by-step instructions are included to help users effectively manage and query data across Hudi and Delta Lake.


Read More

April 3, 2024

Record Level Indexing in Apache Hudi Delivers 70% Faster Point Lookups

The article highlights the advantages of record-level indexing in Apache Hudi, which can deliver up to 70% faster point lookups. It discusses how this feature improves query performance by efficiently locating and retrieving specific records. Key implementation details and performance benchmarks are provided to illustrate the significant speed enhancements achieved through this indexing method.
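
For reference, enabling the record index comes down to a few write options (the table and field names below are hypothetical; configs as introduced in Hudi 0.14):

    # Hudi 0.14+ write options enabling the Record Level Index, which lives in
    # the metadata table and serves point lookups without scanning data files.
    hudi_options = {
        "hoodie.table.name": "orders",
        "hoodie.datasource.write.recordkey.field": "order_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.record.index.enable": "true",  # build the record index
        "hoodie.index.type": "RECORD_INDEX",            # use it during upserts
    }

    # df is an existing DataFrame keyed by order_id:
    # df.write.format("hudi").options(**hudi_options).mode("append").save(path)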


Read More

March 30, 2024

Building a Universal DataLakeHouse with Apache XTable, MinIO, StarRocks, and DeltaStreamer, and Interoperating Between Hudi, Iceberg, and Delta Tables

The article describes how to build a universal data lakehouse using Apache Hudi, XTable, MinIO, and StarRocks. It explains the integration of these technologies to create a robust and scalable data architecture that supports both data lake and data warehouse functionalities. Key components and configurations are highlighted to demonstrate the process of achieving efficient data management and analytics.


Read More

March 29, 2024

A Simple Config-Driven Python Template for Rapid DMS to S3 Integration | Single Task per Table Strategy

The article outlines a simple, configuration-driven Python template for rapid AWS Database Migration Service (DMS) to S3 integration, using a single task per table. It emphasizes the efficiency and flexibility of a template-based approach to streamline data migration and integration. Key features and configurations are discussed to ensure seamless, efficient movement of data from DMS into Amazon S3.


Read More

March 26, 2024

Mastering Incremental ETL with DeltaStreamer and SQL-Based Transformer

The article explores mastering incremental ETL processes using DeltaStreamer with an SQL-based transformer. It emphasizes the advantages of SQL-based transformations for managing incremental data ingestion, detailing how this approach simplifies ETL workflows and enhances data processing efficiency. Key practices and configurations for effective implementation are discussed to optimize performance in Apache Hudi environments.


Read More

March 17, 2024

Learn How to Use DeltaStreamer with XTable to Interoperate Between Hudi and Iceberg Using EMR Serverless | Hands-On Labs

In this blog, we explore how to leverage DeltaStreamer with XTable to seamlessly interoperate between Hudi and Iceberg. Discover the steps involved in building metadata and facilitating smooth data integration across different storage formats. Join us as we delve into the process of using DeltaStreamer and XTable to streamline data operations and drive insights in your organization.


Read More

March 16, 2024

Simplified Delta Streamer Job Management: A Structured Approach for Efficient Data Processing

The article discusses strategies to manage Delta Streamer jobs effectively by implementing a structured approach. It highlights the importance of optimizing configurations, streamlining job execution, and enhancing performance within Apache Hudi data pipelines. The article outlines key techniques and best practices to ensure efficient job management.


Read More

March 16, 2024

How to Properly Handle Updates and Deletes in Your Glue Hudi Spark Jobs When Working with CDC Data: Utilizing the _hoodie_is_deleted Flag

In this blog, we delve into the best practices for properly handling updates and deletes in your Glue-Hudi-Spark setup. Learn how to effectively manage data changes and maintain data integrity within your data lake architecture. Join us as we uncover the key strategies and techniques for handling updates and deletes, ensuring smooth data operations and maximizing the value of your data assets.
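
A minimal PySpark sketch of the flag in action (the DataFrame, field, and path names are hypothetical; cdc_df is assumed to carry a Debezium-style "op" column):

    from pyspark.sql import functions as F

    # Hudi deletes a record during upsert when _hoodie_is_deleted is true,
    # so CDC delete events can ride along in the same write as inserts/updates.
    payload = cdc_df.withColumn("_hoodie_is_deleted", F.col("op") == F.lit("d"))

    hudi_options = {
        "hoodie.table.name": "customers",
        "hoodie.datasource.write.recordkey.field": "customer_id",
        "hoodie.datasource.write.precombine.field": "updated_at",
        "hoodie.datasource.write.operation": "upsert",
    }

    (payload.write.format("hudi")
        .options(**hudi_options)
        .mode("append")
        .save("s3://my-bucket/hudi/customers/"))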


Read More

March 14, 2024

Leveraging OneTable with DeltaStreamer to Build Hudi, Iceberg, and Delta Lakes: A Hands-On Session with Labs

In this blog, we explore the power of leveraging OneTable DeltaStreamer to build seamless integration between Hudi, Iceberg, and Delta formats. Discover how this innovative approach simplifies the process of syncing and managing data across different storage formats. Join us as we uncover the steps to harnessing OneTable DeltaStreamer's capabilities, enabling efficient data operations and driving value for your organization.


Read More

March 5, 2024

Learn How to Use Hudi DeltaStreamer with Hudi 0.14 on AWS Glue: A Seamless Data Ingestion Guide

In this blog, we dive into the process of using Hudi DeltaStreamer 0.14 with AWS Glue for seamless data integration. We explore how DeltaStreamer, coupled with AWS Glue, streamlines the process of ingesting and processing data into Apache Hudi tables. Join us as we uncover the steps involved in leveraging these technologies to build efficient and scalable data pipelines, driving insights and innovation in your organization.


Read More

February 29, 2024

Building an Open Source Data Lake House with Hudi, Postgres Hive Metastore, Minio, and StarRocks

In this blog, we embark on the journey of building an open-source data lakehouse using Apache Hudi, a Postgres-backed Hive Metastore, MinIO for storage, and StarRocks as the query engine. We delve into the intricacies of integrating these technologies to construct a robust and scalable data lake architecture. Join us as we explore the steps involved in combining them into a lakehouse that drives actionable insights and accelerates innovation.


Read More

February 6, 2024

Learn How to Move Data From MongoDB to Apache Hudi Using PySpark

In this blog, we embark on a journey to learn how to move data from MongoDB to Apache Hudi using PySpark. PySpark serves as a powerful tool for extracting data from MongoDB and integrating it seamlessly into the Apache Hudi ecosystem. Join us as we explore the intricacies of this process, empowering you to leverage the combined capabilities of MongoDB and Apache Hudi for efficient data management and analytics.


Read More

January 20, 2024

Deleting Items from Apache Hudi using Delta Streamer in UPSERT Mode with Kafka Avro Messages

In this blog, we delve into the process of deleting items from Apache Hudi using Delta Streamer Upsert. We explore how Delta Streamer, a powerful tool in the Hudi ecosystem, facilitates the deletion of data records seamlessly. Join us as we uncover the intricacies of this process and its significance in maintaining data integrity and ensuring compliance with evolving regulatory requirements.


Read More

January 18, 2024

From Datalake to Microservices: Unleashing the Power of Apache Hudi's Record Level Index with FastAPI and Spark Connect

In this blog, we explore how Apache Hudi's Record Level Index can power microservices built with FastAPI and Spark Connect, turning the data lake into a low-latency serving layer. From elevating data quality to enabling real-time lookups, this combination lets organizations expose lake data directly to applications. Join us as we trace the journey from data lakes to microservices and the pivotal role Hudi's indexing plays in driving data-driven insights and innovation.


Read More

January 1, 2024

Getting Started with HUDI CLI on Local Machine Using Docker and Connecting to Your S3 Data

In his recent blog post, titled "Getting Started with HUDI CLI on Local Machine Using Docker and Connecting to Your S3 Data," Soumil Shah, a Lead Data Engineer with expertise in AWS, ELK, DynamoDB, and Apache Hudi, provides a comprehensive guide to empower local development with Apache Hudi CLI. The step-by-step instructions cover configuring the AWS profile, cloning the project, spinning up an AWS Glue container locally using Docker, creating a setup script, running the Hudi CLI, and connecting to S3 data. Shah emphasizes the simplicity and convenience of the setup, providing a shell script for easy installation. This guide allows users to seamlessly interact with their S3 data using powerful Hudi CLI commands, unlocking the potential of Apache Hudi for enhanced data management workflows on their local machines. The blog post invites readers to explore the significance of local Hudi CLI development and its transformative impact on data workflows.


Read More

December 30, 2023

Getting Started with Apache Hudi using DBT and Spark Backend with Glue Hive Metastore Locally in Minutes

In his recent blog post, titled "Getting Started with Apache Hudi using DBT and Spark Backend with Glue Hive Metastore Locally in Minutes," Soumil Shah, a Lead Data Engineer with expertise in AWS, ELK, DynamoDB, and Apache Hudi, provides a comprehensive guide for setting up a local environment that runs Apache Hudi with Spark as the backend, DBT for analytics, and the AWS Glue Hive Metastore. The step-by-step instructions include creating an AWS Glue profile, editing Docker Compose files, starting a Docker container with Jupyter and Spark, and configuring DBT dependencies. By following these steps, users can establish a powerful stack for exploring and analyzing data locally, customizing the setup to their own requirements: Apache Hudi for data versioning, Spark for processing, DBT for analytics, and the Glue Hive Metastore for efficient metadata management. The result is a hands-on foundation for data exploration and manipulation with cutting-edge data technologies.


Read More

December 24, 2023

Simplifying Big Data: Setting Up Spark SQL, Hive Thrift Server, and Hudi with Beeline in Minutes

In his latest blog post, titled "Simplifying Big Data: Setting Up Spark SQL, Hive Thrift Server, and Hudi with Beeline in Minutes," Soumil Shah, a Lead Data Engineer with expertise in AWS, ELK, DynamoDB, and Apache Hudi, provides a detailed guide on establishing a robust environment for big data processing. Focusing on the integration of Apache Spark, Hive Thrift Server, and Hudi, Shah outlines the steps to launch the Thrift Server, connect using Beeline, and create Hudi tables with Spark SQL queries. The blog emphasizes the efficiency of this setup in managing and querying large datasets, making it suitable for real-time analytics, large-scale data processing, and data warehousing projects. Shah's comprehensive instructions empower readers to explore the world of Spark SQL, Hive Thrift Server, and Hudi with Beeline, enhancing their capabilities in the realm of big data analytics and processing.


Read More

December 11, 2023

Real-Time Data Processing with Postgres, Debezium, Kafka, Schema Registry, and Delta Streamer: A Guide for Beginners

In the blog post titled "Real-Time Data Processing with Postgres, Debezium, Kafka, Schema Registry, and Delta Streamer Guide for Beginners," Soumil Shah introduces a powerful stack for establishing a robust real-time data processing pipeline. The stack comprises Postgres as the source database, Debezium for Change Data Capture (CDC), Kafka for event streaming, Schema Registry for Avro serialization, and Delta Streamer for continuous data ingestion into a Hudi-based data lake. The guide provides a hands-on walkthrough, including setting up Docker Compose, configuring Postgres Debezium Connector, and submitting a Spark job for Delta Streamer. The article highlights the benefits of this stack, such as low-latency data processing, scalability, and data consistency. By leveraging these technologies, organizations can build real-time analytics, monitoring, and decision-making systems, gaining a competitive edge in the fast-evolving business environment. The detailed instructions and explanations make it accessible for beginners and seasoned professionals alike. Readers are encouraged to explore the associated GitHub repository for further implementation details.


Read More

November 24, 2023

RFC-14: Step-by-Step Guide for Incremental Data Pull from Postgres to Hudi using DeltaStreamer

The article by Soumil Shah, titled "RFC-14: Step-by-Step Guide for Incremental Data Pull from Postgres to Hudi using DeltaStreamer," provides a comprehensive walkthrough of the process involved in incrementally pulling data from a PostgreSQL database to Apache Hudi using DeltaStreamer. Apache Hudi, an open-source data management framework, streamlines incremental data processing on large datasets. The guide covers essential steps such as setting up PostgreSQL, creating tables, inserting sample data, defining Hudi configurations, and submitting a DeltaStreamer job. The article emphasizes the intelligent updating capabilities of Hudi when changes are made to PostgreSQL data. By following these step-by-step instructions, users can integrate Hudi into their data pipelines for more robust and scalable solutions, enhancing data management and processing efficiency. For advanced configurations and additional details, readers are encouraged to refer to the official Hudi RFC-14 documentation.


Read More

November 22, 2023

Hudi Streamer (Delta Streamer) Hands-On Guide: Local Ingestion from Parquet Source

In his latest blog post, titled "Hudi Streamer (Delta Streamer) Hands-On Guide: Local Ingestion from Parquet Source," Soumil Shah explores the practical application of Apache Hudi's Hudi Streamer component. This hands-on guide provides step-by-step instructions for locally ingesting data from a Parquet source into Hudi format, eliminating the need for an EMR cluster. Covering essential steps such as software installation, dataset download, and configuration file creation, the guide empowers readers to independently explore Hudi Streamer's capabilities, emphasizing real-time data pipelines and efficient data ingestion into Hudi tables. Readers are encouraged to follow the provided link for a comprehensive understanding and hands-on experience with Hudi Streamer.


Read More

November 19, 2023

Breaking Down Data Silos: A Hands-On Guide to Omni-Directional Conversion with OneTable | Hands on Labs

In my recent blog post, "Breaking Down Data Silos: A Hands-On Guide to Omni-Directional Conversion with OneTable | Hands on Labs," I explore the transformative capabilities of OneTable, an omni-directional converter for table formats. The blog elucidates the significance of OneTable in promoting interoperability within data lakes and its pivotal role in the evolving landscape of lake house architecture. OneTable acts as a bridge to data harmony, supporting widely adopted open-source table formats such as Apache Hudi, Apache Iceberg, and Delta Lake. The hands-on labs provide a practical demonstration of OneTable's implementation, showcasing its ability to seamlessly convert data between formats like Hudi and Delta. The blog emphasizes OneTable's power in simplifying operations, fostering a more flexible and integrated data ecosystem. By breaking down data silos, OneTable unlocks new possibilities for data-driven insights and innovation. Additionally, I express gratitude to OneHouse for open-sourcing OneTable, a commendable contribution to the open-source community and a step towards addressing interoperability challenges. The complete hands-on lab exercise files and additional insights can be accessed through the provided link, offering a comprehensive resource for enhancing data management practices.


Read More

November 16, 2023

UPSERT Performance Evaluation of Hudi 0.14 and Spark 3.4.1: Record Level Index vs. Global Bloom & Global Simple Indexes

In my latest blog post, "UPSERT Performance Evaluation of Hudi 0.14 and Spark 3.4.1: Record Level Index vs. Global Bloom & Global Simple Indexes," I delve into the realm of data processing efficiency by conducting a comprehensive evaluation of indexing methods in Apache Hudi. Focusing on Record Level Index (RLI), Global Bloom, and Global Simple Indexes, all operating on Hudi 0.14 with Spark 3.4.1, the evaluation aims to provide insights into their performance metrics. The results showcase RLI's remarkable efficiency, outperforming Global Bloom and Global Simple Indexes in upsert operations. With RLI proving to be approximately 45.23% faster than Global Bloom and 25.19% faster than Global Simple Index, the blog underscores the pivotal role indexing methods play in optimizing data processing. The provided code snippets and detailed analysis offer valuable guidance for choosing the right indexing method tailored to specific application requirements. The complete code and additional insights can be explored in the accompanying exercise file, providing a comprehensive resource for enhancing data management and processing.


Read More

October 29, 2023

Accelerating Data Processing: Leveraging Apache Hudi with DynamoDB for Faster Commit Time Retrieval with Source Code

In my recent blog post, "Accelerating Data Processing: Leveraging Apache Hudi with DynamoDB for Faster Commit Time Retrieval," I address the challenges of large-scale data processing and introduce a powerful solution using Apache Hudi and DynamoDB. The focus is on enhancing commit time retrieval, crucial for various tasks such as point-in-time and incremental queries. The solution employs Hudi's HTTP callback feature to trigger a Lambda function, efficiently storing data in DynamoDB. This serverless approach ensures scalability, making it suitable for handling growing data volumes. The blog provides a detailed guide for setup and configuration, emphasizing the efficiency of this combination in simplifying downstream data processing and improving overall data operations. The complete code is available on GitHub for exploration.


Read More

October 14, 2023

Learn How to Ingest Data from PostgreSQL into Hudi Tables on S3 with Apache Flink and Flink PostgreSQL CDC Connector and Python

In my recent blog post, "Learn How to Ingest Data from PostgreSQL into Hudi Tables on S3 with Apache Flink and Flink PostgreSQL CDC Connector," I guide readers through the process of efficiently ingesting data from PostgreSQL into Hudi tables on Amazon S3 using Apache Flink and the Flink PostgreSQL CDC Connector. The tutorial covers the prerequisites, including Docker, Apache Flink, Flink PostgreSQL CDC Connector, and Apache Hudi, followed by a step-by-step guide on setting up PostgreSQL, configuring the source with the Flink PostgreSQL CDC Connector, defining the Hudi sink for efficient data storage, and executing the Flink job for real-time data processing.


Read More

September 26, 2023

Getting Started with Apache Flink Python: Reading Data from Kinesis Stream Locally

In my latest blog post, "Getting Started with Apache Flink in Python: Processing Kinesis Stream Data," I guide readers through the initial steps of working with Apache Flink. The hands-on lab explores how to set up Apache Flink, create a source table for a Kinesis stream, and read data from it using Python. The article covers the installation process, creation of Kinesis streams, publishing sample records, and the execution of Apache Flink Python code. It concludes with an overview of the PyFlink streaming application, emphasizing its potential for real-time data processing at scale. If you're interested in stream processing or data engineering, this blog provides valuable insights into harnessing Apache Flink's capabilities.
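
The heart of the lab is declaring the stream as a table; a hedged PyFlink sketch (the stream name and region are hypothetical, and the Kinesis SQL connector jar is assumed to be on the classpath):

    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    # Declare the Kinesis stream as a table readable with SQL.
    t_env.execute_sql("""
        CREATE TABLE kinesis_source (
            user_id STRING,
            event_type STRING,
            event_ts TIMESTAMP(3)
        ) WITH (
            'connector' = 'kinesis',
            'stream' = 'my-input-stream',
            'aws.region' = 'us-east-1',
            'scan.stream.initpos' = 'LATEST',
            'format' = 'json'
        )
    """)

    t_env.execute_sql("SELECT * FROM kinesis_source").print()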


Read More

September 20, 2023

Apache Hudi: Accelerating UPSERT with a Simple Index and Choosing the Right Number of Buckets and Partitions for UUID-based Workloads

The LinkedIn article discusses the use of Apache Hudi to enhance upsert operations and simplify index selection. It highlights the importance of efficiently handling data updates and deletions in big data processing. The article provides insights into how Apache Hudi's features can improve data management, making it easier to choose and manage indexes for optimizing query performance in large-scale data applications.


Read More

September 16, 2023

Mastering Event-Driven Data Ingestion: A Guide to Using Glue, SQS, and S3 Events | A Step-by-Step Guide

The LinkedIn article titled "Mastering Event-Driven Data Ingestion: A Guide using Glue" by Soumil Shah provides insights into effectively managing event-driven data ingestion using AWS Glue. The article delves into the concept of event-driven architecture, highlights the significance of real-time data processing, and showcases how AWS Glue can be leveraged to streamline and automate the ingestion process. It offers practical guidance and best practices for designing scalable and resilient data pipelines, ensuring efficient data extraction, transformation, and loading (ETL) operations. The article serves as a valuable resource for individuals seeking to enhance their understanding of event-driven data ingestion and its implementation using AWS Glue.


Read More

August 19, 2023

Sending Weekly/Daily CSV|Excel Reports from Hudi Transactional Datalake to Customers via Email using Glue and SES

The LinkedIn article "Sending Weekly/Daily CSV/Excel Reports from Hudi Datalake" authored by Soumil Shah provides insights into generating and sending regular CSV or Excel reports from a Hudi data lake. The article outlines a comprehensive approach to automate the process of creating and distributing these reports on a weekly or daily basis. It emphasizes the significance of Hudi as a data lake technology and demonstrates how to extract relevant data, transform it into CSV or Excel formats, and set up a scheduling mechanism for timely report delivery. This article serves as a practical guide for individuals aiming to establish a systematic and efficient workflow for report generation and distribution from a Hudi data lake.


Read More

August 13, 2023

Backfilling Apache Hudi Tables in Production: Techniques & Approaches Using AWS Glue by Job Target LLC

The LinkedIn article titled "Backfilling Apache Hudi Tables in Production: Techniques and Approaches" authored by Soumil Shah offers valuable insights into effectively backfilling Apache Hudi tables within a production environment. The article comprehensively discusses the challenges and considerations associated with backfilling data in Hudi tables, emphasizing the importance of maintaining data consistency and accuracy. It presents various techniques and approaches to address backfilling scenarios, including incremental backfilling, snapshot-based backfilling, and merging strategies. The article also highlights the significance of planning and testing to ensure a smooth backfilling process without disrupting ongoing operations. This article serves as a practical guide for individuals seeking to implement successful backfilling strategies in production using Apache Hudi, offering a deeper understanding of the complexities involved and best practices to overcome them.


Read More

July 20, 2023

Building Incremental Join Workflows between Hudi Tables and DynamoDB in AWS Glue: A Beginner's Hands-on Guide

The article explains how to create incremental join workflows using Hudi tables. Hudi is a data management framework that helps handle large datasets efficiently. It discusses the challenges of incremental joins and provides step-by-step instructions with code examples on implementing these workflows using Hudi. The article emphasizes the benefits of using Hudi for faster processing and accurate data management.


Read More

June 5, 2023

Hands-On Lab: Unleashing Efficiency and Flexibility with Partial Updates in Apache Hudi

The blog explores partial updates in Apache Hudi, which let a write modify only the columns that changed instead of rewriting entire records. It highlights the advantages of this approach, including smaller writes, reduced resource consumption, and simpler upsert logic. The author provides a practical, hands-on lab demonstrating how to configure and apply partial updates, helping engineers gain efficiency and flexibility in their Hudi pipelines.


Read More

May 18, 2023

Building Transactions in Apache Hudi Data Lake with Streaming ETL

This article discusses the process of building transactions in Apache Hudi data lake using streaming ETL. The author explains the benefits of using Apache Hudi, such as its ability to handle large-scale data lakes with a high degree of concurrency and its support for ACID transactions. The article then provides step-by-step instructions for building a streaming ETL pipeline using Apache Flink and Apache Hudi to ingest, transform, and write data to Hudi data lake. The author also covers the challenges and considerations that come with building such a pipeline, such as schema evolution and data consistency. Overall, this article is a useful guide for those looking to implement a transactional data lake using Apache Hudi and streaming ETL.


Read More

May 14, 2023

Revolutionizing Data Management: A Review of Hudi's Success Stories at Walmart, Uber, Grofers, and Robinhood

The article highlights the success stories of Apache Hudi in revolutionizing data management for organizations. Apache Hudi is an open-source data management framework that simplifies data processing and storage for large-scale distributed data systems. The article describes how Apache Hudi has been used by various organizations to improve their data management capabilities. It discusses specific use cases, such as data warehousing, data ingestion, and data quality control, where Apache Hudi has provided significant benefits. The article also highlights the key features of Apache Hudi, such as incremental data processing, efficient storage, and data versioning, that make it an ideal solution for modern data management needs. It also mentions the growing adoption of Apache Hudi among organizations, including Fortune 500 companies, due to its ease of use and flexibility. Overall, the article emphasizes the importance of efficient data management in today's data-driven world and how Apache Hudi is helping organizations achieve their data management goals. The success stories showcased in the article demonstrate the potential of Apache Hudi in improving data management processes for organizations of all sizes and industries.


Read More

May 12, 2023

LakeBoost: Maximizing Efficiency in Data Lake (Hudi) Glue ETL Jobs with a Templated Approach and Serverless Architecture, with Source Code

The project LakeBoost aims to maximize the efficiency of a data lake by using Apache Hudi and AWS Glue ETL. The project is designed to improve the speed and accuracy of data ingestion, processing, and retrieval. Apache Hudi is used to manage the data lake by enabling incremental data updates and efficient storage. This allows for faster data ingestion and retrieval, as well as better data quality control. AWS Glue ETL is used to extract, transform, and load data into the data lake, making the data available for analysis. The project uses a combination of Apache Hudi and AWS Glue ETL to create an efficient and scalable data lake solution. The data lake can handle large volumes of data and provides fast access to the data for analysis. The project also includes monitoring and logging features to ensure the reliability and accuracy of the data. Overall, LakeBoost provides a comprehensive solution for organizations looking to improve the efficiency and reliability of their data lake infrastructure.


Read More

May 8, 2023

How to Perform Radius-Based Search Using Spark and the Haversine Formula for Large-Scale Geospatial Data

The project is focused on implementing a radius-based search using Spark and Haversine formula. The goal of this project is to find all the data points that fall within a specified radius of a given location. The project uses Spark to perform data processing and Haversine formula to calculate the distance between two geographical coordinates. The Haversine formula is a mathematical formula used to calculate the great-circle distance between two points on a sphere, such as the Earth. The project includes steps to read and preprocess the data, filter data points based on their proximity to the given location, and calculate the distance using the Haversine formula. The final output of the project is a list of data points that fall within the specified radius of the given location. Overall, the project provides a solution for performing radius-based searches using Spark and Haversine formula. This solution can be useful for a wide range of applications that require location-based data analysis, such as marketing, logistics, and real estate.
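
A compact PySpark version of the distance filter (the input path and center point are hypothetical); built-in column functions keep the whole computation inside Spark, with no Python UDF:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("radius-search").getOrCreate()
    points = spark.read.parquet("s3://my-bucket/geo/points/")  # lat/lon in degrees

    center_lat, center_lon, radius_km = 40.7128, -74.0060, 10.0
    EARTH_RADIUS_KM = 6371.0

    # Haversine great-circle distance from the center point.
    dlat = F.radians(F.col("lat") - F.lit(center_lat))
    dlon = F.radians(F.col("lon") - F.lit(center_lon))
    a = (
        F.sin(dlat / 2) ** 2
        + F.cos(F.radians(F.lit(center_lat)))
        * F.cos(F.radians(F.col("lat")))
        * F.sin(dlon / 2) ** 2
    )
    distance_km = F.lit(2 * EARTH_RADIUS_KM) * F.asin(F.sqrt(a))

    within_radius = (points
        .withColumn("distance_km", distance_km)
        .filter(F.col("distance_km") <= radius_km))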


Read More

April 30, 2023

Efficiently Managing Ride and Late-Arriving Tips Data with Incremental ETL Using Apache Hudi: Step-by-Step Guide

The article explains how to efficiently manage ride data alongside late-arriving tips using incremental ETL with Apache Hudi. Because tips can arrive long after the original ride record, the guide shows how incremental processing and upserts keep the joined dataset accurate without reprocessing entire tables. Step-by-step instructions walk through building the pipeline end to end.


Read More

April 28, 2023

From Raw Data to Insights: Building a Lake House with Hudi and Star Schema | Step-by-Step Guide

The article describes the process of building a data lakehouse with Apache Hudi and creating a star schema to provide insights from raw data. It discusses the advantages of using a lakehouse architecture, such as its ability to support both batch and real-time processing, and its integration with existing data ecosystems. The author then explains how to use Hudi to perform incremental data processing and how to design a star schema for the lakehouse. Finally, the article concludes with a discussion of the benefits of using a lakehouse architecture for analytics and insights.


Read More

April 28, 2023

Unlocking Incremental Data in PySpark: Extracting from JDBC Sources without Debezium or AWS DMS with CDC

"Unlocking Incremental Data in PySpark - Extracting from JDBC" by Soumil Shah explains how to extract new or updated data from a JDBC data source using PySpark's JDBC API and Window function. The article emphasizes the importance of incremental data extraction in big data projects and provides sample code to illustrate the process. Overall, the article offers a concise guide to extracting incremental data from a JDBC data source using PySpark.


Read More

April 19, 2023

Step-by-Step Guide to Incrementally Pulling Data from JDBC with Python and PySpark

The article provides a step-by-step guide to incrementally pull data from JDBC sources using Python. The process involves setting up a PostgreSQL database, installing and configuring the Python library for PostgreSQL, and writing Python code to retrieve data from the database. The author also discusses the process of using an incremental approach to retrieve data, which involves retrieving only new or updated records since the last time the data was retrieved. This can be done using a timestamp column and a parameter in the SQL query. Finally, the author provides some tips for optimizing the performance of the data retrieval process.
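
A minimal PySpark sketch of that watermark pattern (the connection details and table/column names are hypothetical; in practice the watermark would live in a checkpoint store rather than a variable):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("incremental-jdbc").getOrCreate()

    last_watermark = "2024-01-01 00:00:00"  # loaded from a checkpoint store

    incremental = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://db-host:5432/shop")
        .option("driver", "org.postgresql.Driver")
        .option("user", "etl").option("password", "...")
        # Pull only rows changed since the last successful run.
        .option("query", f"SELECT * FROM orders WHERE updated_at > '{last_watermark}'")
        .load()
    )

    # Persist the new high-water mark for the next run.
    new_watermark = incremental.agg({"updated_at": "max"}).first()[0]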


Read More

April 18, 2023

Serverless Data Engineering: How to Generate Parquet Files with AWS Lambda and Upload to S3

The article discusses how to implement serverless data engineering using AWS Lambda to generate Parquet files. The author explains the benefits of using Parquet files, such as efficient storage and faster processing, and provides step-by-step instructions for creating a Lambda function to generate Parquet files from JSON data. The author also covers how to configure AWS S3 buckets to store the Parquet files and how to trigger the Lambda function using AWS EventBridge. Overall, the article provides a helpful guide for those looking to implement serverless data engineering with Parquet files on AWS Lambda.
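
A minimal Lambda handler along these lines (the bucket and key are hypothetical; pandas and pyarrow are assumed to be available via a layer):

    import io
    import json

    import boto3
    import pandas as pd  # with pyarrow, packaged in a Lambda layer

    s3 = boto3.client("s3")

    def handler(event, context):
        # Turn incoming JSON records into a Parquet file in memory.
        records = json.loads(event["body"]) if "body" in event else event["records"]
        df = pd.DataFrame(records)

        buffer = io.BytesIO()
        df.to_parquet(buffer, index=False)  # requires pyarrow

        s3.put_object(
            Bucket="my-data-bucket",          # hypothetical bucket
            Key="exports/batch.parquet",
            Body=buffer.getvalue(),
        )
        return {"statusCode": 200, "body": "parquet written"}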


Read More

April 15, 2023

Powering a Downstream Relational Database (Aurora Postgres) from an Apache Hudi Transactional Data Lake with CDC | Step-by-Step Guide

Let's explore how to create an incremental processing pipeline that powers downstream applications and systems from a transactional data lake. This demonstrates how to use incremental batch processing to feed an Aurora Postgres database from Hudi. With incremental batch processing, we load the CDC events from the transactional data lake into an Aurora landing zone; from there, we dedupe the data, clean it up, and promote it to the staging area. This way you bring in only the data you need for operational purposes and save on cost, since the petabyte-scale lake lives on S3, which also gives you high availability. You can further power your analytical workloads by querying the petabyte-scale data lake on S3 using Athena or Redshift Spectrum.


Read More

March 8, 2023

Architecture: Powering Downstream Systems with CDC from a Hudi Transactional Data Lake

This architecture uses incremental queries with Change Data Capture, introduced in Hudi 0.13, to capture changes in a Hudi transactional data lake and power downstream systems, whether search applications like Elasticsearch, relational databases (Postgres, MySQL), or non-relational databases (DynamoDB, MongoDB), or simply to explore the data and run ad hoc OLAP queries in Redshift Spectrum.


Read More

March 1, 2023

Stream Changes Straight from DynamoDB to Hudi with Kinesis and Flink

This article discusses how to stream changes from DynamoDB to Apache Hudi data lake using Kinesis and Flink. The author explains the benefits of using this approach, such as the ability to handle high write rates and the support for incremental updates to Hudi data lake. The article then provides step-by-step instructions for building a pipeline using AWS Kinesis Data Streams to stream data changes from DynamoDB to Apache Flink for transformation and finally to Apache Hudi data lake for storage. The author also covers the challenges and considerations that come with building such a pipeline, such as handling schema changes and ensuring data consistency. Overall, this article is a useful guide for those looking to implement a real-time data streaming pipeline from DynamoDB to Apache Hudi data lake using Kinesis and Flink.


Read More

January 14, 2023

Build a Production-Ready Real-time Transaction Data Lake with Apache Hudi

This article discusses how to build a production-ready real-time transaction data lake using Apache Hudi. The author explains the benefits of using Apache Hudi, such as its support for ACID transactions and the ability to handle large-scale data lakes with a high degree of concurrency. The article then provides step-by-step instructions for building a real-time data pipeline using Apache Kafka and Apache Flink to ingest, transform, and write data to Apache Hudi data lake. The author also covers the challenges and considerations that come with building such a pipeline, such as handling schema evolution and ensuring data consistency. Overall, this article is a useful guide for those looking to build a transactional data lake with Apache Hudi for real-time data processing.


Read More

December 16, 2022

Going Multi-Region for 0% Downtime High Availability Event-Driven Architecture

This article discusses how to achieve high availability in an event-driven architecture by going multi-region with zero downtime. The author explains the benefits of multi-region architecture, such as improved availability, better disaster recovery, and reduced latency. The article then provides a detailed guide for building a multi-region architecture using AWS services like Route 53, CloudFront, and API Gateway. The author also covers the challenges and considerations that come with building a multi-region architecture, such as handling data replication and ensuring consistency across regions. Overall, this article is a helpful guide for those looking to implement a multi-region architecture for high availability in their event-driven applications.
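As one concrete slice of such a setup, here is a hedged boto3 sketch of Route 53 failover routing between two regional endpoints; the hosted zone ID, domain, IP addresses, and health check ID are all placeholders.

```python
# Sketch: UPSERT a PRIMARY/SECONDARY failover record pair in Route 53.
import boto3

route53 = boto3.client("route53")

def upsert_failover_record(identifier, role, ip, health_check_id=None):
    record = {
        "Name": "api.example.com",          # placeholder domain
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,                    # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        # Route 53 fails over to SECONDARY when the primary health check fails.
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",       # placeholder zone
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

upsert_failover_record("us-east-1", "PRIMARY", "203.0.113.10", "hc-primary-id")
upsert_failover_record("us-west-2", "SECONDARY", "203.0.113.20")
```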


Read More

November 6, 2022

Evaluating Which Python Library Is Best Suited for Bulk Inserts into Aurora PostgreSQL | Speed Comparison

Amazon Aurora PostgreSQL is a fully managed, PostgreSQL-compatible, and ACID-compliant relational database engine that combines the speed, reliability, and manageability of Amazon Aurora with the simplicity and cost-effectiveness of open-source databases. Aurora PostgreSQL is a drop-in replacement for PostgreSQL and makes it simple and cost-effective to set up, operate, and scale your new and existing PostgreSQL deployments, freeing you to focus on your business and applications. In this article, I benchmark several Python libraries for bulk-insert speed and performance and share the findings with the community.
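A hedged sketch of the kind of timing harness such a comparison needs, here contrasting psycopg2's executemany with its batched execute_values helper; the connection details and table are placeholders, not the article's exact benchmark.

```python
# Sketch: time two psycopg2 bulk-insert strategies against the same table.
import time
import psycopg2
from psycopg2.extras import execute_values

rows = [(i, f"user_{i}") for i in range(100_000)]

conn = psycopg2.connect(
    host="my-aurora", dbname="appdb", user="writer", password="***"  # placeholders
)

def timed(label, fn):
    start = time.perf_counter()
    fn()
    conn.commit()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

with conn.cursor() as cur:
    # One round trip per row: usually the slowest option.
    timed("executemany", lambda: cur.executemany(
        "INSERT INTO users (id, name) VALUES (%s, %s)", rows))

with conn.cursor() as cur:
    # Batches many rows per statement: typically much faster.
    timed("execute_values", lambda: execute_values(
        cur, "INSERT INTO users (id, name) VALUES %s", rows))
```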


Read More

November 5, 2022

How to Get Email Alerts When Your Glue ETL Jobs Fail | Detailed Instructions with Code

Whenever a Glue job changes state, an event is sent to Amazon EventBridge. A rule matches failure events and forwards them to a Lambda function, which processes the event and publishes the details to an SNS topic, so subscribed users are notified by email.
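A minimal sketch of the Lambda end of this chain, assuming a placeholder SNS topic ARN; the event fields follow the standard "Glue Job State Change" EventBridge event shape.

```python
# Sketch: Lambda handler that turns a Glue state-change event into an SNS email.
import json
import boto3

# Matching EventBridge rule pattern (configured on the rule, not in code):
# {"source": ["aws.glue"],
#  "detail-type": ["Glue Job State Change"],
#  "detail": {"state": ["FAILED", "TIMEOUT"]}}

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:glue-job-alerts"  # placeholder

def lambda_handler(event, context):
    detail = event.get("detail", {})
    message = (
        f"Glue job '{detail.get('jobName')}' changed state to {detail.get('state')}.\n"
        f"Run ID: {detail.get('jobRunId')}\n"
        f"Message: {detail.get('message')}"
    )
    sns.publish(TopicArn=TOPIC_ARN, Subject="Glue job failure alert", Message=message)
    return {"statusCode": 200, "body": json.dumps("alert sent")}
```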


Read More

November 5, 2022

Smart Way to Capture, Monitor, and Report Status for Python Jobs Using DynamoDB Single-Table Design and Real-Time Alerts

In this article, I present a solution that lets you easily monitor and capture status for running jobs and tasks. Capturing these details allows us to determine how long a process takes, what its status is, and, if necessary, to dive into task-level details. When a job runs, it generates a unique process ID (a GUID) that represents the running work. The process has a start and end time and displays the status of ongoing activities. Each task in the process has a name, a start and end time, and a status; if a task fails, the process status is marked as failed. If a user needs more visibility into a function, they can simply wrap it with a decorator and all details for that task are captured in DynamoDB. I demonstrate how to design and implement this solution.
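A hedged sketch of the decorator idea, assuming a single-table design with a hypothetical table name and key schema (PK = process, SK = task); the article's actual schema may differ.

```python
# Sketch: decorator that records task start/end time and status in DynamoDB.
import time
import uuid
import functools
import boto3

table = boto3.resource("dynamodb").Table("job-monitoring")  # placeholder table
PROCESS_ID = str(uuid.uuid4())  # one GUID per job run

def track_task(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        item = {
            "PK": f"PROCESS#{PROCESS_ID}",
            "SK": f"TASK#{fn.__name__}",
            "start_time": int(time.time()),
            "status": "RUNNING",
        }
        table.put_item(Item=item)
        try:
            result = fn(*args, **kwargs)
            item.update(status="SUCCESS", end_time=int(time.time()))
        except Exception:
            # A failed task marks the record FAILED before re-raising.
            item.update(status="FAILED", end_time=int(time.time()))
            table.put_item(Item=item)
            raise
        table.put_item(Item=item)
        return result
    return wrapper

@track_task
def load_customers():
    ...  # task body goes here
```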


Read More

October 16, 2022

A Journey to Semantic Search with Elasticsearch: 80M Vectors (1.4TB)

We store and search over 100 million items in Elasticsearch. To achieve nearly 80X faster search, my team and I optimized the cluster, including selecting the appropriate instance size, fine-tuning shards and replicas, and enabling node- and shard-level caching. We can now search through 100 million items in under 1.5 seconds. We also created a semantic search engine that lets users query massive amounts of data using natural language rather than traditional keyword searches. In this article, we share insights such as our ingestion pipeline and some best practices.
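For a flavor of the semantic side, here is a hedged sketch assuming Elasticsearch 8.x and its Python client: a dense_vector mapping plus an approximate kNN query. The index name, vector dimension, and the embed() function are placeholders (embed is a hypothetical embedding call, not a library function).

```python
# Sketch: vector mapping and kNN semantic search in Elasticsearch 8.x.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

es.indices.create(
    index="resumes",
    mappings={
        "properties": {
            "text": {"type": "text"},
            "embedding": {
                "type": "dense_vector", "dims": 768,  # placeholder dimension
                "index": True, "similarity": "cosine",
            },
        }
    },
)

query_vector = embed("python developer with aws experience")  # hypothetical embedding fn

hits = es.search(
    index="resumes",
    knn={"field": "embedding", "query_vector": query_vector,
         "k": 10, "num_candidates": 100},
)
```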


Read More

August 20, 2022

Event Processing of Data Streams: Optimizing SQS Processing and Efficient End-User Querying

Real-time data is information that is available immediately after it is created and acquired. Rather than being stored, it is forwarded straight to users with no lag, which is critical for supporting live, in-the-moment decision-making and powering downstream consumers. In this project, we share some of the best practices my team and I implemented, which saved the company $1,000 in AWS costs. We also discuss cost-saving best practices for AWS Lambda and SQS, such as long polling.
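The long-polling pattern is simple enough to sketch directly; the queue URL is a placeholder and process() is a hypothetical message handler.

```python
# Sketch: SQS long polling. WaitTimeSeconds=20 makes receive_message wait for
# messages instead of returning (and billing for) empty responses.
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # placeholder

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,   # batch up to 10 messages per call
        WaitTimeSeconds=20,       # long polling: fewer empty receives, lower cost
    )
    for msg in resp.get("Messages", []):
        process(msg["Body"])      # hypothetical handler
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```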


Read More

August 9, 2022

Serverless Data Enrichment Pipeline Using Step Functions for the Sourcer Product (1,000 Requests/Second)

With JobTarget Sourcer you no longer need to bounce between resume sites or recruiting tools: perform a single search across resume sources, identify candidates that fit your criteria, and then unlock their contact details to get in touch. Sourcer lets you proactively seek out candidates with specific skills and experience. We wanted a real-time pipeline that enhances and enriches the data in Elasticsearch: recruiters contact job seekers constantly, and as soon as a recruiter contacts a candidate we enrich the record and broadcast it to the microservices subscribed to those events, connecting recruiter and job seeker. The system must be highly available with strong error handling, and stakeholders need a near-instantly refreshing dashboard that shows crucial data such as how many candidates were updated and how many enrichments failed, providing real-time visibility into the state of enrichment.
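A hedged sketch of the entry point such a pipeline might expose: one Step Functions execution per enrichment request. The state machine ARN is a placeholder; at roughly 1,000 requests/second an Express workflow would be the natural fit, though the article's exact configuration may differ.

```python
# Sketch: start one Step Functions execution per enrichment event.
import json
import boto3

sfn = boto3.client("stepfunctions")

def enqueue_enrichment(candidate_id, recruiter_id):
    sfn.start_execution(
        stateMachineArn=(
            "arn:aws:states:us-east-1:123456789012:"
            "stateMachine:sourcer-enrichment"  # placeholder ARN
        ),
        input=json.dumps({"candidate_id": candidate_id, "recruiter_id": recruiter_id}),
    )
```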


Read More

July 29, 2022

Fully Automated Data Ingestion Pipeline (1.2TB) into Elasticsearch Using AWS Step Functions, Lambda, and Firehose

We regularly receive about 7,800 GZ files, each with around 100,000 records. Every file must be read and pre-processed, and because the files are large, processing them takes a long time and creates a bottleneck. On average we handle 100 million records, a labor-intensive and time-consuming task. Our old codebase read each file, processed it, and bulk-uploaded it to Elasticsearch, which took 5-7 days. I didn't like how tedious this operation was, so we built a fully automated pipeline on serverless components that loads all of this data in a fraction of the time.
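A hedged sketch of one worker in such a pipeline: a Lambda that streams a GZ file from S3 and forwards records to a Firehose delivery stream that lands in Elasticsearch. The bucket, key, delivery stream name, and preprocess() step are placeholders.

```python
# Sketch: read a GZ file from S3 line by line and batch records into Firehose.
import gzip
import json
import boto3

s3 = boto3.client("s3")
firehose = boto3.client("firehose")

def lambda_handler(event, context):
    bucket, key = event["bucket"], event["key"]          # placeholder event shape
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    batch = []
    with gzip.GzipFile(fileobj=body) as gz:
        for line in gz:
            record = preprocess(json.loads(line))        # hypothetical cleanup step
            batch.append({"Data": (json.dumps(record) + "\n").encode("utf-8")})
            if len(batch) == 500:                        # PutRecordBatch hard limit
                firehose.put_record_batch(DeliveryStreamName="es-ingest", Records=batch)
                batch = []
    if batch:
        firehose.put_record_batch(DeliveryStreamName="es-ingest", Records=batch)
```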


Read More

May 27, 2022

How We Got 50X Faster Data Lake Queries with Athena and Saved Thousands of Dollars | Case Study

We ingest data from various sources such as MongoDB, DynamoDB, and SQL Server, and we have built an internal framework that easily handles 1,000+ jobs and scales the compute environment up and down. Read more: https://www.linkedin.com/pulse/batch-frameworkan-internal-data-ingestion-framework-process-shah/. In this project and article we share how we optimized our data lake and cut Athena costs by 80%. We also list resources if you are interested in reading further.
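The usual core of this kind of Athena optimization is rewriting raw data as partitioned, compressed Parquet so queries scan far fewer bytes. Here is a hedged sketch of a CTAS doing exactly that; the database, table, bucket, and partition column are placeholders, not the case study's actual schema.

```python
# Sketch: Athena CTAS converting a raw table to partitioned Snappy Parquet.
import boto3

athena = boto3.client("athena")

ctas = """
CREATE TABLE analytics.events_parquet
WITH (
    format = 'PARQUET',
    parquet_compression = 'SNAPPY',
    external_location = 's3://my-bucket/optimized/events/',
    partitioned_by = ARRAY['event_date']
) AS
SELECT *, date(created_at) AS event_date  -- partition column must come last
FROM raw.events
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)
```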


Read More

April 11, 2022

Robust Architecture to Populate Data from MongoDB in Real Time Using Mongo Streams, EventBridge, SQS Queues, and Lambdas (Processing 20K Events Per Day)

In this project we built a pipeline that brings data from MongoDB in near real time using MongoDB Change Streams and Amazon EventBridge. We are processing 20K events per day with this architecture. In this article we show the architecture and our approach to building this pipeline.
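A hedged sketch of the capture side: tail a MongoDB change stream with pymongo and forward each change to EventBridge. The connection string, database/collection names, source, and detail-type are placeholders.

```python
# Sketch: MongoDB change stream -> EventBridge put_events.
import json
import boto3
from pymongo import MongoClient

events = boto3.client("events")
collection = MongoClient("mongodb://localhost:27017")["appdb"]["orders"]  # placeholders

# full_document="updateLookup" includes the post-update document for updates.
with collection.watch(full_document="updateLookup") as stream:
    for change in stream:
        events.put_events(
            Entries=[{
                "Source": "mongo.cdc",            # placeholder source
                "DetailType": "order-change",     # placeholder detail-type
                "EventBusName": "default",
                "Detail": json.dumps({
                    "operation": change["operationType"],
                    "document": change.get("fullDocument"),
                }, default=str),                  # handles ObjectId/datetime fields
            }]
        )
```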


Read More

March 22, 2022

Batch Framework (An Internal Data Ingestion Framework That Processes 1TB of Data per Month and Runs 200+ Jobs)

Batch Framework is a fully scalable internal framework designed to run 1,000+ jobs and scale horizontally. Each job can specify the compute environment it needs, including how many cores and how much RAM. When a job starts, it creates a process, and each process has many tasks; if any task fails, the process is marked as failed in SQL Server tables.


Read More

March 8, 2022

Elasticsearch Performance Tuning and Optimization: How We Got 80X Faster Searches | A Case Study

In Elasticsearch, we store and search over 100 million items. My team and I optimized the cluster, including selecting the appropriate instance size, fine-tuning shards and replicas, and enabling node- and shard-level caching, to achieve nearly 80X faster search. We can now search through 100 million items in under 1.5 seconds. In this article, we share our thoughts on the cluster and the steps we took to reach this level of performance.
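A hedged sketch of the kind of index-level tuning involved, assuming the Elasticsearch 8.x Python client; the index name, shard/replica counts, and refresh interval are illustrative placeholders, not the case study's exact production values.

```python
# Sketch: explicit shard/replica sizing plus a relaxed refresh interval.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

es.indices.create(
    index="items",
    settings={
        "number_of_shards": 6,      # sized to spread ~100M docs across data nodes
        "number_of_replicas": 1,    # balance durability and search throughput
        "refresh_interval": "30s",  # fewer refreshes during heavy indexing
    },
)

# Shard-level request caching can also be toggled per index.
es.indices.put_settings(index="items", settings={"index.requests.cache.enable": True})
```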


Read More

February 15, 2022