Introduction

There is no substitute for hard work and experience!

Hi,

Welcome to my blog. My name is Ansh. I am currently working at Asurion in a hybrid role spanning Data Platform/Infrastructure and Architecture, looking at ways to incorporate analytics (exploratory/predictive) within the Infrastructure team to increase efficiency and to make the Data Platform proactive as opposed to reactive. At present, about 60% of my work revolves around leading the Apache Airflow (open-sourced by Airbnb) effort for the Data Warehouse team, and the remaining 40% is about adding a face to the Data Platform, leading toward Data Democracy.

Current personal project: I am trying my hand at Angular (JavaScript) to build a front-end web application for my algorithmic trading bot (Python).

At Asurion, I started out by building a proof of concept on outlier detection (unsupervised learning) over ETL operation data, with attributes like size of data loaded, number of rows processed, duration, start time, and end time. Like all data projects, I soon realized that the missing component was more quality data. We were limited by the orchestrator and ETL tools in use at the time, and by technical debt, which definitely did not make it easy to collect more data. That is when I came across Apache Airflow as an orchestrator.
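To give a flavor of that kind of PoC (a minimal sketch, not the actual code), here is an unsupervised outlier-detection pass with scikit-learn's IsolationForest over made-up ETL run metrics; the column names, values, and contamination setting are purely illustrative.

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical ETL run metrics; names and values are made up for illustration.
runs = pd.DataFrame({
    "bytes_loaded":   [1.2e9, 1.1e9, 1.3e9, 0.9e9, 7.5e10],  # last run looks suspicious
    "rows_processed": [5.0e6, 4.8e6, 5.1e6, 4.6e6, 5.2e6],
    "duration_sec":   [620, 600, 640, 590, 4100],
})

# Fit an unsupervised model and flag the runs it considers anomalous (-1).
model = IsolationForest(contamination=0.2, random_state=42)
runs["is_outlier"] = model.fit_predict(runs) == -1

print(runs[runs["is_outlier"]])
```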

As I started learning more about Airflow, I found that, like most tools, it did not check every box we were looking to tick. But since Airflow is Python based, we knew we could customize it to our needs. Our aim was to diverge as little as possible from the open-source version so that our Airflow upgrade cycles would not be drastically delayed.

Building for the team, the three main features we implemented on top of Apache Airflow are:

  1. Collage [Canvas/Palette/Studio] – a suite of tools that saves ~$70,000 annually
  2. Celery-Executor Auto Scaling workers – by default, the number of workers per queue is fixed
  3. Integration with the internal enterprise-wide Data Catalog

Collage is a suite of three tools: Canvas, Palette, and Studio.

Collage -> Canvas:

Canvas is a parser that, at its core, parses a JSON spec and deploys an Airflow DAG. The thought process behind creating Canvas was that not everybody on the team might understand or be familiar with Python. And since Airflow DAGs are Python files, how do we make sure the DAGs are tested? How do we make sure the DAG code follows all coding standards and best practices?

Canvas essentially allows us to produce repeatable, modularised, and well-tested DAGs, while shifting the developer's responsibility to learning a JSON schema instead of learning and maintaining the Airflow Python API. We made sure Canvas itself is unit and integration tested, reducing the effort needed to test individual DAGs. This also allowed us to onboard DevOps sooner and drastically reduce time-to-production. Now for the interesting part: we know there are no free lunches, and there were trade-offs we were willing to accept; we gave up the flexibility to create ad-hoc DAGs. This was an acceptable trade-off for us, as ad-hoc DAGs could potentially lead to chaos (I agree it's a strong word): it does get difficult to go through each DAG before deployment when one is dealing with ~100 Airflow DAGs and ~1,500 Airflow tasks in production.
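To make the idea concrete, here is a minimal sketch of the pattern rather than Canvas itself: a small JSON spec parsed into an Airflow DAG. The schema fields, the BashOperator-only task type, and the Airflow 2.x-style imports are illustrative placeholders, not the real Canvas schema.

```python
import json
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical JSON spec; the real Canvas schema is not shown here.
SPEC = json.loads("""
{
  "dag_id": "example_canvas_dag",
  "schedule": "@daily",
  "tasks": [
    {"task_id": "extract", "bash_command": "echo extract"},
    {"task_id": "load",    "bash_command": "echo load", "depends_on": ["extract"]}
  ]
}
""")

with DAG(
    dag_id=SPEC["dag_id"],
    schedule_interval=SPEC["schedule"],
    start_date=datetime(2021, 1, 1),
    catchup=False,
) as dag:
    # Build one operator per task declared in the spec.
    tasks = {
        t["task_id"]: BashOperator(task_id=t["task_id"], bash_command=t["bash_command"])
        for t in SPEC["tasks"]
    }
    # Wire up the dependencies declared in the spec.
    for t in SPEC["tasks"]:
        for upstream in t.get("depends_on", []):
            tasks[upstream] >> tasks[t["task_id"]]
```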

Collage -> Palette: write-up on its way…

Collage -> Studio: write-up on its way…

Celery-Executor Auto Scaling workers:

This feature allows Airflow Celery workers to scale in or scale out based on demand or workload, enabling cost and time savings. The most challenging piece of the project was building the scale-in policies for Airflow worker nodes: deciding which worker to drain and how to drain it. We need to make sure not to kill or remove a worker that still has running jobs, and to scale in only once the worker has finished processing its jobs.
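As a rough illustration of the scale-in concern (not our production policy), one approach is to use Celery's inspect/control API to confirm that a worker has no active tasks before draining it, and only then remove the underlying instance. The broker URL, queue name, and worker names below are placeholders.

```python
from celery import Celery

# Placeholder broker URL; the real deployment details differ.
app = Celery(broker="redis://localhost:6379/0")

def idle_workers(candidates):
    """Return the candidate workers that currently report no active tasks."""
    active = app.control.inspect(destination=candidates).active() or {}
    return [w for w in candidates if not active.get(w)]

def drain(worker):
    """Stop a worker from consuming new tasks from the queue.
    Terminating the underlying instance would happen only after it is idle."""
    app.control.cancel_consumer("default", destination=[worker])

for worker in idle_workers(["celery@worker-1", "celery@worker-2"]):
    drain(worker)
```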

Integration with the internal enterprise-wide Data Catalog: write-up on its way…

My time at Asurion has been productive; as a team, we have participated in and won several regional and global hackathons:

  • 2021 – Winner – Regional Hackathon
  • 2020 – Runner-Up – Regional Hackathon
  • 2018 – Winner – Global Hackathon
  • 2018 – Winner – Regional Hackathon

I graduated from The George Washington University, USA (2018) with an MS in Data Science (GPA 3.97), on a tuition fellowship. I completed my undergraduate degree in Computer Science & Engineering with a minor in Mathematics.

With about four years of work experience in data and software, I am curious about building data platforms and about what each dataset has to offer, and I enjoy exploring and analyzing data through various tools – Python, AWS, Apache Airflow, Apache Spark, Elasticsearch, and Kibana. I have a knack for picking up new tools and getting up to speed with them by building hands-on projects.

One of my projects, @TerrorCases (a Twitter bot), was featured by NBC News [link].

The top-right menu has links to my social media handles (LinkedIn, GitHub, Twitter, Instagram).

Calligraphy and philately are what keep me busy when I'm not working. If you're interested in grabbing a cup of coffee or talking about anything, related or otherwise, do drop me an email at:
anshgandhi16@gmail.com

It would be great to connect.