Professional Experience
Data Specialist | University of Maryland
Part-time (20 hrs/week) | College Park, MD | SEP 2023 - MAY 2025
Built an NLP pipeline that transforms free-text surveys into analytics-ready datasets using PyTorch, LangChain, and HuggingFace Transformers, saving 8 research analysts a collective 20+ hours of manual data querying per day (see the Transformers sketch below)
Deployed R Shiny applications on GCP using ShinyProxy, Docker, and Terraform to enable multi-user collaboration on internal research tools. Incorporated security controls (IAM, RBAC) alongside a Flask application gateway with Google OAuth and reverse-proxy SSL to ensure secure access.
Prepared a proof of concept using GCP Pub/Sub, Apache Beam, and BigQuery to process 20M+ daily clickstream events from Canvas ELMS, enabling near-real-time classroom analytics for the academic leadership team via Apache Superset dashboards (see the Beam sketch below)
Optimized AWS ETL workflows through root-cause analysis, implementing incremental ingestion and advanced SQL techniques (CTEs, partitioning, indexing) that cut processing time from 7 hours to 4 while preserving data normalization for accurate reporting (incremental pattern sketched below)
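Below is a toy sketch of the kind of survey-classification step the NLP pipeline bullet describes, assuming a zero-shot HuggingFace model; the model name, sample comment, and candidate labels are illustrative placeholders, not the production setup.

# Hypothetical zero-shot tagging of a free-text survey comment.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The course workload felt unmanageable this semester.",
    candidate_labels=["workload", "instruction quality", "facilities"],
)
# The highest-scoring label becomes the analytics-ready tag for this response.
print(result["labels"][0], round(result["scores"][0], 2))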
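A minimal sketch of the Pub/Sub-to-BigQuery proof of concept; the project, topic, table, and schema names are hypothetical, and a streaming runner such as Dataflow is assumed.

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)  # run on Dataflow at production scale

with beam.Pipeline(options=options) as p:
    (
        p
        # Read raw clickstream events published by the LMS.
        | "ReadEvents" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/canvas-clickstream")
        # Decode each message into a BigQuery-ready row.
        | "Parse" >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
        # Append rows so the Superset dashboards stay current.
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-project:analytics.clickstream_events",
            schema="user_id:STRING,course_id:STRING,event_type:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
    )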
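The incremental-ingestion pattern from the ETL optimization bullet, sketched as a parameterized query; the table, columns, and watermark parameter are hypothetical.

# Pull only rows newer than the last high-water mark, deduplicating on the
# business key with a CTE before loading downstream.
INCREMENTAL_QUERY = """
WITH new_rows AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY survey_id ORDER BY updated_at DESC) AS rn
    FROM raw.survey_responses
    WHERE updated_at > %(last_watermark)s
)
SELECT * FROM new_rows WHERE rn = 1
"""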
Senior Software Engineer | Tiger Analytics
Full-time | Chennai, India | JUL 2021 - JUL 2023
Partnered with data architects to prototype and launch 2 enterprise solutions: Tiger Intelligent Data Express and Tiger Data Observability Framework. Scaled the serverless backend of the MVPs to a microservice architecture leveraging FastAPI and Docker to support 3X user growth.
Developed a metadata-driven ETL ingestion framework using Airbyte, Apache Airflow, AWS Glue, and Python, facilitating rapid ingestion from diverse enterprise data sources (CDC, streaming, batch, on-premises databases) and reducing asset onboarding time from weeks to hours (see the Airflow sketch below)
Engineered an ACID-compliant Lakehouse solution using Apache Iceberg, AWS S3, Athena, and Redshift. Implemented slowly changing dimensions (SCD Type-2) and time travel capabilities to ensure historical data integrity and support analytical workloads across business verticals (merge pattern sketched below)
Built a data quality tool with Great Expectations, Apache Spark, and Airflow, used across multiple data projects to run 30+ custom checks on a range of big-data file formats and datasets, reducing data anomalies by 60% across downstream BI and analytics pipelines (see the Great Expectations sketch below)
Implemented a real-time infrastructure observability pipeline utilizing AWS CloudWatch, ELK stack (Elasticsearch, Logstash, Kibana), and Grafana, accelerating root cause analysis and decreasing Mean-Time-To-Resolution (MTTR) for production issues by 40%
Established data lineage and governance capabilities through seamless integration of LinkedIn DataHub, providing detailed tracking, auditability, and visualization of data flow, aiding regulatory compliance (GDPR, CCPA) and trust in data assets
Collaborated with business analysts and sales teams to translate functional requirements into engineering solutions, communicating complex technical architectures through simplified presentations in client-facing discussions to strengthen stakeholder engagement
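A sketch of the metadata-driven ingestion idea: one Airflow DAG whose tasks are generated from a source catalog. The source entries and the ingest_source() helper are hypothetical stand-ins for the actual Airbyte/Glue calls.

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical metadata catalog; in practice this would be loaded from config.
SOURCES = [
    {"name": "orders_db", "mode": "cdc"},
    {"name": "events_stream", "mode": "streaming"},
    {"name": "finance_exports", "mode": "batch"},
]

def ingest_source(name: str, mode: str) -> None:
    # Placeholder: the real task would trigger Airbyte or an AWS Glue job.
    print(f"Ingesting {name} in {mode} mode")

with DAG("metadata_driven_ingestion", start_date=datetime(2022, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    # One ingestion task per catalog entry, so onboarding a new asset is
    # a new metadata record rather than new pipeline code.
    for src in SOURCES:
        PythonOperator(task_id=f"ingest_{src['name']}",
                       python_callable=ingest_source, op_kwargs=src)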
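The SCD Type-2 merge pattern on Iceberg, sketched in Spark SQL; the table, key, and column names are hypothetical, and the second pass that inserts the new version of changed rows is omitted for brevity.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes an Iceberg catalog is configured

# Close out current rows whose attributes changed and insert brand-new keys.
spark.sql("""
MERGE INTO lakehouse.dim_customer AS t
USING staging.customer_updates AS s
  ON t.customer_id = s.customer_id AND t.is_current = true
WHEN MATCHED AND t.row_hash <> s.row_hash THEN
  UPDATE SET t.is_current = false, t.valid_to = s.effective_date
WHEN NOT MATCHED THEN
  INSERT (customer_id, row_hash, valid_from, valid_to, is_current)
  VALUES (s.customer_id, s.row_hash, s.effective_date, NULL, true)
""")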
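A minimal Great Expectations sketch in the spirit of the data quality tool, using the older pandas-dataset API; the file path and the two checks are illustrative.

import great_expectations as ge
import pandas as pd

# Wrap a plain DataFrame so expectation methods become available on it.
df = ge.from_pandas(pd.read_csv("data/transactions.csv"))
df.expect_column_values_to_not_be_null("transaction_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Run all registered expectations and gate downstream loads on the result.
result = df.validate()
print(result.success)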
Career break from DEC 2019 to JUN 2021 to focus on personal goals
Intern & Software Engineer | Xenonstack
Full-time | Chandigarh, India | JAN 2019 - NOV 2019
Modernized a data platform by migrating legacy Hadoop workflows and Pig scripts to Scala/PySpark ETL jobs on Databricks, processing IoT sensor and weather data from 45 geo-locations via Kafka to create a high-availability data lake powering data science workflows
Collaborated with the MLOps team to build a Python framework using MLflow to automate model management and scoring, improving feature-engineering efficiency by 33% and reducing model discovery time in production (see the MLflow sketch below)
Scaled TensorFlow model training by implementing a Ray-based distributed pipeline across a 6-node cluster, reducing computation time by ~10%
Developed a cost-optimization strategy to bid on and select EC2 spot instances for AWS EMR jobs during off-peak hours (spot-price query sketched below)
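A sketch of the MLflow-based model management idea; the experiment, metric, and registered-model names are placeholders, and a simple scikit-learn model stands in for the actual scoring models.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)
mlflow.set_experiment("sensor-anomaly")

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Registering the model makes it discoverable for production scoring.
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="sensor_anomaly_rf")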
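The spot-instance selection idea, sketched with boto3: query recent spot price history and pick the cheapest availability zone for the EMR instance type. The region and instance type are examples.

from datetime import datetime, timedelta
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Look at the last six hours of spot prices for the target instance type.
history = ec2.describe_spot_price_history(
    InstanceTypes=["m5.xlarge"],
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=6),
)

# Choose the cheapest AZ as the bid target for the off-peak EMR job.
cheapest = min(history["SpotPriceHistory"], key=lambda p: float(p["SpotPrice"]))
print(cheapest["AvailabilityZone"], cheapest["SpotPrice"])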
Education
University of Maryland - College Park | USA
Master's in Information Management
2023-2025 | GPA: 4.0/4.0
Relevant Coursework: Big Data Infrastructure, Data Analytics, Data Integration, Advanced Data Science, Cloud Computing, Product Management
Received a full tuition waiver for the entire duration of the degree program
Panjab University - Chandigarh | India
B.E. Information Technology
2015-2019 | GPA: 3.74/4.0
Relevant Coursework: Data Structures and Algorithms, Database Systems, Network Security, Operating Systems, Object Oriented Programming
Selected Works
Data Fusion Engineering
Google Cloud
Apache Spark
Terraform
Apache Superset
Bash
SQL
- Developed a complete analytics solution on GCP to incrementally ingest, store, transform, and analyze data from 6 NYC Open Data APIs
- Created automated ETL workflows and auto-updating dashboards to visualize KPIs, aiding city planning and identifying accident-prone areas
View on GitHub →
Intelligent Record Management
PyTorch
NLP Tools
Streamlit
Elasticsearch
Gemma2
- A document processing and semantic search system for intelligent indexing of congressional archives
- Lets users submit a query and retrieve relevant past press releases and document summaries using embeddings, NER, and keyword extraction (retrieval step sketched below)
View on GitHub →
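A sketch of the retrieval step: embed the query and run a kNN search against an index of press-release embeddings. The index and field names are hypothetical, and an Elasticsearch 8.x cluster with a dense_vector mapping is assumed.

from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Embed the user query with the same model used at indexing time.
query_vec = encoder.encode("funding for rural broadband").tolist()

resp = es.search(
    index="press_releases",
    knn={"field": "embedding", "query_vector": query_vec,
         "k": 5, "num_candidates": 50},
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["title"], hit["_score"])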
Loan Default Prediction System
PySpark
Pandas
Seaborn
scikit-learn
XGBoost
Random Forest
K-Means Clustering
Dimensionality Reduction
Customer Segmentation
- Crafted an end-to-end machine learning pipeline to predict loan defaults, incorporating data preprocessing, feature engineering, and supervised ML modeling (see the pipeline sketch below)
- Prepared a borrower segmentation visualizer using K-Means clustering to identify high-risk defaulter profiles
View on GitHub →
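A minimal sketch of the supervised portion of the pipeline; the file path, label column, and hyperparameters are placeholders, and numeric features are assumed.

import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

df = pd.read_csv("loans.csv")
X, y = df.drop(columns=["default"]), df["default"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

# Preprocessing and model travel together so scoring stays consistent.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", XGBClassifier(n_estimators=300, max_depth=5,
                            eval_metric="logloss")),
])
pipe.fit(X_train, y_train)
print("AUC:", roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1]))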
Data Preparation for Fintech Analytics
AWS
Python
Great Expectations
Postgres
Tableau
- Developed a 5-step AWS Glue and Great Expectations pipeline for metadata extraction, profiling, validation, and transformation
- Built an event-driven pipeline with AWS Lambda, Step Functions, and S3 for real-time quality checks and transformations
- Integrated cleaned datasets into AWS RDS and created Tableau dashboards to visualize data quality trends and anomalies
View on GitHub →
Monitoring EKS Cluster
AWS
Terraform
Helm
Prometheus
Jenkins CI/CD
Kubernetes
- Streamlined the setup for deploying the OpenTelemetry demo webshop
- Enabled modular deployment by splitting large Kubernetes manifests into component YAML files
- Provisioned EKS, Grafana, and Prometheus with Terraform and Helm
- Developed a system to collect logs from the kube-system namespace and track unhealthy pods (see the sketch below)
- Built pipelines to test and build Docker images via a remote Jenkins server
View on GitHub →
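A sketch of the unhealthy-pod tracker using the official kubernetes Python client; kubeconfig-based auth is assumed (use load_incluster_config() when running inside the cluster).

from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Flag kube-system pods that are neither healthy nor cleanly finished.
for pod in v1.list_namespaced_pod("kube-system").items:
    if pod.status.phase not in ("Running", "Succeeded"):
        print(pod.metadata.name, pod.status.phase)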
Sports Analytics System
Python
Plotly
Pandas
Tableau
- Developed an EDA framework analyzing 10+ performance KPIs
- Created interactive dashboards for tactical analysis
- Identified 3 key success factors through Bayesian analysis
View on Kaggle →