Hi πŸ‘‹, I'm Bangash!
Nice to meet you.

Senior Data Solution Architect with 11+ years' experience designing and optimizing scalable data solutions. Expert in ETL pipelines, big data processing, and cloud architectures (Talend, NiFi, Airflow, Informatica) across AWS, Azure, and GCP. Skilled in data warehousing (Star, Snowflake, Data Vault) and big data tools (Hadoop, Spark, Kafka, HDFS) for real-time streaming. Strong in data governance, ensuring quality, metadata management, and compliance (HIPAA, GDPR). Experienced in deploying ML models (Scikit-learn, TensorFlow, PyTorch) via Databricks. Proficient in data visualization (Tableau, Power BI, QuickSight, Plotly) to deliver insights. Adept in DevOps practices with Docker, Kubernetes, and CI/CD pipelines for efficient delivery.


Skills

A quick snapshot of my toolkit

Python, SQL, Golang, Docker, Kafka, Debezium, Spark, LangChain, HuggingFace, RAG, Scikit-Learn, AWS, GCP, Azure, MySQL, PostgreSQL, MongoDB, Snowflake, Databricks, Tableau, Power BI, Alteryx, Talend, FastAPI

Experience

Data Solution Architect

ApTask
2022.10 - Present

Technologies

Apache Kafka · AWS (EC2, Lambda) · Apache Flink · AWS Kinesis · GCP · Azure · Delta Lake · Databricks · Snowflake

Highlights

  • Architected and deployed real-time data streaming infrastructures using Apache Kafka, Apache Flink, and AWS Kinesis, enabling 99.9% uptime for data pipelines and improving supply chain visibility and operational responsiveness by 35%.
  • Designed and implemented scalable, cloud-native data architectures across AWS, Azure, and GCP, integrating Amazon Redshift, Google BigQuery, Azure Data Lake, Databricks Lakehouse, and Snowflake, leading to a 50% reduction in infrastructure costs and enhanced performance elasticity.
  • Led end-to-end migration of on-premise data warehouses to modern cloud ecosystems, leveraging Snowflake, Delta Lake, and Databricks, resulting in a 60% improvement in query performance and a 70% decrease in maintenance overhead.
  • Engineered and optimized over 250 ETL/ELT workflows using Talend, Apache NiFi, Apache Airflow, and Informatica, automating ingestion from MongoDB, flat files, and global systems in structured and semi-structured formats.
  • Developed and deployed predictive analytics models in Python, Apache Spark, and MLlib, increasing demand forecast accuracy by 25% and reducing inventory stockouts by 20%.
  • Built and maintained CMS-related data pipelines (MOR, MMR, MAO) using GCP Dataflow and BigQuery to support healthcare compliance reporting.
  • Established and enforced enterprise-wide data governance policies, ensuring 100% compliance with GDPR and HIPAA, while implementing data lineage tracking, access control, and sensitive data masking using tools like Apache Atlas and AWS Lake Formation.
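Much of the streaming work above comes down to windowed aggregation over event streams. As a toy single-machine sketch of that idea (illustrative only; the production Kafka/Flink/Kinesis code is not shown here, and the event names are hypothetical), a tumbling-window count looks like this:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Bucket (timestamp, key) events into fixed-size tumbling windows
    and count occurrences per key -- the core of a windowed streaming
    aggregation, minus the distribution Flink/Kinesis provide."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = ts - (ts % window_seconds)
        windows[window_start][key] += 1
    return {start: dict(counts) for start, counts in sorted(windows.items())}

# Simulated supply-chain events: (epoch seconds, shipment status)
events = [(0, "shipped"), (30, "shipped"), (45, "delayed"),
          (70, "delivered"), (95, "shipped")]
print(tumbling_window_counts(events))
# -> {0: {'shipped': 2, 'delayed': 1}, 60: {'delivered': 1, 'shipped': 1}}
```

A real deployment replaces the in-memory dict with stateful operators and handles late-arriving events via watermarks, but the windowing arithmetic is the same.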

Senior Data Engineer

Petra Power
2019.08 - 2022.09

Technologies

MLflow · Databricks · Apache Airflow · Apache NiFi · Amazon S3 · HBase · HDFS · Hive · Kafka · Spark · Hadoop · A/B Testing · Time Series Modeling

Highlights

  • Developed and optimized large-scale data pipelines using Hadoop, Spark, Kafka, and Hive, significantly improving data processing speeds.
  • Implemented data contracts and aligned with Data Mesh principles for decentralized ownership across distributed teams.
  • Designed and implemented distributed storage solutions using HDFS, HBase, and Amazon S3, enhancing data accessibility and fault tolerance.
  • Automated ETL workflows with Apache NiFi and Airflow, ensuring seamless data ingestion and transformation across cloud platforms.
  • Collaborated with data science teams to deploy and maintain machine learning models on Databricks and MLflow, improving predictive capabilities.
  • Implemented advanced performance tuning techniques for Apache Spark, reducing query execution times and improving scalability.
  • Established data governance frameworks, enforcing data privacy, data lineage tracking, and industry regulation compliance.
  • Led cross-functional teams in building real-time analytics platforms, delivering actionable insights to executives and stakeholders.
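The NiFi/Airflow automation above rests on one core guarantee: tasks run only after their upstream dependencies succeed. A minimal sketch of that scheduling idea using Kahn's topological sort (task names `extract`/`transform`/`load` are illustrative, not an Airflow API):

```python
from collections import deque

def run_dag(tasks, deps):
    """Run callables in dependency order (Kahn's topological sort),
    the core ordering guarantee an Airflow scheduler provides."""
    indegree = {name: 0 for name in tasks}
    children = {name: [] for name in tasks}
    for downstream, upstreams in deps.items():
        for upstream in upstreams:
            indegree[downstream] += 1
            children[upstream].append(downstream)
    ready = deque(name for name in tasks if indegree[name] == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        tasks[name]()  # execute the task's callable
        for child in children[name]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(tasks):
        raise ValueError("cycle detected in DAG")
    return order

log = []
dag_tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
dag_deps = {"transform": ["extract"], "load": ["transform"]}
print(run_dag(dag_tasks, dag_deps))
# -> ['extract', 'transform', 'load']
```

Airflow adds retries, backfills, and distributed executors on top, but the dependency resolution is exactly this.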

Data Engineer

Flatiron Health
2017.06 - 2019.07

Technologies

REST APIs · Great Expectations · Python · Apache Beam · Google Cloud Dataflow

Highlights

  • Designed and optimized scalable ETL pipelines using Apache Beam, Python, and Google Cloud Dataflow, supporting high-volume data processing.
  • Architected and implemented a data lake on Google Cloud Platform (GCP), enhancing data accessibility and cross-functional analytics.
  • Developed and automated data quality validation frameworks using Great Expectations, reducing data discrepancies by 40%.
  • Standardized data models and schema designs to improve reporting consistency and reduce redundancy across business units.
  • Integrated data from diverse sources, including REST APIs, flat files, and cloud databases, streamlining ingestion workflows and reducing delivery time by 30%.
  • Collaborated closely with analytics and engineering teams to support real-time data access, increasing operational efficiency.
  • Played a key role in enabling scalable healthcare data infrastructure to support precision oncology research and analytics.
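The validation frameworks above follow the pattern Great Expectations popularized: evaluate declarative expectations against a batch of records and report which rows fail. A pure-Python sketch of that pattern (function names and sample records are illustrative, not the Great Expectations API):

```python
def expect_values_not_null(records, column):
    """Report which row indices have a null value in `column`."""
    failures = [i for i, row in enumerate(records) if row.get(column) is None]
    return {"success": not failures, "failed_rows": failures}

def expect_values_between(records, column, low, high):
    """Report which row indices fall outside [low, high] (nulls skipped)."""
    failures = [i for i, row in enumerate(records)
                if row.get(column) is not None and not (low <= row[column] <= high)]
    return {"success": not failures, "failed_rows": failures}

records = [
    {"patient_id": "p-001", "age": 34},
    {"patient_id": None,    "age": 29},
    {"patient_id": "p-003", "age": 212},  # implausible age
]
print(expect_values_not_null(records, "patient_id"))
# -> {'success': False, 'failed_rows': [1]}
print(expect_values_between(records, "age", 0, 120))
# -> {'success': False, 'failed_rows': [2]}
```

Running such checks at ingestion, and failing the pipeline on `success: False`, is what turns ad hoc spot checks into an automated quality gate.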

Projects

SafeStreets

Data Solution Architect

Regulatory compliance · AWS · Apache Airflow · Apache NiFi · Apache Kafka
HIPAA-compliant data pipelines

Designed and led the development of a real-time healthcare analytics platform integrating EHR and claims data using Apache Kafka, Apache Flink, and AWS Kinesis.
Enabled predictive insights for population health management and reduced data processing latency by 60%.
Deployed HIPAA-compliant data pipelines with Apache NiFi and Airflow on AWS, enhancing care quality and regulatory compliance.

FinSight

Cloud Data Lakehouse Migration

MLflow · Machine learning models · Talend · ETL workflows · Microsoft Azure · Delta Lake · Databricks

Led the migration of legacy on-premises data infrastructure to a unified cloud-based lakehouse using Databricks and Delta Lake on Azure.
Streamlined ETL workflows using Apache Spark and Talend, improving data refresh rates by 70%.
Integrated machine learning models with MLflow to forecast energy demands, increasing predictive accuracy by 30%.

DocuQuery

Financial Data Pipeline Modernization

Microsoft Azure · Data quality checks · Data validation · Data architecture · Cloud-native data lake

Developed scalable ETL pipelines with Apache Beam, Python, and Google Cloud Dataflow, processing over 10 million financial records daily.
Designed a cloud-native data lake on GCP, enabling seamless access to structured and unstructured data for cross-team analytics.
Implemented automated data validation and quality checks using Great Expectations, reducing data inconsistencies by 40%.
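Apache Beam pipelines like the one above are built by composing transforms over collections of records. A minimal single-machine sketch of that composition model (the transforms and sample data are illustrative; real Beam adds distributed runners, windowing, and I/O connectors):

```python
def run_pipeline(records, *transforms):
    """Apply a chain of transforms to a batch of records -- the same
    composition model Apache Beam generalizes to distributed runners."""
    for transform in transforms:
        records = list(transform(records))
    return records

def parse(lines):
    # "account,amount" CSV lines -> dicts
    for line in lines:
        account, amount = line.split(",")
        yield {"account": account, "amount": float(amount)}

def drop_refunds(rows):
    # keep only non-negative amounts
    return (r for r in rows if r["amount"] >= 0)

raw = ["a1,120.50", "a2,-3.00", "a3,88.25"]
print(run_pipeline(raw, parse, drop_refunds))
# -> [{'account': 'a1', 'amount': 120.5}, {'account': 'a3', 'amount': 88.25}]
```

Because each transform only sees an iterable of records, the same chain scales from a local batch to millions of records a day once a distributed runner executes it.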

Sue-per Bot

ML Feature Store for Fraud Detection

Real-time feature engineeringFraud detection Model iteration FeastMLflowDatabricks

Designed and deployed a centralized ML Feature Store using Databricks, MLflow, and Feast, enabling 3× faster model iterations. Reduced fraud detection false positives by 18% through real-time feature engineering.
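A feature store like Feast materializes batch-computed features into an online key-value view so models can fetch them with low latency at inference time. A toy sketch of that pattern (the class, method, and feature names are illustrative, not Feast's API):

```python
class MiniFeatureStore:
    """Toy feature store: materialize batch-computed features into an
    online key-value view for low-latency lookup at inference time.
    Feast implements this pattern on real offline/online backends."""

    def __init__(self):
        self._online = {}

    def materialize(self, view, rows, key):
        # copy each row's features (minus the entity key) into the online view
        for row in rows:
            self._online[(view, row[key])] = {
                k: v for k, v in row.items() if k != key
            }

    def get_online_features(self, view, entity_id):
        return self._online.get((view, entity_id), {})

store = MiniFeatureStore()
store.materialize("card_txn_stats",
                  [{"card_id": "card_123", "txn_count_1h": 4, "avg_amount_7d": 52.1}],
                  key="card_id")
print(store.get_online_features("card_txn_stats", "card_123"))
# -> {'txn_count_1h': 4, 'avg_amount_7d': 52.1}
```

Serving the same feature definitions to both training and inference is what keeps the fraud model's online inputs consistent with the data it was trained on.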