Data Engineer - Python

Colombo, Sri Lanka

Job Title: Data Engineer

Employment Type: Full-Time

Location: Hybrid

About the Role:

We are seeking a Big Data Engineer with cloud experience to architect, develop, and maintain large-scale, cloud-native data platforms. The ideal candidate will have hands-on experience in distributed data processing, big data analytics, ML/AI workflows, and cloud infrastructure across AWS and Azure.

The role requires expertise in advanced analytics, machine learning, RAG, LLMs, AI agents, and full-stack data pipeline development. Frontend experience is a plus for building dashboards and visualization solutions.

Key Responsibilities:

  • Big Data & Analytics:
    • Develop distributed ETL/ELT pipelines using Apache Spark, Dask, Databricks, Airflow, Polars, Pandas, and SQL.
    • Perform high-performance analytics on large datasets, leveraging batch and streaming pipelines.
    • Build data lakes, warehouses, and marts for real-time and batch analytics.
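
To make the ETL/ELT expectations concrete, here is a minimal PySpark batch sketch of the kind of job this work involves. The S3 paths, column names, and aggregation are hypothetical placeholders, not a prescribed pipeline.

```python
# A minimal batch ETL sketch in PySpark. Bucket paths and column
# names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-orders-etl").getOrCreate()

# Extract: read raw JSON events from a (hypothetical) landing zone.
raw = spark.read.json("s3://example-landing-zone/orders/2024-01-01/")

# Transform: fix types, drop malformed rows, aggregate per customer.
daily_totals = (
    raw.withColumn("amount", F.col("amount").cast("double"))
       .dropna(subset=["customer_id", "amount"])
       .groupBy("customer_id")
       .agg(F.sum("amount").alias("daily_total"))
)

# Load: write partitioned Parquet to a (hypothetical) curated zone.
daily_totals.write.mode("overwrite").parquet(
    "s3://example-curated-zone/orders_daily/2024-01-01/"
)

spark.stop()
```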


  • Web Scraping & Data Acquisition:
    • Develop robust web scraping pipelines using Selenium and Zyte for structured and semi-structured data ingestion.
    • Ensure scalable, automated, and reliable data collection from diverse web sources.
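
A minimal sketch of the kind of Selenium scraper this covers, assuming a hypothetical listing page and CSS selectors; a production pipeline would add retries, rate limiting, and Zyte-managed proxies.

```python
# A minimal Selenium scraping sketch. The target URL and CSS
# selectors are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless=new")  # run without a visible browser

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/products")  # hypothetical listing page
    rows = driver.find_elements(By.CSS_SELECTOR, ".product-card")
    records = [
        {
            "name": row.find_element(By.CSS_SELECTOR, ".name").text,
            "price": row.find_element(By.CSS_SELECTOR, ".price").text,
        }
        for row in rows
    ]
    print(records)
finally:
    driver.quit()
```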


  • Machine Learning & AI:
    • Deploy and maintain machine learning models and AI agents using SageMaker, Azure ML, Docker, ECS, and serverless pipelines.
    • Implement RAG (Retrieval-Augmented Generation) workflows, LLM integration, and AI agent orchestration for advanced analytics.
    • Collaborate with data scientists on feature engineering, model training, and production ML pipelines, using Airflow to orchestrate pipeline runs.
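
As a rough illustration of the Airflow orchestration mentioned above, here is a minimal DAG sketch; the three task bodies are hypothetical stubs standing in for real feature-engineering, SageMaker/Azure ML training, and deployment steps.

```python
# A minimal Airflow 2.x DAG sketch chaining feature engineering,
# training, and deployment. Task bodies are hypothetical stubs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def build_features():
    ...  # e.g., read curated Parquet and compute features

def train_model():
    ...  # e.g., launch a SageMaker or Azure ML training job

def deploy_model():
    ...  # e.g., update a model endpoint or ECS service

with DAG(
    dag_id="ml_pipeline_example",   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    features = PythonOperator(task_id="build_features", python_callable=build_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    deploy = PythonOperator(task_id="deploy_model", python_callable=deploy_model)

    features >> train >> deploy
```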


  • Cloud & Infrastructure:
    • Design and maintain scalable, highly available cloud-native data platforms using AWS and Azure.
    • Utilize AWS services: S3, EC2, ECS/Fargate, Lambda, Glue, Redshift, Athena, Kinesis, EMR, SageMaker, CloudFormation, CloudWatch, IAM.
    • Utilize Azure services: Blob Storage, Data Lake, Synapse Analytics, Databricks, Event Hubs, Azure ML Studio, Functions, Key Vault, Monitor.
    • Implement hybrid cloud architectures for secure, high-throughput data processing.
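
For a flavor of the AWS tooling involved, here is a minimal boto3 sketch that starts an Athena query over data in S3; the database, table, and results bucket are hypothetical placeholders.

```python
# A minimal boto3 sketch: kick off an Athena query against S3 data.
# Database, table, and bucket names are hypothetical.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id",
    QueryExecutionContext={"Database": "analytics_db"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print("Started query:", response["QueryExecutionId"])
```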


  • DevOps & Automation:
    • Implement CI/CD pipelines, containerization, orchestration, and serverless workflows using Docker, Kubernetes, ECS/EKS, CloudFormation, and the Serverless Framework; Terraform is a plus.
    • Automate monitoring, alerting, logging, and cost optimization using CloudWatch, and monitor Databricks workloads.
    • Ensure data security, governance, and compliance across pipelines.
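
As one example of the monitoring automation described above, a minimal boto3 sketch that creates a CloudWatch alarm; the metric namespace, metric name, and SNS topic ARN are hypothetical.

```python
# A minimal boto3 sketch: create a CloudWatch alarm on a custom
# pipeline metric. Namespace, metric, and topic ARN are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="etl-failed-records-high",
    Namespace="ExampleETL",          # hypothetical custom namespace
    MetricName="FailedRecords",      # hypothetical custom metric
    Statistic="Sum",
    Period=300,                      # evaluate in 5-minute windows
    EvaluationPeriods=1,
    Threshold=100,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:example-alerts"],
)
```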


  • Frontend & Visualization (Optional/Additional):
    • Build interactive dashboards using Power BI, Tableau, or React/NextJS.
    • Enable self-service analytics for stakeholders.


  • Collaboration:
    • Work closely with teams of engineers, data scientists, and analysts to deliver end-to-end cloud data solutions.
    • Define and follow best practices in distributed computing, analytics, and ML workflows.

 

Required Skills:

  • Big Data & Distributed Computing: Apache Spark, Dask, Databricks, Airflow, Polars, Pandas, SQL, PostgreSQL.
  • Cloud Platforms: AWS (S3, EC2, ECS/Fargate, Lambda, Glue, Redshift, Athena, Kinesis, EMR, SageMaker) and Azure (Blob Storage, Data Lake, Synapse, Databricks, Event Hubs, ML Studio).
  • Programming & API Development: Python (advanced), Node.js, FastAPI (see the sketch after this list).
  • DevOps & Infrastructure: Docker, ECS/EKS, Kubernetes, Serverless Framework, Terraform, CI/CD pipelines, monitoring/logging.
  • Machine Learning & AI: ML deployment, feature engineering, RAG, LLMs, AI agents, SageMaker/Azure ML integration.
  • Analytics & Visualization: Power BI, Tableau, Excel (advanced), optional React/NextJS.
  • Other: FTP/SFTP ingestion, ETL automation, high-performance computing, data governance, and security best practices.
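
A minimal FastAPI sketch of the kind of analytics API this role builds; the route, response model, and stubbed data source are hypothetical.

```python
# A minimal FastAPI sketch exposing an analytics endpoint. The route,
# model, and data source are hypothetical placeholders.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class DailyTotal(BaseModel):
    customer_id: str
    daily_total: float

@app.get("/daily-totals/{customer_id}", response_model=DailyTotal)
def get_daily_total(customer_id: str) -> DailyTotal:
    # In a real service this would query Redshift, Athena, or Postgres.
    return DailyTotal(customer_id=customer_id, daily_total=0.0)
```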

 

Preferred Skills:

  • Experience in real-time streaming (Kafka, Kinesis, Event Hubs); a minimal consumer sketch follows this list.
  • Expertise in high-performance analytics libraries (Polars, Vaex, Dask).
  • Knowledge of cloud cost optimization, security, and compliance frameworks.
  • Strong full-stack data application experience, integrating backend APIs with frontend dashboards.
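
A minimal streaming-consumer sketch using kafka-python, assuming a hypothetical topic and local broker; the same consume-deserialize-process pattern applies to Kinesis or Event Hubs clients.

```python
# A minimal Kafka consumer sketch (kafka-python). The topic name and
# broker address are hypothetical placeholders.
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders-events",                        # hypothetical topic
    bootstrap_servers=["localhost:9092"],   # hypothetical broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    print(message.value)  # e.g., route into a streaming aggregation
```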