Data Engineer - Python
Colombo, Sri Lanka
Job Title: Data Engineer (Python)
Employment Type: Full-Time
Location: Colombo, Sri Lanka (Hybrid)
About the Role:
We are seeking a Data Engineer with strong big data and cloud experience to architect, develop, and maintain large-scale, cloud-native data platforms. The ideal candidate will have hands-on experience in distributed data processing, big data analytics, ML/AI workflows, and cloud infrastructure across AWS and Azure.
The role requires expertise in advanced analytics, machine learning, RAG, LLMs, AI agents, and full-stack data pipeline development. Frontend experience is a plus for building dashboards and visualization solutions.
Key Responsibilities:
- Big Data & Analytics:
- Develop distributed ETL/ELT pipelines using Apache Spark, Dask, Databricks, Airflow, Polars, Pandas, and SQL (see the sketch below).
- Perform high-performance analytics on large datasets, leveraging batch and streaming pipelines.
- Build data lakes, warehouses, and marts for real-time and batch analytics.
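To illustrate the kind of pipeline this involves, here is a minimal PySpark batch ETL sketch; the S3 paths and the event_ts/status column names are hypothetical placeholders, not part of this posting:

    # Minimal PySpark ETL sketch; paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-etl").getOrCreate()

    # Extract: read raw events from a hypothetical data-lake path.
    events = spark.read.parquet("s3://example-bucket/raw/events/")

    # Transform: keep valid rows and count events per day.
    daily = (
        events
        .filter(F.col("status") == "ok")
        .groupBy(F.to_date("event_ts").alias("event_date"))
        .agg(F.count("*").alias("event_count"))
    )

    # Load: write partitioned results for downstream analytics.
    daily.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-bucket/marts/daily_event_counts/"
    )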
- Web Scraping & Data Acquisition:
- Develop robust web scraping pipelines using Selenium and Zyte for structured and semi-structured data ingestion (a minimal Selenium sketch follows this subsection).
- Ensure scalable, automated, and reliable data collection from diverse web sources.
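A minimal Selenium sketch of such a collector; the target URL and CSS selector are hypothetical:

    # Minimal headless-Chrome scraping sketch (Selenium 4); URL and selector are hypothetical.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run without a visible browser window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/listings")
        # Collect the text of every element matching a hypothetical selector.
        titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".listing-title")]
        print(titles)
    finally:
        driver.quit()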
- Machine Learning & AI:
- Deploy and maintain machine learning models and AI agents using SageMaker, Azure ML, Docker, ECS, and serverless pipelines.
- Implement RAG (Retrieval-Augmented Generation) workflows, LLM integration, and AI agent orchestration for advanced analytics.
- Collaborate with data scientists on feature engineering, model training, and production ML pipelines, orchestrating them with Airflow (see the DAG sketch below).
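A minimal Airflow DAG sketch for such a pipeline; the task bodies are placeholders, and the schedule argument assumes Airflow 2.4+:

    # Minimal Airflow DAG sketch for a daily ML pipeline; task bodies are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_features():
        ...  # placeholder: pull training data and engineer features

    def train_model():
        ...  # placeholder: fit and register the model

    with DAG(
        dag_id="ml_training_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        features = PythonOperator(task_id="extract_features", python_callable=extract_features)
        train = PythonOperator(task_id="train_model", python_callable=train_model)
        features >> train  # run feature extraction before training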
- Cloud & Infrastructure:
- Design and maintain scalable, highly available cloud-native data platforms using AWS and Azure.
- Utilize AWS services: S3, EC2, ECS/Fargate, Lambda, Glue, Redshift, Athena, Kinesis, EMR, SageMaker, CloudFormation, CloudWatch, IAM (an Athena example follows this subsection).
- Utilize Azure services: Blob Storage, Data Lake, Synapse Analytics, Databricks, Event Hubs, Azure ML Studio, Functions, Key Vault, Monitor.
- Implement hybrid cloud architectures for secure, high-throughput data processing.
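As one small example of the AWS stack above, a boto3 sketch that submits a query against a hypothetical Athena table; the database, table, and output bucket are placeholders:

    # Minimal boto3 Athena sketch; database, table, and bucket names are hypothetical.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString="SELECT event_date, event_count FROM analytics.daily_event_counts LIMIT 10",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    print(response["QueryExecutionId"])  # poll this ID to fetch results when the query finishes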
- DevOps & Automation:
- Implement CI/CD pipelines, containerization, orchestration, and serverless workflows using Docker, Kubernetes, ECS/EKS, CloudFormation, and the Serverless Framework; Terraform is a plus.
- Automate monitoring, alerting, logging, and cost optimization using CloudWatch, and monitor Databricks workloads (see the alarm sketch below).
- Ensure data security, governance, and compliance across pipelines.
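A minimal boto3 sketch for the CloudWatch alerting mentioned above; the function name, threshold, and SNS topic ARN are placeholders:

    # Minimal CloudWatch alarm sketch; alarm name, function, threshold, and ARN are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="etl-lambda-error-rate",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "daily-etl"}],
        Statistic="Sum",
        Period=300,                # evaluate errors over 5-minute windows
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder SNS topic
    )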
- Frontend & Visualization (Optional):
- Build interactive dashboards using Power BI, Tableau, or React/Next.js.
- Enable self-service analytics for stakeholders.
- Collaboration:
- Work closely with teams of engineers, data scientists, and analysts to deliver end-to-end cloud data solutions.
- Define and follow best practices in distributed computing, analytics, and ML workflows.
Required Skills:
- Big Data & Distributed Computing: Apache Spark, Dask, Databricks, Airflow, Polars, Pandas, SQL, PostgreSQL.
- Cloud Platforms: AWS (S3, EC2, ECS/Fargate, Lambda, Glue, Redshift, Athena, Kinesis, EMR, SageMaker) and Azure (Blob Storage, Data Lake, Synapse, Databricks, Event Hubs, ML Studio).
- Programming & API Development: Python (advanced), Node.js, FastAPI (a FastAPI sketch follows this list).
- DevOps & Infrastructure: Docker, ECS/EKS, Kubernetes, Serverless Framework, Terraform, CI/CD pipelines, monitoring/logging.
- Machine Learning & AI: ML deployment, feature engineering, RAG, LLMs, AI agents, SageMaker/Azure ML integration.
- Analytics & Visualization: Power BI, Tableau, Excel (advanced), React/Next.js (optional).
- Other: FTP/SFTP ingestion, ETL automation, high-performance computing, data governance, and security best practices.
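For the API development skills listed above, a minimal FastAPI sketch; the routes and payload model are illustrative only:

    # Minimal FastAPI sketch; the endpoints and Metric model are illustrative.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Metric(BaseModel):
        name: str
        value: float

    @app.get("/health")
    def health() -> dict:
        return {"status": "ok"}

    @app.post("/metrics")
    def ingest_metric(metric: Metric) -> dict:
        # Placeholder: in a real pipeline this would land in a queue or warehouse.
        return {"received": metric.name}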
Preferred Skills:
- Experience in real-time streaming (Kafka, Kinesis, Event Hubs).
- Expertise in high-performance analytics libraries (Polars, Vaex, Dask); a Polars sketch follows this list.
- Knowledge of cloud cost optimization, security, and compliance frameworks.
- Strong full-stack data application experience, integrating backend APIs with frontend dashboards.
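A minimal Polars lazy-query sketch illustrating the high-performance analytics style referenced above; the file path and columns are hypothetical, and group_by/pl.len assume a recent Polars release:

    # Minimal Polars lazy-query sketch; file path and column names are hypothetical.
    import polars as pl

    result = (
        pl.scan_csv("events.csv")               # lazy scan: nothing is read yet
        .filter(pl.col("status") == "ok")
        .group_by("country")
        .agg(pl.len().alias("event_count"))
        .collect()                              # execute the optimized plan
    )
    print(result)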