Data Engineer - Python
Colombo, Sri Lanka
Job Title: Data Engineer (Python)
Employment Type: Full-Time
Location: Colombo, Sri Lanka (Hybrid)
About the Role:
We are seeking a Data Engineer with strong big data and cloud experience to architect, develop, and maintain large-scale, cloud-native data platforms. The ideal candidate will have hands-on experience in distributed data processing, big data analytics, ML/AI workflows, and cloud infrastructure across AWS and Azure.
The role requires expertise in advanced analytics, machine learning, RAG, LLMs, AI agents, and full-stack data pipeline development. Frontend experience is a plus for building dashboards and visualization solutions.
Key Responsibilities:
- Big Data & Analytics:
- Develop distributed ETL/ELT pipelines using Apache Spark, Dask, Databricks, Airflow, Polars, Pandas, and SQL (see the sketch below).
- Perform high-performance analytics on large datasets, leveraging batch and streaming pipelines.
- Build data lakes, warehouses, and marts for real-time and batch analytics.
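To illustrate the kind of pipeline this involves, here is a minimal PySpark batch ETL sketch; the S3 paths and the event_ts/status column names are hypothetical placeholders, not part of this posting:

    # Minimal PySpark ETL sketch; paths and column names are hypothetical.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("daily-etl").getOrCreate()

    # Extract: read raw events from a hypothetical data-lake path.
    events = spark.read.parquet("s3://example-bucket/raw/events/")

    # Transform: keep valid rows and count events per day.
    daily = (
        events
        .filter(F.col("status") == "ok")
        .groupBy(F.to_date("event_ts").alias("event_date"))
        .agg(F.count("*").alias("event_count"))
    )

    # Load: write partitioned results for downstream analytics.
    daily.write.mode("overwrite").partitionBy("event_date").parquet(
        "s3://example-bucket/marts/daily_event_counts/"
    )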
- Web Scraping & Data Acquisition:
- Develop robust web scraping pipelines using Selenium and Zyte for structured and semi-structured data ingestion (a minimal Selenium sketch follows this subsection).
- Ensure scalable, automated, and reliable data collection from diverse web sources.
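A minimal Selenium sketch of such a collector; the target URL and CSS selector are hypothetical:

    # Minimal headless-Chrome scraping sketch (Selenium 4); URL and selector are hypothetical.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless=new")  # run without a visible browser window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com/listings")
        # Collect the text of every element matching a hypothetical selector.
        titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".listing-title")]
        print(titles)
    finally:
        driver.quit()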
- Machine Learning & AI:
- Deploy and maintain machine learning models and AI agents using SageMaker, Azure ML, Docker, ECS, and serverless pipelines.
- Implement RAG (Retrieval-Augmented Generation) workflows, LLM integration, and AI agent orchestration for advanced analytics.
- Collaborate with data scientists on feature engineering, model training, and production ML pipelines, orchestrating them with Airflow (see the DAG sketch below).
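A minimal Airflow DAG sketch for such a pipeline; the task bodies are placeholders, and the schedule argument assumes Airflow 2.4+:

    # Minimal Airflow DAG sketch for a daily ML pipeline; task bodies are placeholders.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_features():
        ...  # placeholder: pull training data and engineer features

    def train_model():
        ...  # placeholder: fit and register the model

    with DAG(
        dag_id="ml_training_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",
        catchup=False,
    ) as dag:
        features = PythonOperator(task_id="extract_features", python_callable=extract_features)
        train = PythonOperator(task_id="train_model", python_callable=train_model)
        features >> train  # run feature extraction before training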
- Cloud & Infrastructure:
- Design and maintain scalable, highly available cloud-native data platforms using AWS and Azure.
- Utilize AWS services: S3, EC2, ECS/Fargate, Lambda, Glue, Redshift, Athena, Kinesis, EMR, SageMaker, CloudFormation, CloudWatch, IAM (an Athena example follows this subsection).
- Utilize Azure services: Blob Storage, Data Lake, Synapse Analytics, Databricks, Event Hubs, Azure ML Studio, Functions, Key Vault, Monitor.
- Implement hybrid cloud architectures for secure, high-throughput data processing.
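As one small example of the AWS stack above, a boto3 sketch that submits a query against a hypothetical Athena table; the database, table, and output bucket are placeholders:

    # Minimal boto3 Athena sketch; database, table, and bucket names are hypothetical.
    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString="SELECT event_date, event_count FROM analytics.daily_event_counts LIMIT 10",
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
    )
    print(response["QueryExecutionId"])  # poll this ID to fetch results when the query finishes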
- DevOps & Automation:
- Implement CI/CD pipelines, containerization, orchestration, and serverless workflows using Docker, Kubernetes, ECS/EKS, CloudFormation, and the Serverless Framework; Terraform is a plus.
- Automate monitoring, alerting, logging, and cost optimization using CloudWatch, and monitor Databricks workloads (see the alarm sketch below).
- Ensure data security, governance, and compliance across pipelines.
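A minimal boto3 sketch for the CloudWatch alerting mentioned above; the function name, threshold, and SNS topic ARN are placeholders:

    # Minimal CloudWatch alarm sketch; alarm name, function, threshold, and ARN are placeholders.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    cloudwatch.put_metric_alarm(
        AlarmName="etl-lambda-error-rate",
        Namespace="AWS/Lambda",
        MetricName="Errors",
        Dimensions=[{"Name": "FunctionName", "Value": "daily-etl"}],
        Statistic="Sum",
        Period=300,                # evaluate errors over 5-minute windows
        EvaluationPeriods=1,
        Threshold=1,
        ComparisonOperator="GreaterThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:alerts"],  # placeholder SNS topic
    )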
- Frontend & Visualization (Optional):
- Build interactive dashboards using Power BI, Tableau, or React/Next.js.
- Enable self-service analytics for stakeholders.
- Collaboration:
- Work closely with teams of engineers, data scientists, and analysts to deliver end-to-end cloud data solutions.
- Define and follow best practices in distributed computing, analytics, and ML workflows.
Required Skills:
- Big Data & Distributed Computing: Apache Spark, Dask, Databricks, Airflow, Polars, Pandas, SQL, PostgreSQL.
- Cloud Platforms: AWS (S3, EC2, ECS/Fargate, Lambda, Glue, Redshift, Athena, Kinesis, EMR, SageMaker) and Azure (Blob Storage, Data Lake, Synapse, Databricks, Event Hubs, ML Studio).
- Programming & API Development: Python (advanced), Node.js, FastAPI (a FastAPI sketch follows this list).
- DevOps & Infrastructure: Docker, ECS/EKS, Kubernetes, Serverless Framework, Terraform, CI/CD pipelines, monitoring/logging.
- Machine Learning & AI: ML deployment, feature engineering, RAG, LLMs, AI agents, SageMaker/Azure ML integration.
- Analytics & Visualization: Power BI, Tableau, Excel (advanced), React/Next.js (optional).
- Other: FTP/SFTP ingestion, ETL automation, high-performance computing, data governance, and security best practices.
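For the API development skills listed above, a minimal FastAPI sketch; the routes and payload model are illustrative only:

    # Minimal FastAPI sketch; the endpoints and Metric model are illustrative.
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class Metric(BaseModel):
        name: str
        value: float

    @app.get("/health")
    def health() -> dict:
        return {"status": "ok"}

    @app.post("/metrics")
    def ingest_metric(metric: Metric) -> dict:
        # Placeholder: in a real pipeline this would land in a queue or warehouse.
        return {"received": metric.name}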
Preferred Skills:
- Experience in real-time streaming (Kafka, Kinesis, Event Hubs).
- Expertise in high-performance analytics libraries (Polars, Vaex, Dask); a Polars sketch follows this list.
- Knowledge of cloud cost optimization, security, and compliance frameworks.
- Strong full-stack data application experience, integrating backend APIs with frontend dashboards.
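A minimal Polars lazy-query sketch illustrating the high-performance analytics style referenced above; the file path and columns are hypothetical, and group_by/pl.len assume a recent Polars release:

    # Minimal Polars lazy-query sketch; file path and column names are hypothetical.
    import polars as pl

    result = (
        pl.scan_csv("events.csv")               # lazy scan: nothing is read yet
        .filter(pl.col("status") == "ok")
        .group_by("country")
        .agg(pl.len().alias("event_count"))
        .collect()                              # execute the optimized plan
    )
    print(result)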