AI-Driven Big Data Engineer (PhD Required)
Employment Type: Full-Time
Location: Remote, Singapore
Level: Entry to Mid-Level (PhD Required)
Bridge Cutting-Edge AI Research with Petabyte-Scale Data Systems
Pixalate is an online trust and safety platform that protects businesses, consumers, and children from deceptive, fraudulent, and non-compliant mobile apps, CTV apps, and websites. We're seeking a PhD-level Big Data Engineer to push the boundaries of how AI transforms massive-scale data operations.
Our impact is real and measurable. Our findings have driven headlines such as:
- Gizmodo: An iCloud Feature Is Enabling a $65 Million Scam
- Washington Post: Your kids' apps are spying on them
- ProPublica: Porn, Piracy, Fraud: What Lurks Inside Google's Black Box Ad Empire
About the Role
Work at the intersection of big data and AI, where you'll develop intelligent, self-healing data systems processing trillions of data points daily. You'll have autonomy to pursue research in distributed ML systems and AI-enhanced data optimization, with your innovations deployed at unprecedented scale within months, not years.
This isn't traditional data engineering - you'll implement agentic AI for autonomous pipeline management, leverage LLMs for data quality assurance, and create ML-optimized architectures that redefine what's possible at petabyte scale.
Key Research Areas & Responsibilities
AI-Enhanced Data Infrastructure
- Design intelligent pipelines with autonomous optimization and self-healing capabilities using agentic AI
- Implement ML-driven anomaly detection for terabyte-scale datasets
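To give a flavor of this work, here is a minimal, hypothetical sketch of ML-driven anomaly detection using scikit-learn from our stack. The column names, sample values, and contamination setting are illustrative only; at production scale this logic runs on distributed engines rather than a single pandas DataFrame.

```python
# Hypothetical sketch of ML-driven anomaly detection on ad-traffic-style data.
# Column names ("impressions", "clicks", "unique_ips") and values are placeholders;
# production detection at terabyte scale runs on distributed engines, not pandas.
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.DataFrame({
    "impressions": [1200, 1350, 980, 250000, 1100],
    "clicks":      [30, 42, 25, 240000, 33],
    "unique_ips":  [900, 1000, 760, 120, 850],
})

# Fit an unsupervised detector; the contamination rate is a tunable assumption.
model = IsolationForest(contamination=0.2, random_state=0)
df["anomaly"] = model.fit_predict(df[["impressions", "clicks", "unique_ips"]])

# Rows flagged -1 are likely anomalies (here, the bot-like click burst).
print(df[df["anomaly"] == -1])
```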
Distributed Machine Learning at Scale
- Build end-to-end distributed ML pipelines using tools such as Spark, Ray, and Dask
- Develop real-time feature stores for billions of transactions
- Optimize feature engineering with AutoML and neural architecture search
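As a small illustration of AutoML-style search, the sketch below uses Optuna and scikit-learn from the stack listed under Technical Expertise to jointly tune a feature-selection step and a model hyperparameter. The dataset, search space, and objective are placeholders, not our actual pipeline.

```python
# Hypothetical sketch: Optuna searching a feature-selection choice and a model
# hyperparameter together. Data and search space are placeholders only.
import optuna
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def objective(trial):
    # Search over how many features to keep and how deep the forest grows.
    k = trial.suggest_int("k_features", 5, 20)
    depth = trial.suggest_int("max_depth", 2, 16)
    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=k)),
        ("model", RandomForestClassifier(max_depth=depth, random_state=0)),
    ])
    return cross_val_score(pipe, X, y, cv=3).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```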
Required Qualifications
Education & Research
- PhD in Computer Science, Data Science, or Distributed Systems (exceptional Master's with research experience considered)
- Published research or expertise in distributed computing, ML infrastructure, or stream processing
Technical Expertise
- Core Languages: Expert SQL (window functions, CTEs), Python (Pandas, Polars, PyArrow), Scala/Java
- Big Data Stack: Spark 3.5+, Flink, Kafka, Ray, Dask
- Storage & Orchestration: Delta Lake, Iceberg, Airflow, Dagster, Temporal
- Cloud Platforms: GCP (BigQuery, Dataflow, Vertex AI), AWS (EMR, SageMaker), Azure (Databricks)
- ML Systems: MLflow, Kubeflow, Feature Stores, Vector Databases, scikit-learn (GridSearchCV/RandomizedSearchCV), H2O AutoML, auto-sklearn, GCP Vertex AI AutoML Tables
- Neural Architecture Search: KerasTuner, AutoKeras, Ray Tune, Optuna, PyTorch Lightning + Hydra
Research Skills
- Track record with 100TB+ datasets
- Experience with lakehouse architectures, streaming ML, and graph processing at scale
- Understanding of distributed systems theory and ML algorithm implementation
Preferred Qualifications
- Experience applying LLMs to data engineering challenges
- Ability to translate complex AutoML/NAS research into practical production workflows
- Hands-on project examples of feature engineering automation or NAS experiments
- Proven success automating ML pipelines end to end, from raw data to an optimized model architecture
- Contributions to Apache projects (Spark, Flink, Kafka)
- Knowledge of privacy-preserving techniques and data mesh architectures
What Makes This Role Unique
You'll work with one of the few truly petabyte-scale production datasets outside of major tech companies, with the freedom to experiment with cutting-edge approaches. Unlike traditional big data roles, you'll apply the latest AI research to fundamental data challenges - from using LLMs to understand data quality issues to implementing agentic systems that autonomously optimize and heal data pipelines.