The role includes:
- Designing a web-to-profile pipeline with distributed crawling and deduplication (see the dedup sketch below)
- Building LLM enrichment chains with RAG, using frameworks such as LangChain, Semantic Kernel, or LlamaIndex (retrieval sketch below)
- Architecting a multimodal knowledge base across Postgres, pgvector, Cosmos DB Graph, and DuckDB (query sketch below)
- Integrating finance, trade, and product data from external sources
- Leading hands-on data ops across GitHub Actions, dbt, Airflow, and OpenLineage (DAG sketch below)
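
To make these responsibilities concrete, a few minimal sketches follow; they are illustrations under stated assumptions, not the team's actual implementation. First, the deduplication step: a content-fingerprint check where the in-memory set stands in for what would be a shared store in a distributed crawler.

```python
import hashlib

# In a distributed crawler this set would live in a shared store (e.g. Redis);
# an in-memory set keeps the sketch self-contained.
seen_fingerprints: set[str] = set()

def content_fingerprint(text: str) -> str:
    """Hash whitespace-normalized, lowercased text so near-identical pages collide."""
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_duplicate(page_text: str) -> bool:
    """Return True if an equivalent page was already crawled, else record it."""
    fp = content_fingerprint(page_text)
    if fp in seen_fingerprints:
        return True
    seen_fingerprints.add(fp)
    return False
```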
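For the RAG enrichment chains, a framework-agnostic sketch of the retrieve-then-generate loop; `embed` and `llm_complete` are hypothetical callables standing in for whichever framework (LangChain, Semantic Kernel, LlamaIndex) is chosen.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Doc:
    text: str
    embedding: list[float]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

def enrich(question: str, corpus: list[Doc],
           embed: Callable[[str], list[float]],
           llm_complete: Callable[[str], str],
           k: int = 3) -> str:
    """Retrieve the k most similar docs, then ground the generation in them."""
    q = embed(question)
    top = sorted(corpus, key=lambda d: cosine(q, d.embedding), reverse=True)[:k]
    context = "\n---\n".join(d.text for d in top)
    return llm_complete(f"Using only this context:\n{context}\n\nQuestion: {question}")
```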
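For the pgvector leg of the knowledge base, a cosine-distance lookup via psycopg 3; the `profiles` table and `embedding` column are assumed names.

```python
import psycopg  # psycopg 3; requires the pgvector extension installed in Postgres

def nearest_profiles(conn: psycopg.Connection,
                     query_embedding: list[float], k: int = 5) -> list[tuple]:
    """Return the k profiles nearest by pgvector cosine distance (<=>)."""
    vec = "[" + ",".join(map(str, query_embedding)) + "]"  # pgvector text literal
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, name FROM profiles "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (vec, k),
        )
        return cur.fetchall()
```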
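On the data-ops side, a minimal Airflow 2.x TaskFlow DAG; the task bodies are placeholders, and dbt runs and OpenLineage emission would hook in through their own operators and listeners.

```python
from datetime import datetime
from airflow.decorators import dag, task

@dag(schedule="@daily", start_date=datetime(2024, 1, 1), catchup=False)
def profile_refresh():
    @task
    def extract() -> list[dict]:
        # Placeholder: pull newly crawled pages from a staging area.
        return [{"url": "https://example.com", "text": "..."}]

    @task
    def load(rows: list[dict]) -> None:
        # Placeholder: upsert enriched rows into Postgres.
        print(f"loaded {len(rows)} rows")

    load(extract())

profile_refresh()
```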

The role requires:
- Experience with Python (typed, async) and Spark or Dask (concurrency sketch below)
- Strong system design and data architecture experience at scale
- Deep understanding of LLM pipelines and prompt/tool orchestration (tool-dispatch sketch below)
- Familiarity with Microsoft Fabric, Azure Blob/Data Lake, or similar platforms
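
As a small illustration of the typed, async Python expected here, bounded-concurrency fetching with `asyncio`; the `fetch` body is a stub standing in for a real HTTP client call.

```python
import asyncio

async def fetch(url: str, sem: asyncio.Semaphore) -> str:
    async with sem:  # cap in-flight requests so hosts aren't hammered
        await asyncio.sleep(0.1)  # stand-in for a real HTTP request
        return f"<fetched {url}>"

async def crawl(urls: list[str], limit: int = 10) -> list[str]:
    sem = asyncio.Semaphore(limit)
    return list(await asyncio.gather(*(fetch(u, sem) for u in urls)))

if __name__ == "__main__":
    pages = asyncio.run(crawl([f"https://example.com/{i}" for i in range(25)]))
    print(len(pages), "pages fetched")
```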
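And for prompt/tool orchestration, a minimal dispatch loop over a tool registry; the JSON call format and the `lookup_ticker` tool are hypothetical.

```python
import json
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function under its name so the model can request it."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def lookup_ticker(company: str) -> str:
    return json.dumps({"company": company, "ticker": "EXMP"})  # stubbed data

def dispatch(model_output: str) -> str:
    """Execute a model-proposed call shaped like {"tool": ..., "args": {...}}."""
    call = json.loads(model_output)
    return TOOLS[call["tool"]](**call["args"])

print(dispatch('{"tool": "lookup_ticker", "args": {"company": "Example Corp"}}'))
```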