Omar Khattab

okhattab@stanford.edu
Curriculum Vitae
Google Scholar
GitHub
Twitter

 

 

I’m a graduating CS Ph.D. candidate at Stanford NLP and a Research Scientist at Databricks. I’m interested in Natural Language Processing (NLP) at scale. My research creates models, algorithms, and programming abstractions for building reliable, transparent, and scalable NLP systems. This often takes the form of systems capable of retrieval and reasoning, which can leverage massive text corpora to craft knowledgeable responses efficiently and transparently.

I’m advised by Matei Zaharia and Christopher Potts. Before coming to Stanford, I got my B.S. in CS in May 2019 from Carnegie Mellon University Qatar, where I was supervised by Mohammad Hammoud. My Ph.D. has been generously supported by the Eltoukhy Family Graduate Fellowship and then the Apple Scholars in AI/ML PhD Fellowship.

I will start as an Assistant Professor at MIT EECS in Fall 2025.


Research

My research spans two overarching directions, consolidated in two influential open-source research systems.

      

Together, DSPy and ColBERT receive well over one million downloads per month and serve as the basis of research and applications at Google, Amazon, IBM, VMware, Databricks, Baidu, AliExpress, and dozens of startups.

I) Building Reliable AI Systems with Language Models

I built the DSPy framework, a programming model for expressing and automatically optimizing Natural Language Programs, i.e. sophisticated pipelines of language models, retrieval models, and other tools. In this line of work, my research develops:

Natural Language Programs and their abstractions & optimizers, as in DSPy (ICLR’24 Spotlight) and its predecessor Demonstrate–Search–Predict. This also includes state-of-the-art LM programs like STORM (NAACL’24), IReRa, and PATH, as well as optimizers like MIPRO and BetterTogether.

Retrieval-based NLP Systems like ColBERT-QA (TACL’21), Baleen (NeurIPS’21 Spotlight), Hindsight (ICLR’22), and ARES (NAACL’24).

II) Developing Effective & Efficient Retrieval Models

I built the ColBERT retrieval model, which has been central to the development of the modern landscape of information retrieval. In this line of work, my research develops:

Retrieval Models like ColBERT (SIGIR’20), ColBERTv2 (NAACL’22), and UDAPDR (EMNLP’23).

Scalable Retrieval Infrastructure like PLAID (CIKM’22) and DeepImpact (SIGIR’21).


Papers

 

Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together
D Soylu, C Potts, O Khattab
Preprint 2024 | paper

Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels
J Xian, S Samuel, F Khoubsirat, R Pradeep, …, A Sil, C Potts, O Khattab
Preprint 2024 | paper

Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
K Opsahl-Ong, M Ryan, J Purtell, D Broman, C Potts, M Zaharia, O Khattab
Preprint 2024 | paper

Backtracing: Retrieving the Cause of the Query
R Wang, P Wirawarn, O Khattab, N Goodman, D Demszky
EACL Findings 2024 | paper

Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
Y Shao, Y Jiang, T Kanell, P Xu, O Khattab, M Lam
NAACL 2024 | paper

ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems
J Saad-Falcon, O Khattab, M Zaharia, C Potts
NAACL 2024 | paper

In-Context Learning for Extreme Multi-Label Classification
K D’Oosterlinck, O Khattab, F Remy, T Demeester, C Develder, C Potts
Preprint 2024 | paper

Building Efficient and Effective OpenQA Systems for Low-Resource Languages
E Budur, R Özçelik, D Soylu, O Khattab, T Güngör, C Potts
Preprint 2024 | paper

DSPy Assertions: Computational Constraints for Self-Refining Language Model Pipelines
A Singhvi, M Shetty, S Tan, C Potts, K Sen, M Zaharia, O Khattab
Preprint 2023 | paper

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
O Khattab, A Singhvi, P Maheshwari, Z Zhang, K Santhanam, S Vardhamanan, S Haq, A Sharma, T Joshi, H Moazam, H Miller, M Zaharia, C Potts
ICLR 2024 (Spotlight) | paper

Image and Data Mining in Reticular Chemistry Using GPT-4V
Z Zheng, Z He, O Khattab, N Rampal, M Zaharia, C Borgs, J Chayes, O Yaghi
Digital Discovery 2024 | paper

UDAPDR: Unsupervised Domain Adaptation via LLM Prompting and Distillation of Rerankers
J Saad-Falcon, O Khattab, K Santhanam, R Florian, M Franz, S Roukos, A Sil, M Sultan, C Potts
EMNLP 2023 | paper

Resources and Evaluations for Multi-Distribution Dense Information Retrieval
S Chatterjee, O Khattab, S Arora
SIGIR REML 2023 | paper

Moving Beyond Downstream Task Accuracy for Information Retrieval Benchmarking
K Santhanam, J Saad-Falcon, M Franz, O Khattab, A Sil, R Florian, S Roukos, M Sultan, M Zaharia, C Potts
ACL Findings 2023 | paper

Holistic Evaluation of Language Models
P Liang, R Bommasani, T Lee, D Tsipras, D Soylu, …, O Khattab, …, Y Zhang, Y Koreeda
TMLR 2023 | paper
Note: This is a multi-component, 50-author project. O Khattab directed the Information Retrieval evaluation.

Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP
O. Khattab, K. Santhanam, X. Li, P. Liang, C. Potts, M. Zaharia
ArXiv 2022 | paper | code

PLAID: An Efficient Engine for Late Interaction Retrieval
K. Santhanam*, O. Khattab*, C. Potts, M. Zaharia
CIKM 2022 | paper | (* denotes co-first authors)

ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction
K. Santhanam*, O. Khattab*, J. Saad-Falcon, C. Potts, M. Zaharia
NAACL 2022 | paper | (* denotes co-first authors)

Introducing Neural Bag of Whole-Words with ColBERTer: Contextualized Late Interactions using Enhanced Reduction
S. Hofstätter, O. Khattab, S. Althammer, M. Sertkan, A. Hanbury
CIKM 2022 | paper

Hindsight: Posterior-guided Training of Retrievers for Improved Open-Ended Generation
A. Paranjape, O. Khattab, C. Potts, M. Zaharia, C. D. Manning
ICLR 2022 | preprint

Baleen: Robust Multi-Hop Reasoning at Scale via Condensed Retrieval
O. Khattab, C. Potts, M. Zaharia
NeurIPS 2021 (Spotlight) | preprint | HoVer leaderboard entry

On the Opportunities and Risks of Foundation Models
Stanford’s Center for Research on Foundation Models (CRFM), with 113 co-authors
Contributions to: Systems, Modeling, and Reasoning & Search
ArXiv 2021 | paper

Relevance-guided Supervision for OpenQA with ColBERT
O. Khattab, C. Potts, M. Zaharia
TACL 2021 | paper

Learning Passage Impacts for Inverted Indexes
A. Mallia, O. Khattab, N. Tonellotto, T. Suel
SIGIR 2021 (short) | paper

ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
O. Khattab and M. Zaharia
SIGIR 2020 | paper | code

Finding the Best of Both Worlds: Faster and More Robust Top-k Document Retrieval
O. Khattab, M. Hammoud, and T. Elsayed
SIGIR 2020 | paper

PolyHJ: A Polymorphic Main-Memory Hash Join Paradigm for Multi-Core Machines
O. Khattab, M. Hammoud, and O. Shekfeh
CIKM 2018 | paper | code

LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System
Y. Ahmad, O. Khattab, A. Malik, A. Musleh, M. Hammoud, M. Kutlu, M. Shehata, T. Elsayed
VLDB 2018 | paper | code


Blog Posts

The Shift from Models to Compound AI Systems
M. Zaharia, O. Khattab, L. Chen, J. Q. Davis, H. Miller, C. Potts, J. Zou, M. Carbin, J. Frankle, N. Rao, A. Ghodsi
Berkeley Artificial Intelligence Research | post

A Guide to Large Language Model Abstractions
P. Y. Zhong, H. He, O. Khattab, C. Potts, M. Zaharia, H. Miller
Two Sigma Articles | post

Building Scalable, Explainable, and Adaptive NLP Models with Retrieval
O. Khattab, C. Potts, M. Zaharia
Stanford AI Lab (SAIL) blog | post

A Moderate Proposal for Radically Better AI-powered Web Search
O. Khattab, C. Potts, M. Zaharia
Stanford HAI blog | post

Last Update: Mar 2024