Research

PhD Research

The COVID-19 pandemic profoundly affected me and inspired my research journey. I realized that a language model trained efficiently on COVID-19 data could rapidly analyze that data and offer valuable insights for vaccine development. Motivated by this idea, my research focuses on optimizing language models for specialized domains such as healthcare. My long-term vision is to leverage these models for scientific discovery.

Under this broader umbrella, I developed methods for synthetic data generation with meta-learning-based feedback mechanisms, continual learning in dynamic environments, and the security and safety of large language models. My PhD thesis is available here: [PDF].

Watermarking in LLMs

As LLMs are increasingly used to generate synthetic datasets, distinguishing human-curated text from machine-generated text is crucial to avoid misinformation. This distinction is particularly vital in specialized domains such as healthcare, where the authenticity and reliability of data are of utmost importance.

Continual Learning of Language Models

In dynamic and evolving data environments, language models must adapt to new data without losing accuracy on prior data. We explore how to efficiently select a sub-network that can be fine-tuned on new data while retaining prior knowledge.
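To make the idea of fine-tuning only a sub-network concrete, here is a minimal sketch in plain Python. It is not the method from my work: the selection criterion (keeping the parameters with the highest importance score, for example gradient magnitude on the new data) and the names `select_subnetwork` and `masked_sgd_step` are assumptions chosen for illustration.

```python
def select_subnetwork(scores, k):
    """Return a 0/1 mask that keeps the k highest-scoring parameters.

    `scores` is a hypothetical per-parameter importance score
    (assumed here, e.g. gradient magnitude on the new data).
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(scores))]


def masked_sgd_step(params, grads, mask, lr):
    """Apply one SGD step only to parameters inside the mask.

    Parameters with mask 0 stay frozen, which is how this sketch
    preserves what the model learned on earlier data.
    """
    return [p - lr * g * m for p, g, m in zip(params, grads, mask)]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.3, 0.7]
    mask = select_subnetwork(scores, k=2)
    updated = masked_sgd_step([1.0, 1.0, 1.0, 1.0], [0.5, 0.5, 0.5, 0.5], mask, lr=0.1)
    print(mask, updated)
```

Freezing everything outside the mask is the simplest way to protect prior knowledge; how the mask is chosen is the part that determines the trade-off between adaptation and retention.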

Parameter-Efficient Fine-Tuning Methods

As models grow in parameter count, full fine-tuning becomes computationally expensive. Parameter-efficient fine-tuning (PEFT) methods become invaluable in such settings, and we explored whether the optimal LoRA rank can be learned for each downstream task.
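For readers unfamiliar with LoRA, the sketch below shows why the rank matters. It uses the standard LoRA formulation, in which a frozen weight W is adapted as W + (alpha / r) · B A with B of shape d_out × r and A of shape r × d_in. The rank is treated as a fixed hyperparameter here; learning it per task, as described above, is not shown. The function names and the example sizes are illustrative assumptions.

```python
def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in a LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, b, a, alpha):
    """Return W + (alpha / rank) * B @ A, the weight the adapted layer applies."""
    rank = len(a)  # A has one row per rank dimension
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    # Illustrative sizes: one 4096 x 4096 projection matrix.
    full = 4096 * 4096
    for r in (1, 8, 64):
        print(r, lora_param_count(4096, 4096, r), full)
```

Because the adapter size grows linearly with the rank, a rank that is too large wastes compute while one that is too small may underfit, which is what makes a per-task choice of rank worth studying.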

Synthetic Data Generation

In specialized domains, labeled training data is often limited, especially for emerging diseases, where timely and extensive data collection poses significant challenges. I explored whether feedback from a downstream model can improve the generated data and whether gradients of unseen tokens can be synthesized through task-driven optimization.