Research

PhD Research

The COVID-19 pandemic had a profound impact on me and inspired my research journey. I realized that an efficiently trained language model for COVID-19 could rapidly analyze data and offer valuable insights for vaccine development. Motivated by this idea, my research focuses on optimizing language models for specialized domains such as healthcare. My long-term vision is to leverage these models for scientific discovery.

Under this broader umbrella, I developed methods for synthetic data generation with meta-learning-based feedback mechanisms, continual learning in dynamic environments, and the security and safety of large language models. My PhD thesis is available here: [PDF].

Watermarking in LLMs

As LLMs are increasingly used to generate synthetic datasets, distinguishing human-curated text from machine-generated text is crucial to avoid misinformation. This distinction is particularly vital in specialized domains such as healthcare, where the authenticity and reliability of data are of utmost importance.

Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models

Mingjia Huo*, Sai Ashish Somayajula*, Youwei Liang, Ruisi Zhang, Farinaz Koushanfar, Pengtao Xie

International Conference on Machine Learning (ICML), 2024

pdf / code

A multi-objective optimization-based token-specific watermarking method to study and improve both watermark detectability and generation quality.
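For context, one widely used family of LLM watermarks splits the vocabulary into "green" and "red" tokens at each step and biases generation toward green tokens. Detection then checks whether a text contains more green tokens than chance would predict. The sketch below computes that test statistic for a fixed green fraction. It is a generic illustration only, not the token-specific method of the paper above; the function name and the example numbers are hypothetical.

```python
import math


def watermark_z_score(green_count: int, total_tokens: int, gamma: float) -> float:
    """One-proportion z-score for a green-list watermark test.

    green_count  : number of scored tokens that fall in the green list
    total_tokens : number of tokens scored
    gamma        : fraction of the vocabulary marked green at each step
                   (assumed constant here, which is a simplification)
    """
    if total_tokens <= 0:
        raise ValueError("total_tokens must be positive")
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    expected = gamma * total_tokens
    std = math.sqrt(total_tokens * gamma * (1.0 - gamma))
    return (green_count - expected) / std


# Hypothetical example: 140 green tokens out of 200 with half the vocabulary green.
z = watermark_z_score(green_count=140, total_tokens=200, gamma=0.5)
print(f"z = {z:.2f}")
```

A larger z-score is stronger evidence that the text was watermarked. The tension between a strong signal and natural-sounding text is the trade-off that the multi-objective formulation above targets.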

Continual Learning of Language Models

In dynamic and evolving data environments, language models must adapt to new data without losing accuracy on prior data. We explore how to efficiently select a sub-network that can be fine-tuned on new data while retaining prior knowledge.

Generalizable and Stable Finetuning of Pretrained Language Models on Low-Resource Texts

Sai Ashish Somayajula, Youwei Liang, Li Zhang, Abhishek Singh, Pengtao Xie

Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

pdf / code

A bi-level optimization-based approach that fine-tunes an automatically chosen sub-network within a pre-trained language model on low-resource datasets to mitigate overfitting and reduce the standard deviation of the results.

Parameter Efficient Fine-tuning Methods

As model parameter counts scale up, fine-tuning becomes computationally expensive. Parameter-efficient fine-tuning (PEFT) methods are invaluable in such situations, and we explored whether the optimal LoRA rank can be learned for each downstream task.

AutoLoRA: Automatically Tuning Matrix Ranks in Low-Rank Adaptation Based on Meta Learning

Ruiyi Zhang, Rushi Qiang, Sai Ashish Somayajula, Pengtao Xie

Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2024

pdf / code

AutoLoRA is a meta-learning framework designed to automatically determine the optimal rank for each low-rank adaptation matrix.
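To illustrate why the rank matters, the sketch below counts trainable parameters for a single weight matrix under standard LoRA, where the update is factored as B @ A with B of shape (d_out, rank) and A of shape (rank, d_in). It shows only the generic LoRA cost trade-off, not AutoLoRA's rank-selection procedure; the function names and the layer size are hypothetical.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for a rank-`rank` update B @ A."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    return rank * (d_in + d_out)


# Hypothetical layer size and candidate ranks, chosen only for illustration.
d_in, d_out = 768, 768
for rank in (1, 4, 16, 64):
    fraction = lora_params(d_in, d_out, rank) / full_finetune_params(d_in, d_out)
    print(f"rank={rank:>2}  trainable fraction={fraction:.4f}")
```

The cost grows linearly with the rank, so a rank chosen per matrix can spend parameters where they help and save them elsewhere.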

Synthetic Data Generation

In specialized domains, labeled training data is often limited, especially for emergent diseases, where timely and extensive data collection poses significant challenges. I explored whether feedback from the downstream model can improve generated data and whether gradients of unseen tokens can be synthesized through task-driven optimization.

Improving Long COVID-Related Text Classification: A Novel End-to-End Domain-Adaptive Paraphrasing Framework

Sai Ashish Somayajula, Onkar Litake, Youwei Liang, Ramtin Hosseini, Shamim Nemati, David O. Wilson, Robert N. Weinreb, Atul Malhotra, Pengtao Xie

Scientific Reports, Nature Portfolio, 2024

Introduces medical paraphrasing to augment data, coupled with a feedback mechanism based on data reweighting and a meta-weight-network.

Bi-level Finetuning with Task-dependent Similarity Structure for Low-resource Training

Sai Ashish Somayajula, Lifeng Jin, Linfeng Song, Haitao Mi, Dong Yu

Findings of the Association for Computational Linguistics (ACL), 2023

pdf / code

A bi-level optimization approach to synthesize gradients of unknown lexical information from known data, leveraging a task-dependent similarity matrix.

A Multi-Level Optimization Framework for End-to-End Text Augmentation

Sai Ashish Somayajula, Linfeng Song, Pengtao Xie

Transactions of the Association for Computational Linguistics (TACL), 2022

pdf / code / video

A data-reweighting-based, domain-adaptive feedback mechanism for end-to-end learning of text augmentation and classification models.