Sai Ashish Somayajula
I am a PhD candidate in the ECE department at the University of California, San Diego, advised by Professor Pengtao Xie. The COVID-19 pandemic profoundly impacted me and inspired my research journey. I realized that an efficiently trained language model for COVID-19 could rapidly analyze data and offer valuable insights for tasks such as vaccine development. Motivated by this idea, my research focuses on optimizing language models for healthcare applications. Training these models for healthcare involves unique challenges, such as limited labeled data, the need for models to adapt to the ever-evolving nature of diseases, and stringent privacy standards. My work aims to develop adaptable models that support clinical practice and medical research while ensuring compliance with regulatory requirements.
To achieve these goals, my primary research focuses on synthetic data generation, continual learning, and the safety of large language models (LLMs).
I obtained my Bachelor's degree in Electrical Engineering with a minor in Computer Science and Engineering from the Indian Institute of Technology, Hyderabad. I was fortunate to be advised by Professor Sumohana S. Channappayya and Professor Aditya Siripuram. I secured the second-highest CGPA in the B.Tech program across all departments and twice received the Academic Excellence Award.
Email /
CV /
Bio /
Google Scholar /
LinkedIn /
GitHub
Feel free to reach out to discuss ideas—I'd be glad to connect and brainstorm together!
News
- May 2024: Our work 'Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models' has been accepted to ICML'24!
- March 2024: Two papers accepted to NAACL'24!
- January 2024: Happy to share that our paper "Improving image classification of gastrointestinal endoscopy using curriculum self-supervised learning" has been accepted to the journal - Scientific Reports, Nature Portfolio.
- November 2023: Delighted to announce that I have successfully advanced to candidacy following my qualification exam. Committee: Pengtao Xie, Nuno Vasconcelos, Siavash Mirarab, Julian McAuley. My presentation slides can be found here.
- November 2023: My work on adaptive data augmentation for long COVID-19 literature classification has been accepted to the journal - Scientific Reports, Nature Portfolio. I would like to thank the editor and the reviewers for their insightful feedback on our work.
- November 2023: Gave a talk on my research at the IEEE YP Graduate Seminar Series - Session 4. The keynote speaker was Dr. Rangarajan Sampath, Senior VP and Head of the Center for Innovation in Diagnostics (CID) at Siemens Healthineers. Thank you for sharing your insightful thoughts on how to overcome failure. The video can be found here.
- October 2023: TA'ing for the OG course, Statistical learning, ECE271A with Professor Nuno Vasconcelos!
- June 2023: Started research internship with Input Experience NLP team at Apple Park, Cupertino! It was great working with industry experts, Dr. Vivek Kumar Rangarajan, Leo Xu, Shivangi Mahto and Barada Acharya.
- June 2023: Achieved a 100% acceptance rating as a Teaching Assistant for the course ECE 208 - Computational Evolutionary Biology with Professor Siavash Mirarab, based on student feedback. Thank you, students!
- May 2023: Our work titled "Bi-level Finetuning with Task-dependent Similarity Structure for Low-resource Training" has been accepted at ACL 2023. Thank you Tencent for this opportunity!
- January 2023: Got my Master's degree, Yay!
- November 2022: Successfully passed my preliminary exam. Committee: Professor Pengtao Xie (Chair), Professor Behrouz Touri and Professor Florian Meyer.
- July 2022: Presented my work, "A Multi-Level Optimization Framework for End-to-End Text Augmentation", as an oral presentation at NAACL, 2022. The presentation video is available here.
- June 2022: Started research internship at Tencent AI in Bellevue! It was great working with amazing researchers, Lifeng Jin, Linfeng Song, Haitao Mi, and Dong Yu.
- April 2022: Our work titled "A Multi-Level Optimization Framework for End-to-End Text Augmentation" has been accepted at Transactions of the Association for Computational Linguistics.
- January 2022: TA'ed for Deep Generative models with Professor Pengtao Xie.
- October 2021: I had the privilege of delivering a lecture to approximately 200 students. Please find the recording here.
- September 2021: I'm thrilled to announce that I am TA'ing for the highly acclaimed Linear Algebra course, instructed by Professor Piya Pal.
- May 2021: I mentored high school students through the ENLACE program on a project titled 'Deep Learning Algorithms for Disease Segmentation in Chest X-rays,' guiding them from machine learning fundamentals to the complexities of the UNET architecture. Their impressive work is detailed in the final report.
- November 2020: I have been elected as the PhD representative for the ECE Graduate Student Council.
- September 2020: Started my PhD at the University of California, San Diego.
- June 2020: Graduated from IIT, Hyderabad.
- May 2020: Academic Excellence Award for highest CGPA.
- July 2018: Received the Microsoft Azure Award at the Engineering the Eye 2018 hackathon.
Professional Experience
- Apple, Research Scientist Intern, 2023.
- Tencent AI, Research Scientist Intern, 2022.
Invited Talks
- Indian Institute of Technology Hyderabad, October, 2024.
[Mentorship Slides]
[Research Slides]
- Arya Mazumdar Group, University of California San Diego, August, 2024.
[Slides]
- Statistical Visual Computing Group, University of California San Diego, May, 2024.
[Slides]
Foundational Models
- Foundational Model for DNA 3-D Structure Prediction:
Developing a foundational model to predict missing gene coordinates, improving the accuracy of reconstructing the 3-D structure of DNA by analyzing both sequence data and folding patterns.
Watermarking in LLMs
With the proliferation of LLMs in generating synthetic datasets, distinguishing between human-curated and machine-generated texts is crucial to avoid misinformation. This distinction is particularly vital in healthcare applications, where the authenticity and reliability of data are of utmost importance. Statistical watermarking of LLM-generated text makes such content reliably detectable. However, prior works struggle to achieve both high detectability and high semantic quality of the generated text after watermarking. We explore ways to improve both simultaneously.
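To make the idea of statistical detection concrete, here is a minimal sketch of one widely used scheme, a "green-list" watermark with a z-test detector. It is an illustration only and is not the token-specific method from the ICML'24 paper. The function names, the hash-based seeding, the toy vocabulary size, and the green fraction `gamma` are all assumptions made for this example.

```python
import hashlib
import math
import random


def green_list(prev_token: int, vocab_size: int, gamma: float) -> set:
    """Pick a pseudo-random gamma fraction of the vocabulary, seeded by the previous token."""
    digest = hashlib.sha256(str(prev_token).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))


def generate_watermarked(length: int, vocab_size: int, gamma: float, seed: int = 0) -> list:
    """Toy 'hard' watermark: every token after the first is drawn from the green list."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab_size)]
    for _ in range(length - 1):
        greens = sorted(green_list(tokens[-1], vocab_size, gamma))
        tokens.append(rng.choice(greens))
    return tokens


def detection_z_score(tokens: list, vocab_size: int, gamma: float = 0.5) -> float:
    """z-score of the green-token count against the no-watermark null hypothesis."""
    scored = len(tokens) - 1  # the first token has no predecessor to seed a green list
    if scored < 1:
        raise ValueError("need at least two tokens")
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))


if __name__ == "__main__":
    text = generate_watermarked(length=201, vocab_size=1000, gamma=0.5)
    print(f"z-score of watermarked sequence: {detection_z_score(text, 1000, 0.5):.2f}")
```

Under the null hypothesis (no watermark), each token is green with probability `gamma`, so a large positive z-score is evidence of watermarking. Restricting every token to the green list maximizes detectability but limits word choice, which is the detectability versus semantic-quality tension described above.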
Continual Learning of Language Models
As diseases evolve and demographic data changes, language models must adapt to new data without losing accuracy on prior data. Traditional fine-tuning of LLMs on new, emerging data can cause the model to lose its knowledge of older data. We explore how to efficiently select a sub-network that, when fine-tuned on new data, achieves maximum performance without significantly losing prior knowledge.
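As a rough illustration of sub-network fine-tuning, the sketch below freezes most parameters and updates only a selected subset. The selection rule used here (largest gradient magnitude on the new data) is a placeholder assumption for the example, not the selection criterion studied in this research, and the function names are hypothetical.

```python
def select_subnetwork(scores: list, fraction: float) -> list:
    """Return a boolean mask marking the top `fraction` of parameters by score."""
    k = max(1, int(len(scores) * fraction))
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [False] * len(scores)
    for i in top:
        mask[i] = True
    return mask


def masked_sgd_step(params: list, grads: list, mask: list, lr: float) -> list:
    """Apply one SGD step to the selected parameters; all others stay frozen."""
    return [p - lr * g if m else p for p, g, m in zip(params, grads, mask)]


if __name__ == "__main__":
    params = [1.0, 2.0, 3.0, 4.0]
    grads = [0.1, -2.0, 0.5, 1.0]  # gradients computed on the new data
    # Placeholder importance score: absolute gradient magnitude.
    mask = select_subnetwork([abs(g) for g in grads], fraction=0.5)
    print(mask)
    print(masked_sgd_step(params, grads, mask, lr=0.1))
```

Because the frozen parameters are untouched, whatever they encode about older data is preserved exactly; the open question is which subset to unfreeze.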
Parameter Efficient Fine-tuning (PEFT) methods
As models scale, for example from RoBERTa-large's 355 million parameters to GPT-3's 175 billion, full fine-tuning becomes computationally expensive. PEFT methods are invaluable in such settings, with LoRA being a particularly effective one. However, LoRA uses a predefined rank for each update matrix. We explore whether the optimal rank of these update matrices can be learned for a downstream task to improve performance.
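The role of the rank can be seen from a simple parameter count. LoRA replaces a full update to a `d_out x d_in` weight matrix with the product of two low-rank factors, `B` (`d_out x r`) and `A` (`r x d_in`). The sketch below compares trainable parameters for a few ranks; the layer size and the rank values are arbitrary choices for illustration.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)


if __name__ == "__main__":
    d_out = d_in = 1024  # illustrative layer size
    full = full_update_params(d_out, d_in)
    for rank in (4, 8, 16):
        lora = lora_update_params(d_out, d_in, rank)
        print(f"rank={rank:2d}: {lora} trainable vs {full} full ({100 * lora / full:.2f}%)")
```

The cost grows linearly with the rank, so a fixed rank applied to every matrix either wastes parameters on layers that need little adaptation or under-serves layers that need more, which is what motivates learning the rank per update matrix.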
Data-Augmentation
In healthcare, labeled training data is often limited—especially in cases involving emergent diseases, where timely and extensive data collection poses significant challenges. Data-augmentation methods can be invaluable in such scenarios. Within this, I explored two directions: 1) Can we leverage feedback from the downstream model to improve augmentation? 2) Can we synthesize the gradients of unseen words in the training dataset in a task-driven optimization without any external knowledge?
Improving Long COVID-Related Text Classification: A Novel End-to-End Domain-Adaptive Paraphrasing Framework
Sai Ashish Somayajula,
Onkar Litake,
Youwei Liang,
Ramtin Hosseini,
Shamim Nemati,
David O. Wilson,
Robert N. Weinreb,
Atul Malhotra,
Pengtao Xie
Scientific Reports, Nature Portfolio, 2024
We introduce medical paraphrasing to augment data, coupled with a feedback mechanism. The approach uses a data-reweighting-based multi-level optimization framework with a meta-weight-network to improve the classification performance on long COVID literature.