Advancing Biological Research Through Transparent ML/AI Solutions: A Codeathon Experience

This past week, I had the incredible opportunity to join a group of talented individuals at the NCBI Codeathon, where we utilized machine learning and AI to address key challenges in biological research, achieving impressive results in just five days.

Author: Franziska Ahrend

Published: Mar. 03, 2024

This past week, I had the pleasure of participating in the NCBI's "Building Transparent ML/AI Solutions to Advance Biological Research Codeathon." The event was a whirlwind of innovation and collaboration, and the dedication of organizers like Alexa Salsbury, PhD, set a remarkable tone. Over just five days, teams used machine learning and artificial intelligence to tackle a range of biological research challenges, and each team produced an impressive project.

SPARCLE Curation: Automating Protein Architecture Annotation

My focus during the codeathon was on the SPARCLE Curation project. Led by the brilliant minds of Marc Gwadz, PhD and Mingzhang Yang, PhD, our team aimed to assist a small group of experts who manually annotate subfamily protein architectures in the SPARCLE database. SPARCLE, the Subfamily Protein Architecture Labeling Engine [1], is a pivotal resource for the functional characterization and labeling of protein sequences based on their conserved domain architecture.

A domain architecture is the sequential order of conserved domains in a protein sequence. SPARCLE primarily relies on manual curation to add names and functional annotations. However, with over 200,000 uncurated protein architectures and only 42,000 manually curated ones, there is a significant need for automated processes to support and enhance these efforts.
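To make the idea of a domain architecture concrete, here is a minimal Python sketch. It assumes, purely for illustration, that an architecture arrives as a space-separated string of domain identifiers; the identifiers `domA` and `domB` and the helper `parse_architecture` are hypothetical and not taken from SPARCLE. It also works out roughly what share of architectures is curated, using the two counts quoted above.

```python
from typing import Tuple


def parse_architecture(arch: str, sep: str = " ") -> Tuple[str, ...]:
    """Split an architecture string into an ordered tuple of domain identifiers.

    The space-separated input format is an assumption made for this sketch.
    """
    return tuple(token for token in arch.split(sep) if token)


# Hypothetical domain identifiers, for illustration only.
arch_ab = parse_architecture("domA domB")
arch_ba = parse_architecture("domB domA")

# Both proteins contain the same domains ...
same_domains = set(arch_ab) == set(arch_ba)
# ... but in a different order, so they are different architectures.
same_architecture = arch_ab == arch_ba

# Rough curated share from the counts in the text. Since the uncurated count
# is "over 200,000", this is an upper bound on the true share.
curated, uncurated = 42_000, 200_000
curated_share = curated / (curated + uncurated)
```

Using a tuple rather than a set keeps the order of the domains, which is the property that distinguishes an architecture from a plain list of domain hits. With the counts above, the curated share comes out at a bit over 17% at most.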

To address this, our team trained a decision tree and an NLP model on the existing curated architectures. Our goal was to automate the assignment of labels to the remaining uncurated architectures. This approach leverages the curated data to predict names for related architectures, potentially streamlining the curation process and significantly reducing the workload for experts.

Research Questions and Scope

Our research centered around two primary questions:

RQ1: Can we predict a suitable name for related architectures based on their specific domain architectures (SpecificArch) and superfamily architectures (superfamilyarch) using the set of curated domain architecture names (CurName)?

RQ2: Will adding architecture title strings (TitleString) to the input matrix improve the prediction accuracy of curated names (CurName)?
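The two questions can be framed as a small experiment: predict CurName once from the architecture columns alone and once with TitleString added, then compare accuracy. The sketch below shows that setup only. It is not our team's code, and it swaps in a much simpler stand-in for the decision tree and NLP model we trained: a nearest-neighbour lookup using Jaccard similarity over token sets. The rows are invented toy data, and the helper names (`features`, `predict`, `leave_one_out_accuracy`) are hypothetical.

```python
from typing import Dict, List, Set

Row = Dict[str, str]

# Invented toy rows; only the column names come from the research questions.
ROWS: List[Row] = [
    {"SpecificArch": "d1 d2", "superfamilyarch": "s1 s2",
     "TitleString": "protein kinase", "CurName": "kinase"},
    {"SpecificArch": "d1 d3", "superfamilyarch": "s1 s2",
     "TitleString": "protein kinase", "CurName": "kinase"},
    {"SpecificArch": "d4 d2", "superfamilyarch": "s4 s2",
     "TitleString": "ABC transporter", "CurName": "transporter"},
    {"SpecificArch": "d4 d5", "superfamilyarch": "s4 s2",
     "TitleString": "ABC transporter", "CurName": "transporter"},
]


def features(row: Row, use_title: bool) -> Set[str]:
    """Turn a row into a set of prefixed tokens (RQ1 columns, plus title for RQ2)."""
    tokens = {f"spec:{t}" for t in row["SpecificArch"].split()}
    tokens |= {f"sf:{t}" for t in row["superfamilyarch"].split()}
    if use_title:
        tokens |= {f"title:{t.lower()}" for t in row["TitleString"].split()}
    return tokens


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict(train: List[Row], query: Row, use_title: bool) -> str:
    """Return the CurName of the most similar curated row (stand-in model)."""
    q = features(query, use_title)
    best = max(train, key=lambda r: jaccard(features(r, use_title), q))
    return best["CurName"]


def leave_one_out_accuracy(rows: List[Row], use_title: bool) -> float:
    """Hold out each curated row in turn and check whether its name is recovered."""
    hits = 0
    for i, row in enumerate(rows):
        train = rows[:i] + rows[i + 1:]
        hits += predict(train, row, use_title) == row["CurName"]
    return hits / len(rows)


acc_without_title = leave_one_out_accuracy(ROWS, use_title=False)  # RQ1 setup
acc_with_title = leave_one_out_accuracy(ROWS, use_title=True)      # RQ2 setup
```

The toy rows are far too few and too clean to say anything about the real data. The point is the shape of the comparison: the same model and the same held-out rows, with the TitleString tokens as the only difference between the two runs.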

Impact and Future Directions

The work done during this codeathon has the potential to transform the way protein architectures are curated. By automating parts of the process, we could free up valuable time for experts, allowing them to focus on more complex curation tasks and analyses. Although five days is a short time for such an ambitious project, our team leaders are planning to continue the work, noting that the codeathon provided a fantastic foundation to build upon. Moreover, this project highlights the incredible possibilities that arise when machine learning and AI are applied to biological research.

The collaboration, learning, and rapid development during the NCBI Codeathon were truly inspiring.

For more details about SPARCLE and its applications, you can visit NCBI SPARCLE or read the journal article published in Nucleic Acids Research [1].

Further information

I also learned about other impressive biomedical applications and tools of machine learning. There is a comprehensive list of all the projects' GitHub repositories.

Thank you to the organizers, the team leaders, and my peers.

References

[1] Aron Marchler-Bauer, Yu Bo, Lianyi Han, Jane He, Christopher J. Lanczycki, Shennan Lu, Farideh Chitsaz, Myra K. Derbyshire, Renata C. Geer, Noreen R. Gonzales, Marc Gwadz, David I. Hurwitz, Fu Lu, Gabriele H. Marchler, James S. Song, Narmada Thanki, Zhouxi Wang, Roxanne A. Yamashita, Dachuan Zhang, Chanjuan Zheng, Lewis Y. Geer, Stephen H. Bryant, CDD/SPARCLE: functional classification of proteins via subfamily domain architectures, Nucleic Acids Research, Volume 45, Issue D1, January 2017, Pages D200–D203, https://doi.org/10.1093/nar/gkw1129