کلیدواژهها
|
Proteins
,
Task analysis
,
Computational modeling
,
Training
,
Protein engineering
,
Feature extraction
,
Computational efficiency
|
چکیده
|
Recently, pretrained representations have gained attention in various machine learning applications. These methods involve considerable computational costs for training the model, hence motivating alternative approaches for representation learning. We introduce TripletProt, a new approach for protein representation learning based on the Siamese neural networks. Representation learning of biological entities which capture essential features can alleviate most of the challenges associated with supervised learning in bioinformatics. The most important distinction of our proposed method is relying on the protein-protein interaction (PPI) network. The computational cost of the generated representations for any potential application is significantly lower than comparable methods since the length of the representations is significantly smaller than that in other approaches. TripletProt and in general Siamese Network offer great potentials for the protein informatics tasks and can be widely applied to similar tasks. We evaluate TripletProt comprehensively in protein functional annotation tasks including sub-cellular localization (14 categories) and gene ontology prediction (more than 2000 classes), which are both challenging multi-class multi-label classification machine learning problems. We compare the performance of TripletProt with the state-of-the-art approaches including recurrent language model-based approach (i.e., UniRep), as well as protein-protein interaction (PPI) network and sequence-based method (i.e., DeepGO).
|