TY - GEN
T1 - Heterogeneous Metric Learning with Content-Based Regularization for Software Artifact Retrieval
AU - Wu, Liang
AU - Du, Liang
AU - Liu, Bo
AU - Xu, Guandong
AU - Ge, Yong
AU - Fu, Yanjie
AU - Li, Jianhui
AU - Zhou, Yuanchun
AU - Xiong, Hui
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014/1/1
Y1 - 2014/1/1
N2 - The problem of software artifact retrieval has the goal to effectively locate software artifacts, such as a piece of source code, in a large code repository. This problem has been traditionally addressed through the textual query. In other words, information retrieval techniques will be exploited based on the textual similarity between queries and textual representation of software artifacts, which is generated by collecting words from comments, identifiers, and descriptions of programs. However, in addition to these semantic information, there are rich information embedded in source codes themselves. These source codes, if analyzed properly, can be a rich source for enhancing the efforts of software artifact retrieval. To this end, in this paper, we develop a feature extraction method on source codes. Specifically, this method can capture both the inherent information in the source codes and the semantic information hidden in the comments, descriptions, and identifiers of the source codes. Moreover, we design a heterogeneous metric learning approach, which allows to integrate code features and text features into the same latent semantic space. This, in turn, can help to measure the artifact similarity by exploiting the joint power of both code and text features. Finally, extensive experiments on real-world data show that the proposed method can help to improve the performances of software artifact retrieval with a significant margin.
AB - The problem of software artifact retrieval has the goal to effectively locate software artifacts, such as a piece of source code, in a large code repository. This problem has been traditionally addressed through the textual query. In other words, information retrieval techniques will be exploited based on the textual similarity between queries and textual representation of software artifacts, which is generated by collecting words from comments, identifiers, and descriptions of programs. However, in addition to these semantic information, there are rich information embedded in source codes themselves. These source codes, if analyzed properly, can be a rich source for enhancing the efforts of software artifact retrieval. To this end, in this paper, we develop a feature extraction method on source codes. Specifically, this method can capture both the inherent information in the source codes and the semantic information hidden in the comments, descriptions, and identifiers of the source codes. Moreover, we design a heterogeneous metric learning approach, which allows to integrate code features and text features into the same latent semantic space. This, in turn, can help to measure the artifact similarity by exploiting the joint power of both code and text features. Finally, extensive experiments on real-world data show that the proposed method can help to improve the performances of software artifact retrieval with a significant margin.
UR - http://www.scopus.com/inward/record.url?scp=84936938573&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84936938573&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2014.147
DO - 10.1109/ICDM.2014.147
M3 - Conference contribution
AN - SCOPUS:84936938573
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 610
EP - 619
BT - Proceedings - 14th IEEE International Conference on Data Mining, ICDM 2014
A2 - Kumar, Ravi
A2 - Toivonen, Hannu
A2 - Pei, Jian
A2 - Zhexue Huang, Joshua
A2 - Wu, Xindong
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 14th IEEE International Conference on Data Mining, ICDM 2014
Y2 - 14 December 2014 through 17 December 2014
ER -