Received: 20 April 2017 Revised: 3 January 2018 Accepted: 21 January 2018
Query expansion based on statistical learning from code
State Key Laboratory of Software
Engineering, Computer School, Wuhan
University, Wuhan, China
College of Information Engineering of
North China University of Water
Resources and Electric Power, Zhengzhou,
Deepin Technologies Co Ltd, Wuhan,
Guoqing Wu, State Key Laboratory of
Software Engineering, Computer School,
Wuhan University, Wuhan, China.
National Natural Science Foundation of
China, Grant/Award Number: 61170022
Thesaurus-based, code-related, and software-specific query expansion techniques
are the main contributions in free-form query search. However, these techniques
still cannot place the most relevant query result in the first position because
they lack the ability to infer, from a given query, the expansion words that
represent the user's needs. In this paper, we observe that code changes can
imply what users want and propose a novel query expansion technique with code
changes (QECC). It exploits (changes, contexts) pairs from changed methods. On
the basis of statistical learning from these pairs, it can infer code changes
for a given query. In this way, it expands a query with code changes and
recommends the query results that meet actual needs perfectly. In addition, we
implement InstaRec to perform QECC and evaluate it with 195 039 change commits
from GitHub and our code tracker. The results show that QECC can improve the
precision of 3 code search algorithms (ie, IR, Portfolio, and VF) by up to 52%
to 62% and outperform the state-of-the-art query expansion techniques (ie, query
expansion based on crowd knowledge and CodeHow) by 13% to 16% when the top 1
result is inspected.
code changes, code search, information retrieval, software reuse, statistical learning, query
As code repositories (eg, CodePlex) become available, code search has become a
common activity during software development. In particular, users are more
interested in free-form query search, which allows them to type natural language
keywords to define queries.
The performance of this search strongly depends on word matches between queries
and query results. However, queries and query results often do not use the same
words. Moreover, queries are usually short. Sadowski et al reported that the
average number of words per query is 1.85 for the queries submitted to Google
search.
Obviously, it is not an easy task to formulate a good query. This motivates the query
reformulates a query with synonyms in a word thesaurus. However, Lu et al
showed that the general English-based similarity measurements of WordNet could not effectively suggest similar words
Softw Pract Exper. 2018;48:1333–1351. wileyonlinelibrary.com/journal/spe Copyright © 2018 John Wiley & Sons, Ltd. 1333