HIT5 Q&A Bot is a locally hosted question-answering bot. Questions and answers crawled from Reddit are pre-indexed into an Elasticsearch cluster, and the bot communicates with the front end through a REST API. The bot also uses a Word2Vec model to support synonym search.
Questions and answers from social question-answering sites can be used to automatically answer questions from users. This study explores options for matching user questions against existing question-answer pairs. We used the Elasticsearch search engine and experimented with matching against different parts of the data, as well as with using synonym databases to extend our queries. We conclude that matching against answer text might be beneficial, but that using synonyms makes answer relevance significantly worse.
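To make the matching options concrete, the following is a minimal sketch of how a search request body could be assembled. It assumes an index whose documents have `question` and `answer` text fields; those field names, the `build_query` helper, and the synonym dictionary are hypothetical and not taken from the project. The only Elasticsearch feature relied on is the `multi_match` query from the Query DSL.

```python
from typing import Dict, Iterable, List, Optional


def build_query(user_question: str,
                fields: Iterable[str] = ("question", "answer"),
                synonyms: Optional[Dict[str, List[str]]] = None,
                size: int = 5) -> dict:
    """Build an Elasticsearch request body for a user question.

    `fields` selects which parts of each question-answer pair to match
    against (field names are assumed, not the project's real mapping).
    `synonyms` optionally maps a query term to extra terms that are
    appended to the query text.
    """
    terms = user_question.lower().split()
    expanded = list(terms)
    for term in terms:
        for syn in (synonyms or {}).get(term, []):
            if syn not in expanded:  # avoid repeating a term
                expanded.append(syn)
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": " ".join(expanded),
                "fields": list(fields),
            }
        },
    }


# Match against answer text only, with one synonym added to the query.
body = build_query("fast laptop", fields=["answer"],
                   synonyms={"fast": ["quick"]})
```

Passing a different `fields` list switches between matching against question text, answer text, or both, and omitting `synonyms` sends the user question unexpanded, which makes it straightforward to compare the configurations described above.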
Word2vec is a tool for generating word embeddings. Using a group of related models, it learns from a text corpus to represent every word as a vector. The word vectors are positioned in the vector space so that words sharing common contexts in the corpus lie close to one another [4]. After training the model on all datasets, we obtain the top 1, top 2, and top 3 most similar words for each word in the corpus. Because the similarity between two words is based on their shared contexts in the corpus, a similar pair is not necessarily a pair of synonyms. We therefore consider only adjectives and nouns, and take their most similar words as synonym candidates. The results serve as a reference for synonyms.
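The lookup step described above can be sketched in plain Python. This is an illustration only: it does not train Word2vec, and the toy vectors, the part-of-speech tags, and the `synonym_candidates` helper are made up for the example. It assumes the part-of-speech filter is applied to the query word; the original pipeline may also filter the candidates.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def synonym_candidates(word: str,
                       vectors: Dict[str, Sequence[float]],
                       pos_tags: Dict[str, str],
                       top_n: int = 3,
                       allowed: Sequence[str] = ("NOUN", "ADJ")) -> List[str]:
    """Return the top_n most similar words, for nouns and adjectives only."""
    if pos_tags.get(word) not in allowed:
        return []  # other parts of speech are skipped
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:top_n]]


# Toy 2-dimensional embeddings; real Word2vec vectors have far more dimensions.
toy_vectors = {
    "fast": [0.9, 0.1],
    "quick": [0.8, 0.2],
    "slow": [-0.9, 0.1],
    "run": [0.1, 0.9],
}
toy_pos = {"fast": "ADJ", "quick": "ADJ", "slow": "ADJ", "run": "VERB"}

top1 = synonym_candidates("fast", toy_vectors, toy_pos, top_n=1)
top3 = synonym_candidates("fast", toy_vectors, toy_pos, top_n=3)
```

Calling the helper with `top_n` set to 1, 2, or 3 gives the three candidate lists mentioned above. Because the ranking reflects shared context rather than meaning, the lower-ranked candidates can include related words that are not synonyms.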