Title
Multi-Label Classification of Computer Science Research Papers using Papers’ Metadata
Abstract
In scientific literature, a publication is a means of expressing a scientific contribution within the specific context of a discipline. This is captured by the well-known quote, “Communication in science is realized through research publications”. Over the past decades, the number of documents available in digital form has increased tremendously, to the point that their rate of production doubles every five years. A large share of these documents consists of research publications that report successive discoveries and inventions in science. This stream of research publications has never been interrupted; on the contrary, it has gained significant momentum. Almost 28,100 active scholarly journals publish about 2.5 million articles per year. These articles are searched over the Internet via search engines, digital libraries, and citation indexes. However, retrieving the research papers relevant to a user query remains a pipe dream, because scientific documents are not indexed according to a subject classification hierarchy such as the ACM classification system for Computer Science. This has motivated researchers to propose innovative approaches to research paper classification. Such classification benefits not only the retrieval of relevant research papers but also many other application scenarios, for example when: (1) journal or conference editors want to identify reviewers; (2) a research scholar wishes to identify a suitable supervisor; (3) authors intend to submit their research papers; and (4) one seeks to analyze trends, find experts, or recommend relevant papers.
In this dissertation, the author has critically reviewed the literature on research paper classification and identified the following research gaps, which this dissertation addresses: (1) Existing research paper classification schemes rely on the content of papers, and that content is often unavailable, which makes those schemes inapplicable. Alternative features are needed that can classify research articles with results close to those of content-based approaches. (2) The majority of state-of-the-art approaches focus on single-label classification, whereas experiments on a comprehensive dataset revealed that a research article may belong to multiple categories. A multi-label classification system is needed that uses the best available alternative to content-based features and achieves comparable or better accuracy. (3) Existing multi-label classification schemes classify citations into a limited number of categories, whereas in the Computer Science domain the ACM classification system contains 11 classes at its root level. An approach that can classify research articles at least to the root level of the ACM classification system is needed. The objective of this dissertation is to use freely available metadata in the best possible way to perform multi-label classification, and to evaluate the extent to which metadata-based features can perform comparably to content-based approaches. We have proposed, developed, and evaluated techniques based on metadata such as the Title, the Keywords, the Title and Keywords combined, and the References of research papers, and we report the results achieved. To classify research articles into multiple labels on the basis of metadata, we have harnessed the metadata in diverse ways, for example: (1) Multi-label Document Classification using Papers’ Metadata (Title & Keywords); and (2) Multi-label Document Classification based on Research Articles’ References.
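The idea of assigning several categories to a paper from its Title and Keywords alone can be illustrated with a minimal sketch. It is not the technique developed in this dissertation: the per-category term profiles, the `ratio` threshold, the stopword list, and the toy training papers are all assumptions made for illustration, and the single letters stand for ACM root-level classes (D Software, H Information Systems, I Computing Methodologies).

```python
import re
from collections import Counter, defaultdict

# Small illustrative stopword list (an assumption, not from the dissertation).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "using", "with"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def train(papers):
    """Build one term-frequency profile per category.

    papers: iterable of (title, keywords, labels) tuples. A paper carrying
    several labels contributes its terms to every one of those categories.
    """
    profiles = defaultdict(Counter)
    for title, keywords, labels in papers:
        terms = tokenize(title + " " + " ".join(keywords))
        for label in labels:
            profiles[label].update(terms)
    return profiles


def classify(profiles, title, keywords, ratio=0.5):
    """Return every category whose score is at least `ratio` times the best score.

    The score of a category is the summed profile frequency of the paper's
    terms, normalized by the size of the profile. Returning all categories
    above a relative threshold is what makes the output multi-label.
    """
    terms = tokenize(title + " " + " ".join(keywords))
    scores = {}
    for label, profile in profiles.items():
        total = sum(profile.values())
        scores[label] = sum(profile[t] for t in terms) / total if total else 0.0
    best = max(scores.values(), default=0.0)
    if best == 0.0:
        return set()  # no term overlap with any category
    return {label for label, score in scores.items() if score >= ratio * best}


# Hypothetical toy training data; titles and keywords are invented.
TRAIN = [
    ("Indexing methods for digital libraries",
     ["information retrieval", "indexing"], {"H"}),
    ("Neural networks for image recognition",
     ["machine learning", "neural networks"], {"I"}),
    ("Refactoring tools for object-oriented programs",
     ["software engineering", "refactoring"], {"D"}),
    ("Learning to rank for information retrieval",
     ["machine learning", "ranking", "retrieval"], {"H", "I"}),
]

if __name__ == "__main__":
    profiles = train(TRAIN)
    labels = classify(profiles, "Machine learning for document retrieval",
                      ["information retrieval", "learning"])
    print(sorted(labels))
```

A relative threshold is used here instead of a fixed cut-off so that a paper strongly associated with two categories receives both labels, while categories with no shared terms are excluded.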
These techniques have been evaluated on two different and diverse datasets. One dataset comes from an online journal, the Journal of Universal Computer Science (J.UCS), and the other is a benchmark dataset comprising research papers published by the ACM. Using only freely available metadata, these techniques yield encouraging results (i.e., an accuracy of 88%) in comparison with state-of-the-art techniques on both datasets.
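Accuracy has more than one definition in the multi-label setting, and the abstract does not state which one underlies the reported figure. The sketch below shows two common example-based choices as an assumption about how such a score could be computed: the average overlap between true and predicted label sets, and the stricter exact-match ratio. The function names and sample label sets are hypothetical.

```python
def multilabel_accuracy(true_sets, predicted_sets):
    """Average of |truth & prediction| / |truth | prediction| over all papers."""
    if not true_sets or len(true_sets) != len(predicted_sets):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for truth, predicted in zip(true_sets, predicted_sets):
        union = truth | predicted
        # Two empty label sets agree completely, so count them as 1.0.
        total += len(truth & predicted) / len(union) if union else 1.0
    return total / len(true_sets)


def exact_match_ratio(true_sets, predicted_sets):
    """Fraction of papers whose predicted label set equals the true set."""
    if not true_sets or len(true_sets) != len(predicted_sets):
        raise ValueError("need two non-empty lists of equal length")
    hits = sum(1 for t, p in zip(true_sets, predicted_sets) if t == p)
    return hits / len(true_sets)


if __name__ == "__main__":
    # Invented label sets for three papers.
    truth = [{"H", "I"}, {"D"}, {"H"}]
    predicted = [{"H"}, {"D"}, {"H", "I"}]
    print(multilabel_accuracy(truth, predicted))
    print(exact_match_ratio(truth, predicted))
```

The overlap-based score gives partial credit when only some of a paper's categories are predicted, whereas the exact-match ratio counts such a prediction as wrong.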