Identification of Temporal Specificity and Focus Time Estimation in News Documents
Time is a paramount aspect of Information Retrieval (IR): it profoundly influences the interpretation of documents as well as the user's intention and expectation. The temporal patterns in a document or collection of documents play a central role in the effectiveness of IR systems, and discerning them accurately is essential to satisfying the time-based intention of a user. The web contains a plethora of documents, most of which contain divergent temporal patterns. The assimilation of these temporal patterns into IR is referred to as Temporal Information Retrieval (TIR).
An understanding of TIR systems is requisite to address the temporal intention of a user efficiently. For time-specific queries (i.e., queries about an event), a relevant document must relate to the time period of the event. To address this problem, IR systems must (a) determine whether a document is temporally specific (i.e., focuses on a single time period) and (b) determine the focus time of the document (i.e., the time to which its content refers).
This thesis exploits the temporal features of news documents to improve the retrieval effectiveness of IR systems. To the best of our knowledge, this thesis is the first study to focus on the problem of temporal specificity in news documents. It defines and evaluates novel approaches to determine the temporal specificity of news documents. These approaches are then used to classify news documents into three novel temporal classes. Specifically, the study considers 24 implicit temporal features of news documents to classify them into (a) High Temporal Specificity (HTS), (b) Medium Temporal Specificity (MTS), and (c) Low Temporal Specificity (LTS) classes. For this classification, rule-based and Temporal Specificity Score (TSS) based approaches are proposed. In the former approach, news documents are classified using a proposed set of rules based on the temporal features. The latter approach classifies news documents based on a TSS computed from the temporal features. The results of the proposed approaches are compared with four Machine Learning classification algorithms: Bayes Net, Support Vector Machine (SVM), Random Forest, and Decision Tree. The outcomes of the study indicate that the proposed rule-based classifier outperforms the four algorithms by achieving 82% accuracy, whereas TSS-based classification achieves 77% accuracy.
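The idea of score-based classification into the three classes can be illustrated with a minimal sketch. The abstract does not give the TSS formula, the 24 features, or the class thresholds, so everything below is an assumption made for illustration: the score is taken to be the share of year mentions that point to the single most frequent year, and the cut-off values `high` and `low` are hypothetical.

```python
from collections import Counter


def temporal_specificity_score(years):
    """Hypothetical TSS: fraction of year mentions pointing to the dominant year.

    This is an illustrative stand-in, not the score defined in the thesis.
    """
    if not years:
        return 0.0
    counts = Counter(years)
    dominant_count = counts.most_common(1)[0][1]
    return dominant_count / len(years)


def classify_by_tss(years, high=0.7, low=0.4):
    """Map a document's year mentions to HTS, MTS, or LTS.

    The thresholds `high` and `low` are assumed values chosen for the example.
    """
    score = temporal_specificity_score(years)
    if score >= high:
        return "HTS"
    if score >= low:
        return "MTS"
    return "LTS"
```

Under these assumptions, a document whose year mentions concentrate on one year falls into HTS, while a document that spreads its mentions evenly over many years falls into LTS.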
In addition, to determine the focus time of news documents, the thesis considers the temporal nature of news documents. The type and structure of documents influence the performance of focus time detection methods. This thesis proposes different splitting methods that divide a news document into three logical sections, based on the inverted pyramid paradigm of news writing. These methods are the Paragraph Based Method (PBM), the Words Based Method (WBM), the Sentence Based Method (SBM), and the Semantic Based Method (SeBM). Temporal expressions in each section are assigned weights using a linear regression model. A scoring function then calculates the temporal score of each time expression appearing in the document. These temporal expressions are ranked by their temporal score, so that the most suitable expression appears at the top. Two evaluation measures are used to evaluate the performance of the proposed framework: (a) precision score (P@1, P@2) and (b) average error in years. The precision scores at position 1 (P@1) and position 2 (P@2) measure how often the correct focus time is found within the top one and top two positions of the ranked list of focus times, whereas the average error in years is the distance between the estimated year and the actual focus year of a news document. The effectiveness of the proposed method is evaluated on a diverse dataset of news related to popular events. The results reveal that the proposed splitting methods achieve an average error of less than 5.6 years, whereas SeBM achieves precision scores of 0.35 and 0.77 at positions 1 and 2, respectively.
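The pipeline described above (split, weight, score, rank, evaluate) can be sketched as follows. This is a simplified illustration under stated assumptions, not the thesis implementation: only four-digit years are treated as temporal expressions, the split is a crude paragraph-based one (first paragraph, middle paragraphs, last paragraph), and the values in `SECTION_WEIGHTS` are hypothetical placeholders for the weights that the thesis learns with linear regression. All function names are introduced here for the example.

```python
import re
from collections import defaultdict

# Assumption: only four-digit years (1000-2099) count as temporal expressions.
YEAR_PATTERN = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")

# Hypothetical weights for the three sections; the thesis learns these.
SECTION_WEIGHTS = (0.5, 0.3, 0.2)


def split_into_sections(paragraphs):
    """Crude paragraph-based split: first paragraph, middle, last paragraph."""
    lead = paragraphs[:1]
    if len(paragraphs) > 2:
        return lead, paragraphs[1:-1], paragraphs[-1:]
    return lead, paragraphs[1:], []


def rank_focus_years(paragraphs):
    """Score each year by the weights of the sections it occurs in, then rank."""
    scores = defaultdict(float)
    sections = split_into_sections(paragraphs)
    for weight, section in zip(SECTION_WEIGHTS, sections):
        for paragraph in section:
            for year in YEAR_PATTERN.findall(paragraph):
                scores[int(year)] += weight
    # Highest score first; ties are broken by the earlier year.
    return sorted(scores, key=lambda y: (-scores[y], y))


def precision_at_k(ranked_lists, true_years, k):
    """Fraction of documents whose true focus year is in the top k candidates."""
    hits = sum(true in ranked[:k] for ranked, true in zip(ranked_lists, true_years))
    return hits / len(true_years)


def average_error_years(ranked_lists, true_years):
    """Mean absolute distance between the top-ranked year and the true year."""
    errors = [abs(ranked[0] - true)
              for ranked, true in zip(ranked_lists, true_years) if ranked]
    return sum(errors) / len(errors)
```

In this sketch a year mentioned in the opening paragraph contributes more to the score than one mentioned at the end, which mirrors the inverted pyramid assumption that the most important facts appear first.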
The overall findings presented in this thesis demonstrate that the temporal insights contained in documents can be used to enhance the performance of IR systems. Time-aware information retrieval systems can adopt these findings to satisfy user expectations for temporal queries.