On the Effectiveness of Private Information Retrieval Protocols


Due to the exponential growth of information on the Internet, Web Search Engines (WSEs) have become indispensable for the e ective retrieval of information. In order to provide the results most relevant to the user, WSEs store the users’ pro le that may contain sensitive information including the users age, gender, health condition, personal interests, religious or political aliation, etc. However, this raises serious concerns for the privacy of the user since the identity of a user may get exposed and misused by third parties. To address the issue of privacy infringement while using WSE, researchers have proposed several techniques such as anonymizing networks, pro le obfuscation, private information retrieval (PIR) protocols. In the anonymizing network, the user’s query is forwarded to the WSE through a chain of routers. In pro le obfuscation technique, fake queries are forwarded with the user’s query in order to mislead the WSE. While one well-known solution to preserve privacy is a Private Information Retrieval (PIR) protocol called Useless User Pro le (UUP) that issues the queries via Peer nodes, thereby hiding user’s identity from the WSE. Despite the fact that the aforementioned methods improve the user privacy, yet some previous studies using a machine learning algorithm and user pro le show that an adverse WSE is able to break pro le obfuscation and anonymizing network methods. However, it is not clear if an adverse WSE is able to break UUP using machine learning techniques.

This thesis investigates the protection o ered by UUP. To evaluate the e ectiveness of UUP in privacy protection, we propose QuPiD (Query Pro le Distance) Attack. QuPiD Attack is a machine learning based attack that determines the distance between the user’s Pro le (web search history) and upcoming query using a novel feature vector. The proposed feature vector is composed of a set of numeric values of 10 major topics acquired from uClassify service. The results show that the proposed QuPiD attack associates more than 40% queries to the correct user with a precision of over 70%. Moreover, during the investigations, the proposed QuPiD attack behave unexpectedly in some cases, a ecting its precision and recall. Upon detail investigation, three reasons are found responsible for that behavior: (I) variable similarity score between incoming query and user pro le, (ii) lack of traces of incoming query in the user pro le, and (iii) presence of more than one matching user pro les. We call this behavior as ProQSim (Pro le Query Similarity) E ect.

Based on the results, it is concluded that UUP does not provide satisfactory protection to users. We, therefore, develop PEM (Privacy Exposure Measure), a technique that minimizes the privacy exposure of a user while using the PIR protocols. PEM assesses the similarity between the user’s pro le and query before posting to WSE and assists the user to avoid further privacy exposure. The privacy evaluation experiments with PEM (Privacy Exposure Measure) demonstrates that a user pro le created with a web search engine through PEM is 95.97% different as compare to the usual user’s pro le and thus o ers more privacy to the user even in the case of machine-learning attack for mpeT = 10%. (Maximum privacy exposure (mpeT) is the threshold value set by the user for exposure of his/her pro le to the web search engine). We have established empirically that our proposed privacy protection mechanisms (PEM) signi cantly improves performance with regard to preserving privacy. Finally, we conclude this thesis with our perspectives and point out some key future research issues in the areas of PIR protocols, adverse models, and our suggested module PEM.

Download full paper