AI for Privacy Policies

Image created by iuriimotov – www.freepik.com

As the world continues to digitize, data remains vital to the evolution of the digital ecosystem, particularly in improving access to services. With this comes the need to protect collected data and the privacy of the individuals who provide it. Personal information collected about an individual may include name, address, email, phone number, age, sex, marital status, race, nationality, religious beliefs and more, depending on the context in which the data is required.[1] In a complex environment where so much depends on data, protecting that information becomes increasingly important, and privacy policies are central to that effort.[2] Data protection and the right to privacy are safeguarded through numerous legal instruments and policies at the international, regional and national levels.

A privacy policy is a statement or legal document that sets out how a company or website collects, handles and processes the data of its customers and visitors. It explicitly describes whether that information is kept confidential, or is shared with or sold to third parties. By accepting a website’s privacy policy, which users often do blindly, a user agrees to release data under the conditions the policy states. To enable users to make informed decisions, policies should be as complete as possible. However, privacy policies often fail to inform users clearly, concisely and adequately: users find them too long, complex and difficult to understand, and as a result are often unaware of what the policies actually cover and which aspects matter to them.[3]

Artificial Intelligence has been used, with a few notable successes, to simplify the reading process by breaking a policy down into its important parts and rating whether it provides the level of protection a user wants.

In 2012, a group of researchers proposed using machine learning to assist users by automatically evaluating the completeness of natural language privacy policies and by providing a structured way to browse policy content. A natural language privacy policy is one written in natural language, the predominant means by which operators of websites and online services communicate privacy practices to their users.[4] Natural language in this context refers to any language that has evolved naturally in humans through use and repetition, without conscious planning or premeditation,[5] or language that has developed in the usual way as a method of communication between people, as distinguished from language that has been created, for example, for computers.[6]

The project aimed to develop: (i) an automatic completeness analyzer to verify which privacy categories are covered by a policy, (ii) a grading system to assess the level of completeness according to the categories covered, (iii) a mechanism allowing the user to browse only the text related to categories of interest and (iv) an automatic privacy policy detector to trigger the policy analysis when a user visits a privacy policy web page. Under this research, a complete privacy policy is one that addresses the most important privacy principles defined by the Organisation for Economic Co-operation and Development (OECD) and other privacy legislation. To assess a policy’s level of completeness, the researchers defined the privacy categories a policy should cover, based on privacy directives, regulations and common practice. Text categorization and machine learning techniques are then used to determine which paragraphs of the policy belong to which category, and the policy is graded on the categories covered.

Once the tool detects that the user is viewing a privacy policy, it presents a grade that gives immediate information about the policy’s quality. The user can then inspect the policy in a structured way, choosing to read only the paragraphs belonging to the categories of interest. The specific privacy categories used are:[7]

  • Advertising – explains how the website manages advertisements (e.g. banners, sms, e-mail), and whether advertising is controlled by the website itself or by third parties.
  • Choice and Access (C&A) – provides information about the user’s privacy choices, such as opt-in/opt-out options, and user’s rights to access, modify and/or delete the information collected by the website.
  • Children – explains the company’s policy regarding the collection and use of personal information of children.
  • Collection – explains how and what kind of personal information may be collected by the website.
  • Cookies – explains whether the website makes use of cookies. It may also state the purpose(s) of using cookies, and the information the cookies store.
  • Location – explains how the website manages the user’s location information.
  • Retention – explains the purpose(s), and duration of the retention of personal data.
  • Safe Harbor – explains the website’s participation in, and self-compliance with, the U.S.-EU/Swiss Safe Harbor Framework.
  • Security – explains the security technologies applied by the website, e.g. use of SSL on the website or access control policies regulating employees’ practices.
  • Share – explains whether, and under which conditions, the website will share user’s information.
  • TRUSTe – explains whether the website has been awarded TRUSTe’s Privacy Seal. This seal signifies that the website’s privacy policy and practices are compliant with the TRUSTe program’s requirements such as transparency, accountability and user’s choice regarding collection and use of personal data.
  • External Links – warns the users about the fact that the current privacy policy does not cover third party websites reachable with external links.
  • Rights to view records – refers to users’ right to request access to the records of their personal information disclosed to third parties.
  • Processing – explains where the personal data is transferred to, stored and processed (the country where storing and processing take place impacts which regulations apply).
  • Policy Change – explains how updates to the privacy policy are managed, and whether and how the users will be informed of such changes.
  • Contact – provides company’s contact information, such as the registered office, or the address users can use for further questions or complaints.

The final result, the completeness grade, reflects only the content of a policy as measured against the privacy categories; it gives no guarantee that the website actually enforces the policy. Further, a high completeness grade signifies only that the policy covers most of the categories.[8]
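The grading idea can be illustrated with a minimal Python sketch. It is not the researchers’ implementation: the trained text classifier is replaced by a hypothetical keyword lookup (`CATEGORY_KEYWORDS`), only six of the categories are included, and the grade is assumed to be the plain fraction of categories covered. All names in the sketch are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical keyword cues standing in for the trained text classifier.
CATEGORY_KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "Collection": ("collect",),
    "Cookies": ("cookie",),
    "Share": ("share", "third part"),
    "Security": ("ssl", "encrypt"),
    "Retention": ("retain", "retention"),
    "Contact": ("contact us",),
}


def categorize(paragraph: str) -> List[str]:
    """Return every category whose cue words appear in the paragraph."""
    text = paragraph.lower()
    return [cat for cat, cues in CATEGORY_KEYWORDS.items()
            if any(cue in text for cue in cues)]


def grade_policy(paragraphs: Sequence[str]) -> Tuple[float, Dict[str, List[str]]]:
    """Group paragraphs by category and grade the share of categories covered."""
    by_category: Dict[str, List[str]] = {cat: [] for cat in CATEGORY_KEYWORDS}
    for paragraph in paragraphs:
        for cat in categorize(paragraph):
            by_category[cat].append(paragraph)
    covered = sum(1 for texts in by_category.values() if texts)
    return covered / len(CATEGORY_KEYWORDS), by_category


policy = [
    "We collect your name and email address when you register.",
    "We use cookies to remember your preferences.",
    "We never share your data with third parties.",
]
grade, by_category = grade_policy(policy)
print(f"Completeness grade: {grade:.2f}")  # 3 of 6 categories covered -> 0.50
print(by_category["Cookies"])              # the paragraphs a cookie-minded user would browse
```

The sketch also makes the limitation plain: the grade rises whenever a category is mentioned at all, whatever the policy actually says about it.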

In September 2019, Javi Rameerez, a developer, built an app to gather a collective, subjective human perspective on privacy. The app, Guard,[9] is based on a recurrent neural network: it reads and analyzes privacy terms, gives an overall score and a breakdown of the main threats in a privacy policy, and can also list the total number of threats and any past privacy scandals. Rameerez relies on recent work in Recurrent Neural Networks (RNNs) to build a language model theoretically capable of telling privacy-friendly sentences apart from dangerous ones. For example, it can predict that “we sell your data” is a threat, whereas “we anonymize your data before it reaches us” is good for privacy.[10] The app is still in its experimental phase and has yet to be released for general use. It has, however, been used to analyze the privacy policies of commonly used services such as Twitter, Instagram, Tinder, Mozilla, Netflix, Waze, LinkedIn, Telegram, YouTube, Reddit, WhatsApp, Spotify, Tumblr, Mailchimp, Pinterest, AliExpress, Shazam, Duolingo, Skyscanner and Change.org.[11]
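How sentence-level predictions might roll up into an overall score and a list of threats can be sketched in a few lines of Python. Guard’s recurrent neural network is not reproduced here: a hypothetical phrase table (`THREAT_PHRASES`) stands in for the model, and the scoring formula (the percentage of sentences not flagged) is an assumption made purely for illustration, not Guard’s actual method.

```python
from typing import List, Tuple

# Hypothetical phrase table standing in for a trained language model.
THREAT_PHRASES = (
    "sell your data",
    "share your data with advertisers",
    "track your location",
)


def is_threat(sentence: str) -> bool:
    """Flag a sentence if it contains one of the hypothetical threat phrases."""
    lowered = sentence.lower()
    return any(phrase in lowered for phrase in THREAT_PHRASES)


def score_policy(sentences: List[str]) -> Tuple[float, List[str]]:
    """Return an assumed 0-100 score and the sentences flagged as threats."""
    threats = [s for s in sentences if is_threat(s)]
    if not sentences:
        return 100.0, threats
    return 100.0 * (1 - len(threats) / len(sentences)), threats


score, threats = score_policy([
    "We sell your data.",
    "We anonymize your data before it reaches us.",
])
print(score)    # 50.0: one of the two sentences is flagged
print(threats)  # ['We sell your data.']
```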

Researchers at Switzerland’s Federal Institute of Technology at Lausanne (EPFL), the University of Wisconsin and the University of Michigan[12] have also used AI to read and analyse privacy policies. They launched Polisis[13] (short for “privacy policy analysis”), a website and browser extension that uses their machine-learning-trained engine to automatically read and analyze any online service’s privacy policy. To build Polisis, the researchers trained their AI on a set of 115 privacy policies that had been analyzed and annotated in detail, as well as 130,000 more privacy policies from apps on the Google Play Store. The annotated fine print allowed the software to learn how privacy policy language translates into simpler, more straightforward statements about data collection and sharing. The larger corpus of raw, unannotated policies supplemented that training: it taught the engine terms that did not appear in the 115 annotated policies by giving it enough examples to compare passages and find matching context.[14] Polisis’ modularity and utility rest on two components: a privacy-centric language model built from the 130,000 policies, and a novel hierarchy of neural network classifiers that accounts for both high-level aspects and fine-grained details of privacy practices.[15] Polisis can interpret a privacy policy with 88.4% accuracy once its results are translated into broader statements about a service’s information collection practices.

Polisis treats a privacy policy as a list of semantically coherent segments (i.e., groups of consecutive sentences) and uses a taxonomy of privacy data practices. It is composed of three layers:[16] the Application Layer, the Data Layer and the Machine Learning (ML) Layer. The Application Layer provides fine-grained information about the privacy policy, giving users high modularity in posing their queries. In this layer, a Query Module receives the user’s query about a privacy policy. These inputs are forwarded to the lower layers, which extract the privacy classes embedded in the query and in the policy’s segments. To resolve the query, the Class-Comparison module identifies the segments whose privacy classes match those of the query and passes the matched segments (with their predicted classes) back to the application. The Data Layer first scrapes the policy’s webpage, then partitions the policy into semantically coherent and adequately sized segments using a segmenter component. Each of the resulting segments can be consumed independently by both humans and programming interfaces. The Machine Learning Layer is built in two stages: (1) an unsupervised stage, in which domain-specific word vectors (i.e., word embeddings) for privacy policies are built from unlabeled data, and (2) a supervised stage, in which a novel hierarchy of neural-network-based privacy-text classifiers is trained on top of those word vectors. These classifiers power the Segment Classifier and the Query Analyzer.[17]
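The flow through the three layers (segment the policy, predict privacy classes for the query and for each segment, then return the segments whose classes overlap with the query’s) can be sketched in Python. This is a simplified illustration under stated assumptions, not the Polisis code: the neural classifiers are replaced by a hypothetical cue-word lookup, the class names are invented, and segmentation is reduced to splitting on blank lines.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical classes and cue words; the real system predicts classes with
# neural classifiers trained on privacy-specific word embeddings.
CLASS_CUES: Dict[str, Tuple[str, ...]] = {
    "first-party-collection": ("collect", "gather"),
    "third-party-sharing": ("share", "sell", "third part"),
    "data-retention": ("retain", "keep", "store"),
}


def predict_classes(text: str) -> Set[str]:
    """Stand-in for the ML Layer: map a text to a set of privacy classes."""
    lowered = text.lower()
    return {cls for cls, cues in CLASS_CUES.items()
            if any(cue in lowered for cue in cues)}


def segment_policy(policy_text: str) -> List[str]:
    """Stand-in for the Data Layer: split the policy on blank lines."""
    return [part.strip() for part in policy_text.split("\n\n") if part.strip()]


def answer_query(query: str, policy_text: str) -> List[Tuple[str, Set[str]]]:
    """Class comparison: keep segments sharing at least one class with the query."""
    query_classes = predict_classes(query)
    matches = []
    for segment in segment_policy(policy_text):
        segment_classes = predict_classes(segment)
        if query_classes & segment_classes:
            matches.append((segment, segment_classes))
    return matches


policy_text = (
    "We collect your email address.\n\n"
    "We may share information with advertising partners.\n\n"
    "We retain logs for 90 days."
)
for segment, classes in answer_query("Do you sell or share my data?", policy_text):
    print(classes, "->", segment)
```

Because the query and the segments pass through the same class predictor, the application never compares raw text; it only compares predicted classes, which is what allows different applications to be built on the same lower layers.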

Privacy policies are a means for companies and service providers to inform users about their data collection and sharing practices. This information goes a long way in helping users make informed decisions about the access they grant to, and the use of, the data they share. The difficulty of comprehending the policies, however, stands in the way of such decisions. As the initiatives highlighted above show, using AI to address the barriers created by complex language and length is likely to make privacy policies, and the implications of agreeing to them, easier to understand, allowing users to make better-informed decisions about the private data they give companies and service providers access to. Although these technologies are still in their early stages, work is under way to refine them, expand their coverage and improve their accuracy. In more ways than one, the use of AI to simplify privacy policies plays a big role in ensuring that those policies serve their intended role in the protection of data.

[1] Maria P, ‘Privacy Policy.’ (Privacy Policies, 2020) <https://www.privacypolicies.com/blog/privacy-policy-template/#:~:text=Conclusion-,What%20is%20a%20Privacy%20Policy%3F,or%20sold%20to%20third%20parties.>

[2] Maria P, ‘Privacy Policy.’ (Privacy Policies, 2020) <https://www.privacypolicies.com/blog/privacy-policy-template/#:~:text=Conclusion-,What%20is%20a%20Privacy%20Policy%3F,or%20sold%20to%20third%20parties.>

[3] Elisa Costante, Milan Petkovic, Yuanhao Sun, Jerry den Hartog, ‘A Machine Learning Solution to Assess Privacy Policy Completeness.’ (Proceedings of the 2012 ACM workshop on Privacy in the Electronic Society, October, 2012). <https://www.researchgate.net/publication/262278232_A_machine_learning_solution_to_assess_privacy_policy_completeness/citations>

[4] Shomir Wilson, Florian Schaub, Aswarth Dara, Sushain K. Cherivirala, Sebastian Zimmeck, Mads Schaarup Andersen, Pedro Giovanni Leon, Eduard Hovy, Norman Sadeh, ‘Demystifying Privacy Policies with Language Technologies: Progress and Challenges.’ (Text Analytics for Cybersecurity and Online Safety, 2016) <https://www.ta-cos.org/sites/ta-cos.org/files/TA_COS_2016.pdf>

[5] John Lyons, Natural Language and Universal Grammar. (New York: Cambridge University Press 1991). <https://archive.org/details/naturallanguageu0000lyon/page/n5/mode/2up?q=Natural+language>

[6] Cambridge English Dictionary. <https://dictionary.cambridge.org/dictionary/english/natural-language> accessed 9 December 2020

[7] Elisa Costante, Milan Petkovic, Yuanhao Sun, Jerry den Hartog, ‘A Machine Learning Solution to Assess Privacy Policy Completeness.’ (Proceedings of the 2012 ACM workshop on Privacy in the Electronic Society, October, 2012). <https://www.researchgate.net/publication/262278232_A_machine_learning_solution_to_assess_privacy_policy_completeness/citations>

[8]  Elisa Costante, Milan Petkovic, Yuanhao Sun, Jerry den Hartog, ‘A Machine Learning Solution to Assess Privacy Policy Completeness.’ (Proceedings of the 2012 ACM workshop on Privacy in the Electronic Society, October, 2012). <https://www.researchgate.net/publication/262278232_A_machine_learning_solution_to_assess_privacy_policy_completeness/citations>

[9] https://useguard.com/

[10] https://useguard.com/experiment

[11] Results for these analyzed websites can be accessed at https://useguard.com/

[12] Hamza Harkous, Kassem Fawaz, Remi Lebret, Florian Schaub, Kang G. Shin and Karl Aberer

[13] Polisis. <https://pribot.org/>

[14] Andy Greenberg, ‘An AI That Reads Privacy Policies So That You Don’t Have To.’ (Wired, September 2018) <https://www.wired.com/story/polisis-ai-reads-privacy-policies-so-you-dont-have-to/>

[15] Hamza Harkous, Kassem Fawaz, Remi Lebret, Florian Schaub, Kang G. Shin and Karl Aberer, ‘Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning.’ (Polisis, 2018) <https://pribot.org/files/Polisis_Technical_Report.pdf>

[16] In deep learning, a layer is the highest-level building block. A layer is a container that usually receives weighted input, transforms it with a set of mostly non-linear functions and then passes these values as output to the next layer. A layer is usually uniform, that is, it contains only one type of activation function, pooling, convolution etc., so that it can be easily compared to other parts of the network. The first and last layers in a network are called input and output layers, respectively, and all layers in between are called hidden layers. Note that the three layers of Polisis described in the text are tiers of the system’s architecture, each grouping related components, rather than layers of a single neural network.

[17] Hamza Harkous, Kassem Fawaz, Remi Lebret, Florian Schaub, Kang G. Shin and Karl Aberer, ‘Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning.’ (Polisis, 2018) <https://pribot.org/files/Polisis_Technical_Report.pdf>
