To evaluate the effects of AI on data privacy, it is important to define the idea of privacy as it relates to data and to draw the distinction between public and personal data. Public data is defined as data that is freely accessible and can be reused and redistributed, both domestically and internationally, without any legal restriction [1]. Personal data can be thought of as any information that could identify a person, which that person can reasonably expect to be secure from public access [2].

Personal data can also be broken down further into two main types: non-sensitive personal data and sensitive personal data. The reason for this distinction is that stricter regulations govern how sensitive personal data may be processed. Under the GDPR, what was previously referred to as sensitive data is now referred to as special category data and includes any data about:

  • racial or ethnic origin
  • political opinions
  • religious or philosophical beliefs
  • trade union membership
  • genetic or biometric data
  • physical or mental health
  • sex life or sexual orientation

(It is important to note that the GDPR also treats criminal conviction data as a separate category; for the purposes of this article, "sensitive data" will encompass both special category data as defined in GDPR legislation and criminal conviction data.)

What is the Big Deal with Privacy?

There has been much debate around the issue of privacy and the misuse of personal data in recent years. For the individual, there is a philosophical argument that privacy is a fundamental human right that affords us the autonomy to manage personal boundaries, to protect ourselves from unwarranted interference in our lives [3] and to fully express our individuality.

Although it is essential to give up a degree of this autonomy to facilitate smooth interactions with both the physical and digital world, there is an inherent expectation that whenever we give up our personal information it will only be used to the degree to which it is necessary. For example, when you order a takeaway, you don't expect your address to be made publicly available as a result of your late-night culinary cravings.

However, when we are online it seems that this otherwise implicit expectation that other parties will respect our privacy needs to be made explicit. The laws around data privacy exist for this very reason: they are designed to support individuals in exercising their power in the digital realm against entities who are more powerful [4], where individuals don't have immediate control over what other social agents can do with their sensitive information.

Failing to keep sensitive personal data private means that malicious actors, whether individuals or organizations, could discriminate against someone, track their physical whereabouts, steal their identity or gain access to their financial accounts, amongst many other things. But why is this data so sought after anyway?

Why Data is so Valuable

With more and more of our daily interactions becoming digitised in recent years, there has been a huge increase in the supply of data generated from these digital interactions, both directly and indirectly. For example, supplying sensitive personal information such as credit card details during an online transaction is a form of directly supplying information about yourself. In the same transaction, you also divulge data about your shopping preferences through your searching and browsing activities, whilst looking for that perfect candle to buy for Mother's Day. All this data contains valuable information about the online habits and preferences of consumers (both as groups and as individuals). Companies use this wealth of information to gain insights into how to better interact with consumers through customer profiling, leading many people to term data "the new oil".

Customer profiling is defined as "a description of a customer, or set of customers, that includes demographic, geographic, and psychographic characteristics, as well as buying patterns, creditworthiness, and purchase history" [7]. In other words, data is used to paint a clear picture of a consumer, thus enabling businesses to more effectively target prospective audiences and innovate their offerings for existing customers. The more granular the customer profile, the better. This is where the power of AI and data analytics is leveraged.

Using AI means that insights can be derived more powerfully, at a larger scale and from a greater variety of data. This all contributes to identifying more complete and intricate patterns within the data that would otherwise have been very difficult or too time-consuming for humans to find. Furthermore, a type of AI known as a recommender system is often deployed to either replace or, more typically, supplement human decision making. Combining AI with the power of cloud computing means that data processing and recommendations can happen in real time. We see this most explicitly in e-commerce, such as when Amazon suggests "things inspired by your shopping trends" for you to buy, and also in adtech systems, where the perfect advert for something you just searched on Google always seems to pop up a few seconds later in your Instagram feed. How convenient, right?
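At its simplest, the recommendation idea described above boils down to counting which items tend to appear together across many customers' histories. The sketch below is a minimal, hypothetical illustration of that co-occurrence principle (the baskets and item names are invented for the example); production recommender systems use far more sophisticated models.

```python
from collections import defaultdict

# Toy purchase histories (hypothetical data).
baskets = [
    {"candle", "matches", "vase"},
    {"candle", "matches"},
    {"vase", "flowers"},
    {"candle", "flowers"},
]

def recommend(item, baskets, top_n=2):
    """Suggest items that most often co-occur with `item` in past baskets."""
    counts = defaultdict(int)
    for basket in baskets:
        if item in basket:
            for other in basket - {item}:
                counts[other] += 1
    # Rank by co-occurrence count (ties broken alphabetically).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

print(recommend("candle", baskets))  # → ['matches', 'flowers']
```

Even this trivial version shows why more data helps: every extra basket refines the co-occurrence counts, which is exactly the dynamic that makes consumer data so valuable.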

The aim of these systems is to add value for customers by making their online experiences more streamlined, and for the most part, they are appreciated. However, the sentiment changes quickly to that of concern when users get uncanny feelings that perhaps an entity somewhere knows them just a bit too well. This raises concerns around privacy and misuse of personal data, and indeed these concerns are not unfounded.

Privacy, Confidentiality and Security...what is the Difference?

When it comes to concerns about the misuse of personal data, the three terms privacy, confidentiality and security are often used interchangeably. However, each of these terms conveys a different concern. Let’s break down the nuances between these terms and highlight where they overlap.

Security refers to the procedural and technical measures required to prevent unauthorized access (or denial of access) to data stored or processed in a computer system. Whenever there is a breach of security it is also likely that there will be an accompanying confidentiality breach.

Confidentiality refers to the issue of sharing an individual's personal data with a third party without the individual's consent. Although it is not unheard of, it is atypical for hackers to break into a system just for the sake of testing their technical prowess. A hacker who has gained access to personal records will most likely be seeking to exploit this for personal gain in any way they can, whether financial or otherwise. Businesses can also profit from selling data to other companies who can benefit from its use, thus presenting another avenue through which confidentiality can be breached. If a company sells the data records of individuals to a third party without the individuals' approval to do so, this is a confidentiality breach.

Privacy refers to the justification for collecting data in the first place and the justification for which collected data can be used for a secondary purpose [5]. Data protection laws go a long way to deter the misuse of personal data, and in particular sensitive personal data. However, there are still ways in which privacy can be compromised even within the bounds of current legislation. To comply with regulations around how sensitive personal data should be handled, organizations that collect personal data apply a technique known as data anonymization.

Data anonymization refers to the process of stripping or encrypting personally identifying information from sensitive data [8]. This means that data is no longer identifiable to an individual when inspected, thus reducing the risk of a privacy breach occurring when data is shared or if a security breach occurs.
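To make the stripping/encrypting idea concrete, here is a minimal sketch of one common approach: suppress direct identifiers, generalise a quasi-identifier (the postcode), and keep a salted hash as a pseudonymous key. The record, field names and salt are all hypothetical, and real pipelines would use proper key management rather than a hard-coded salt.

```python
import hashlib

# Hypothetical record containing direct identifiers.
record = {
    "name": "Jane Doe",
    "email": "jane@example.com",
    "postcode": "SW1A 1AA",
    "diagnosis": "asthma",
}

DIRECT_IDENTIFIERS = {"name", "email"}

def anonymise(record, salt="example-salt"):
    """Suppress direct identifiers, coarsen the postcode, and derive
    a salted-hash pseudonym so records can still be joined internally."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["postcode"] = record["postcode"].split()[0]  # generalisation: outward code only
    out["pseudo_id"] = hashlib.sha256(
        (salt + record["email"]).encode()
    ).hexdigest()[:12]  # pseudonymisation, not true anonymisation
    return out

print(anonymise(record))
```

Note that the pseudonymised output is deliberately not fully anonymous: the hash lets the organisation link records over time, which is precisely the utility/privacy trade-off discussed next.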

The challenge posed is that data which is fully anonymised has limited analytical utility. So, oftentimes simpler techniques of data anonymisation, such as data suppression (the removal of selected information to protect individual identities and privacy [9]), are used as opposed to more complex techniques. The downside of taking simpler approaches to data anonymization is that the data can be de-anonymized with very little effort. De-anonymization refers to the process of inferring individual identities from data that doesn't contain explicitly identifying information. This is typically done by cross-referencing the anonymized data sets with other data, which could be sourced from other devices, applications, third parties or publicly available data, in order to identify personal information. AI offers the ability to gather, analyze, and combine vast quantities of data from different sources, which makes it a natural tool for data de-anonymization.
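The cross-referencing described above is often called a linkage attack. The sketch below illustrates the principle with invented data: an "anonymised" table keeps quasi-identifiers (ZIP, age, sex), and joining those against a hypothetical public dataset that still carries names re-identifies the rows. Real attacks work the same way, just at a far larger scale and with fuzzier matching.

```python
# "Anonymised" medical records: names removed, quasi-identifiers kept (hypothetical).
anonymised = [
    {"zip": "02138", "age": 54, "sex": "F", "diagnosis": "heart disease"},
    {"zip": "02139", "age": 31, "sex": "M", "diagnosis": "asthma"},
]

# Public voter-roll-style data that still contains names (hypothetical).
public = [
    {"name": "Alice Smith", "zip": "02138", "age": 54, "sex": "F"},
    {"name": "Bob Jones", "zip": "02139", "age": 31, "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "age", "sex")

def link(anonymised, public):
    """Re-identify anonymised rows whose quasi-identifiers match
    exactly one record in the public dataset."""
    index = {}
    for p in public:
        key = tuple(p[q] for q in QUASI_IDENTIFIERS)
        index.setdefault(key, []).append(p["name"])
    hits = []
    for a in anonymised:
        names = index.get(tuple(a[q] for q in QUASI_IDENTIFIERS), [])
        if len(names) == 1:  # a unique match means the person is re-identified
            hits.append((names[0], a["diagnosis"]))
    return hits

print(link(anonymised, public))
# → [('Alice Smith', 'heart disease'), ('Bob Jones', 'asthma')]
```

The attack only fails when several people share the same quasi-identifier combination, which is the intuition behind defences such as k-anonymity.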

How the Applications of AI and Big Data can Compromise Data Privacy

In fact, the way in which AI can be used for data de-anonymization is not too dissimilar to the way in which AI is applied to data for customer profiling. Considered through this lens, it is clear that there is a duality between using AI to solve the customer profiling problem and using it to solve the data de-anonymization problem. Here are a few examples of how the AI and big data ecosystem can compromise privacy.

  1. Psychological profiling

For instance, someone’s keyboard typing patterns can be used to deduce emotional states such as nervousness, confidence, sadness, and anxiety. Even more alarming, a person’s political views, ethnic identity, sexual orientation, and even overall health can be estimated from data such as activity logs, location data, and similar metrics. A famous example of psychological profiling being used for unethical purposes is the case of Cambridge Analytica, a company which gathered data from a Facebook personality survey that people were paid to take, in order to profile them for targeted political advertising.

  2. Connected devices and the internet of things (IoT)

A large proportion of modern internet data streams are enabled by connected remote devices, otherwise known as the internet of things (IoT). Consumer products in this category range from smart speakers to doorbell cameras, and typically have poor security measures for numerous reasons. For example, their small size means that there is often a lack of physical security measures built into the devices themselves. Consequently, the growing adoption of IoT-enabled devices in personal homes, coupled with this increased security risk, is a huge privacy concern. Attackers can potentially hack these devices to collect data about what consumers do within their own homes.

  3. Voice recognition and facial recognition systems

Voice recognition and facial recognition are two methods of identification that AI is becoming increasingly adept at executing. These methods have the potential to severely compromise anonymity in the public sphere: it is incredibly difficult to anonymise this kind of data when its very nature is unique to every individual. This sort of data does not neatly fall within the remit of current regulations, so law enforcement agencies could use facial recognition and voice recognition to find individuals without probable cause or reasonable suspicion [10], ultimately bypassing legal procedures that they would otherwise have to go through.


Leveraging the power of AI to analyse large amounts of personal data can provide valuable insights for businesses to better understand the desires of their customers and therefore tailor products and services to their needs, which in turn makes businesses more profitable. What's more, even businesses that do not use data analytics directly can profit from selling data to other companies who can benefit from its use. Therein lie clear incentives for businesses to find loopholes in data privacy legislation, trading off a potential loss of customer trust in the process. The pursuit of innovation is another reason why businesses toe the line on how they can use customer data, since much of the way in which technology companies innovate depends on it.

The use of personal data can also be extended to the domain of public services, such as healthcare and public safety: for example, predicting potential drug shortages, predicting when someone may have a heart attack, assisting medical professionals in diagnosing ailments, and predicting potential terror attacks. The plethora of data from all our online interactions leaves digital ‘fingerprints’ which we may not even be aware of, but which can be picked up by AI.

Using personal data is acceptable if the data is adequately anonymised to comply with regulation and users have given permission. However, applying AI to analyse this data makes it easier both to infer sensitive attributes and to link them back to individuals, even if the manner in which the data was processed is compliant with regulations. Permission agreements are often very long and overly complicated, meaning users are not exactly sure what they have agreed to. A source of further concern is keeping this data secure and confidential so it doesn't fall into the hands of people who could potentially abuse it for malicious purposes.

Regulators who seek to mitigate these risks have been challenged with finding the balance between protecting citizens and creating a regulatory environment which fosters innovation. But, does the rate of technological developments mean that they will always be playing catch-up?

