Guess Who? Combining Personal Data With Demographics Increases the Risk of Identity Theft

Max Pritchard


“Does your person have blue eyes?” “Does your person have massive eyebrows?” “Is your person wearing a hat?” “Is your person Frank?”

I was playing Guess Who? with my youngest son yesterday. Guess Who? is a game I played as a child and I’ve always enjoyed its simplicity. If you’ve never played it before, it’s a two-player game of asking yes/no questions to identify a randomly selected character from a panel of twenty-four possible candidates before your opponent guesses yours.

How many questions do you need to ask before you can identify an individual?

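With twenty-four characters on the board, the worst-case scenario is twenty-two questions if you only ever ask “Is your person [name]?”. That assumes your opponent’s character is not the one on your own card, which leaves twenty-three candidates: after twenty-two “no” answers, only one face is left standing.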

Players are quick to understand that by asking questions that divide the candidates into two groups, such as “is your person female?”, you can reduce the candidate pool more quickly.

If every question halved the size of the candidate pool, then four or five closed (yes/no) questions should be all that’s required. However, Guess Who? is designed so that its questions split the pool into two uneven groups, one with five potential candidates and the other with nineteen. That’s why there are only five women and nineteen men, five people with blue eyes, five people with moustaches.
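The arithmetic behind those numbers can be sketched in a few lines of Python. This is only an illustration of the reasoning above, not anything taken from the game's rules: the function names are my own, and the "information per question" figure uses the standard binary entropy formula.

```python
import math


def worst_case_by_name(candidates: int) -> int:
    """Questions needed if you only ask 'Is your person [name]?'."""
    # After candidates - 1 "no" answers, one face remains by elimination.
    return candidates - 1


def questions_if_halving(candidates: int) -> int:
    """Questions needed if every answer halves the remaining pool."""
    return math.ceil(math.log2(candidates))


def bits_per_question(minority: int, total: int) -> float:
    """Average information (in bits) gained from one uneven yes/no split."""
    p = minority / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


print(worst_case_by_name(23))       # naming strategy, 23 possible opponents
print(questions_if_halving(24))     # ideal halving over the full board
print(round(bits_per_question(5, 24), 2))   # the game's 5/19 split
print(round(bits_per_question(12, 24), 2))  # a perfect 12/12 split
```

A perfect split is worth one full bit per question; the 5/19 split is worth noticeably less on average, which is why the game lasts longer than five questions unless you get lucky.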

If you got lucky on your first question, you could experience the thrill of tapping closed all but five of the candidates, leaving your opponent vulnerable to a lucky guess, or to the knowledge that one or two more questions will give you the answer.

Identity Theft is Like a Game of Guess Who?
Guess Who? can teach us about the risks of identity theft through combining personal data

The characters in the Guess Who? population have changed over the years, but sticking to the game’s mathematical structure has led to accusations against the game publishers of sexism and racism by people concerned with realistic cultural representations of demographics.

Digital privacy invasion & data mapping

Guess Who? may not be an accurate representation of the proportion of a population who have facial hair, wear glasses or are capable of being assigned to one of only two gender groups based on looks alone, but the idea of certain characteristics making it easier to identify you is very much part of the modern digital world.

Take the UK postcode system, for example.

In (most of) the UK, geographic locations are commonly identified by a postcode. This is a two-part set of letters and digits separated by a space, designed in the 1950s to assist the post office in the sorting and delivery of mail. The first part of the postcode starts with letters that refer to a sizable conurbation or a part of London (for example MK is Milton Keynes, DA is Dartford, M is Manchester), followed by a one- or two-digit number identifying a district. After the first part of the postcode, there is an arbitrary second part with a numeral and two letters, which can narrow the location down to a small street or part of a street of postal addresses.
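To make that structure concrete, here is a minimal Python sketch that splits a postcode into the parts described above. The pattern is my own simplification, assumed for illustration: it is not an official validator, it will reject some legitimate special-case postcodes, and the example strings are only there to show the shape.

```python
import re

# Simplified shape: area letters, district digits (optionally one trailing
# letter), a space, then one digit and two letters. An assumption for
# illustration, not the full postcode specification.
POSTCODE_PATTERN = re.compile(r"^([A-Z]{1,2})(\d{1,2}[A-Z]?)\s+(\d)([A-Z]{2})$")


def split_postcode(postcode: str) -> dict:
    """Split a postcode string into area, district and second part."""
    match = POSTCODE_PATTERN.match(postcode.strip().upper())
    if match is None:
        raise ValueError(f"Not in the simplified postcode shape: {postcode!r}")
    area, district_digits, numeral, letters = match.groups()
    return {
        "area": area,                        # e.g. MK, DA, M
        "district": area + district_digits,  # the part people share freely
        "second_part": numeral + letters,    # narrows to a street or part of one
    }


print(split_postcode("MK9 1AA"))
print(split_postcode("m1 1aa"))
```

The point of separating the fields is that each one is a different "question" with a very different power to narrow the pool.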

If someone knows my full postcode, then I am only one of a hundred or so possible individuals – and so it feels personal and I may be reluctant to share that data. The postcode district (the first set of letters and numbers) identifies an area with an average population of around 25,000 people – and so I may feel safer sharing that.

However, the postcode district can be as small as 150 people (TR25 is St Martin’s in the Isles of Scilly). What this means is that where you live may determine how resistant you are to losing your privacy: people living in rural or low-population areas are more vulnerable to identification through data combination using the postcode as a pivot.
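In Guess Who? terms, each of those pool sizes is a board with a different number of faces on it. A rough sketch, assuming an idealised questioner whose every yes/no question halves the pool (real questions rarely do), shows how few questions are left once the postcode is known. The pool sizes are the figures quoted above; the function name is my own.

```python
import math


def remaining_yes_no_questions(pool_size: int) -> int:
    """Upper bound on perfect halving questions needed to single out one person."""
    return math.ceil(math.log2(pool_size))


pools = {
    "average postcode district": 25_000,
    "smallest district mentioned (TR25)": 150,
    "full postcode": 100,
}

for label, size in pools.items():
    questions = remaining_yes_no_questions(size)
    print(f"{label}: {size} people, at most {questions} perfect questions")
```

Living in the small district rather than the average one hands over roughly half of those questions for free.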

The Office for National Statistics (ONS) publishes anonymised statistics based on UK postcodes. To their credit, the ONS states that they swap records between data sets and do other things to frustrate re-identification. I am certain that commercial concerns are less diligent in their care of my identity.

When police state “A man in his twenties from St Martin’s was arrested on suspicion of XXX”, that may satisfy current anonymity reporting conditions, but it may not prevent journalists, investigators, victims, vigilantes, insurance companies or lawyers from identifying that person uniquely and doing what they will with that knowledge. If you start seeing lots of ads on Facebook for criminal lawyers in your area, you may want to check on the other members of your household.

As the volume of personal identifiers and quasi-identifiers online increases, the herd protection of living in a crowd will decrease. If celebrities and the wealthy are drawn to low population areas for reasons of privacy, it could be that they are actually decreasing their resistance to this form of digital privacy invasion and thus increasing their risk of being the victim of crime – stalking, confidence tricks, robbery, phishing.

Using combined demographics to identify individuals

It struck me that Guess Who? is a fantastic model for the kind of market segmentation used in advertising – and also a model for data protection for individuals. How many open questions would you need to ask, and have answered accurately, if you wanted to identify an individual member of a country’s population? Or in other words, how many pieces of personal data, other than someone’s name, would you need to uniquely identify them?
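One way to get a feel for the answer is a back-of-the-envelope model in Python. Everything in it is an assumption made for illustration: the population size and the number of possible answers per question are hypothetical round numbers, and the model pretends answers are evenly spread and independent of each other, which real demographics are not.

```python
from typing import Optional, Sequence


def questions_until_unique(population: int,
                           answers_per_question: Sequence[int]) -> Optional[int]:
    """Count open questions asked before the expected pool drops to one person.

    Each question with k equally likely answers is assumed to shrink the
    pool by a factor of k. Returns None if the pool never reaches one.
    """
    pool = float(population)
    for asked, possible_answers in enumerate(answers_per_question, start=1):
        pool /= possible_answers
        if pool <= 1:
            return asked
    return None


# Hypothetical figures, chosen only to show the shape of the calculation.
POPULATION = 1_000_000
QUESTIONS = [
    2,       # gender, modelled as two answers
    36_500,  # date of birth, roughly 100 years of days
    1_000,   # locality, an invented number of areas
]

print(questions_until_unique(POPULATION, QUESTIONS))
```

Open questions are so much more powerful than yes/no questions because a single answer can divide the pool by thousands rather than by two.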

In 2000, Professor Latanya Sweeney published a paper on a study of US census data from 1990. In the paper she describes how, by combining published healthcare demographics, which had apparently been anonymised, with census data, you could successfully re-identify individuals, in many cases from just a few pieces of personal data.

Professor Sweeney showed that the answers to just three questions could uniquely identify 53% of people in the US assuming the records were accurate. “When was your person born?”, “What gender is your person?”, “What town does your person live in?” When you use the five-digit ZIP code instead of town, that re-identification rate increases to 87%.
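The kind of measurement behind those percentages can be sketched as follows. This is not Professor Sweeney's code or data: the six records are invented, the field names are my own, and the only thing the sketch shows is the counting technique, which is to group records by a combination of fields and see how many groups contain exactly one person.

```python
from collections import Counter


def unique_fraction(records, fields):
    """Fraction of records that are the only one with their combination of fields."""
    keys = [tuple(record[field] for field in fields) for record in records]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Invented records for illustration only.
records = [
    {"born": "1970-01-01", "gender": "F", "town": "Town A", "zip": "00001"},
    {"born": "1970-01-01", "gender": "F", "town": "Town A", "zip": "00002"},
    {"born": "1980-05-05", "gender": "M", "town": "Town A", "zip": "00001"},
    {"born": "1980-05-05", "gender": "M", "town": "Town B", "zip": "00003"},
    {"born": "1990-09-09", "gender": "F", "town": "Town B", "zip": "00003"},
    {"born": "1990-09-09", "gender": "F", "town": "Town B", "zip": "00003"},
]

by_town = unique_fraction(records, ("born", "gender", "town"))
by_zip = unique_fraction(records, ("born", "gender", "zip"))
print(f"unique by town: {by_town:.0%}, unique by ZIP: {by_zip:.0%}")
```

Even in this toy data, swapping the coarse location (town) for the finer one (ZIP code) raises the share of people who stand alone in their group, which is the same direction of effect as the jump from 53% to 87%.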

This finding raised some alarms in the US, particularly when Professor Sweeney, then a graduate student, famously picked out Governor William Weld’s medical information, diagnosis, prescriptions and the like, from a combination of public news reports, low-cost voter records and health statistics.

Most hospitals in the US provide anonymous data to the government, and also sell it to companies in the healthcare and pharmaceutical industries. Those organisations have it in their power to combine that data with other data sets. Data sharing and monetising that data is big business.

These data sets are not always reliable, and the apparent ease of combination of data in the case of William Weld may not apply to everyone in a given population, but the amount of personal data available about each of us online is only increasing, and access to machine learning techniques to simplify automated combination of records will only broaden.
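To show how little machinery that combination needs, here is a minimal sketch of a linkage between an "anonymised" data set and a named one. All records, names and field names are invented for illustration, and this is a plain exact-match join rather than any particular organisation's method. It only claims a re-identification when exactly one named person shares the same date of birth, gender and ZIP code.

```python
from collections import defaultdict

QUASI_IDENTIFIERS = ("born", "gender", "zip")


def link_records(anonymised, named):
    """Attach a name to each anonymised record that matches exactly one named person."""
    index = defaultdict(list)
    for person in named:
        index[tuple(person[f] for f in QUASI_IDENTIFIERS)].append(person)

    linked = []
    for record in anonymised:
        candidates = index.get(tuple(record[f] for f in QUASI_IDENTIFIERS), [])
        if len(candidates) == 1:  # an unambiguous match: the crowd has gone
            linked.append({**record, "name": candidates[0]["name"]})
    return linked


# Invented data: a health extract with names removed, and a named public list.
health = [
    {"born": "1970-01-01", "gender": "F", "zip": "00001", "diagnosis": "condition X"},
    {"born": "1990-09-09", "gender": "F", "zip": "00003", "diagnosis": "condition Y"},
]
public_list = [
    {"name": "Person One", "born": "1970-01-01", "gender": "F", "zip": "00001"},
    {"name": "Person Two", "born": "1990-09-09", "gender": "F", "zip": "00003"},
    {"name": "Person Three", "born": "1990-09-09", "gender": "F", "zip": "00003"},
]

for match in link_records(health, public_list):
    print(match["name"], "->", match["diagnosis"])
```

The second health record stays anonymous only because two people on the public list share its details. That is the herd protection of living in a crowd, expressed as one line of code.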

Preserving anonymity in a digital world

“When was your person born?”
“What gender is your person?”
“What town does your person live in?”

In Internet chatroom slang, this used to be called A/S/L – a three letter code that was a shorthand request for someone’s age, sex and location. It’s been a long time since I’ve been in a chatroom, so it’s probably old hat now. As a parent, it is worrying to know how an honest answer to those questions could allow the recipient to uniquely identify someone.

We know that online predators of all types use privileged information to build rapport, win confidence, gain trust or otherwise manipulate others – whether the predator is an advertising firm, a stalker, a foreign power, our own government, or a paedophile.

How careful are we about giving out individual jigsaw pieces of our identities: our gender, age or date of birth, mobile phone and location? Odds are it’s not careful enough to protect our anonymity. Why do so many companies need to know my date of birth? And yet it is a required field in almost every online account I am asked to create in order to access a web service.

I booked a ferry the other day and they wanted to know my age, the age of my wife and of each of my children. Price comparison sites often have to ask for more granular information than they strictly need because not all of the sites they get prices from want exactly the same information – so they are compelled by their business strategy to collect a superset of personal information. I feel culturally compelled to be honest when signing up, and professionally inclined to lie.

Modern machine-learning techniques can identify patterns and connections with a high degree of confidence from public records, social media and other data sets. You don’t need to have a data breach in order for your information to become currency in the modern identity marketplace but, almost inevitably, that data will exit the control of one or more of the organisations that hold it. Even if they take my security seriously, that does not mean I should not take countermeasures against the risk of data theft.

Although someone’s name, age, gender and location are clearly pivotal – virtual identity fulcrums if you like – to connecting data records, it may be just as risky to share the fact that “I am a member of the local reading group”, if that reading group has a small membership. It is notionally easy for AI to create a digital equivalent to an incident room evidence board full of drawing pins and coloured string linking pieces of data.
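The reading group example is really just set intersection, and a small Python sketch makes the point. The identifiers and group memberships below are invented; the only claim is that each extra fact, however harmless it looks alone, can only shrink the pool of people who fit all the facts.

```python
def candidate_pool(*groups):
    """People who appear in every one of the given groups."""
    pool = set(groups[0])
    for group in groups[1:]:
        pool &= set(group)
    return pool


# Invented identifiers for illustration only.
lives_in_district = {f"p{n:02d}" for n in range(1, 11)}  # p01 .. p10
in_their_forties = {"p02", "p05", "p07", "p09"}
reading_group = {"p05", "p11", "p12"}

print(len(candidate_pool(lives_in_district)))
print(len(candidate_pool(lives_in_district, in_their_forties)))
print(candidate_pool(lives_in_district, in_their_forties, reading_group))
```

Ten people, then four, then one: the reading group membership does as much damage here as a date of birth would.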

We already use many of these techniques to map the public digital skin of multinational company networks for security purposes, but they could be used to pick out an individual’s identity as a digital thread just as easily as they are used to pick out a company trademark.

Is our personal data better protected under GDPR?

This is why, at least in my mind, the EU General Data Protection Regulation (GDPR) casts a wide net in terms of what it considers personal data. Taking someone’s name off a data record is not always sufficient to protect their identity, and thus their privacy, in the long term. However, GDPR is also concerned about data accuracy and, in the majority of my online interactions, that collides with my inclination to lie about my details.

One of the hardest parts of my job is trying to assess the risk to a data subject of a hypothetical breach of their data, or identity theft. I may find someone’s mobile phone number of no particular value and assign it a low risk, but to someone stalking a victim, that number can be the tool they use to commit their crime of harassment to the huge detriment of the data subject.

I cannot begin to guess at the kind of harm a threat actor might commit with any scrap of personal information that they can get their hands on. Every field of information might be the key that links two discrete data sets and places someone in harm’s way. That means I need to protect every bit (literally) of personal data as if it represents a high risk to the data subject. When everything is a high priority, nothing is high priority.

This global game of Guess Who? is interesting. The board game could be an important tool in teaching people about the risks of identity theft through combining personal information. It may be a crucial training tool for members of GCHQ trying to identify terrorists from a population of privacy-sensitive voters. It may be time for the venerable game to have a cyber-remake.