Hashing: A best-practice anonymisation method

How are sensitive data kept private?

In a world where private data are becoming increasingly abundant, many data engineers will have to understand how to handle sensitive data appropriately. A dataset is sensitive when it contains information such as people's incomes, occupations, behaviour, physical or mental health, religious preferences, sexual orientation, wealth, security information or political views. Sensitive data can also exist at the firm level, such as firm size, revenue, suppliers and infrastructure information.

These sensitive types of information are what government regulations such as the General Data Protection Regulation (GDPR) in Europe and the Protection of Personal Information Act (POPI) in South Africa target in their legislation. These legislative acts govern data privacy and the consent of data subjects for their data to be collected, stored and processed. POPI is mostly about ensuring privacy when sensitive data are being collected and stored; communication is only a small part of that.

When private data are present in a dataset, it becomes necessary to hide fields that identify subjects, as those fields would link the sensitive data to individual subjects. Such identification fields include personal ID numbers and detailed location information. Data that are not aggregated enough also potentially make it possible to identify where a subject lies in a dataset (so aggregation is one method of resolving data privacy issues).

The original data are usually needed for accurate analysis, but the identifying fields should be hidden by the data engineer who prepares the dataset, before the data are made available to analysts. Below, I outline how the identifying fields can be 'fudged' using a secure method, whilst still retaining a correspondence with the original groups.

How to handle sensitive data

Let us discuss how to hash data. 'Hashing' is when a programmer takes various input strings (a string is a sequence of text characters), mixes them into a soup together with some security features such as a secret key, and produces a deterministic, scrambled code, using a cryptographic algorithm. (Hashing with a secret key is known as keyed hashing, or HMAC.) These algorithms are used whenever secure communication happens over the internet. What if I told you (you being a data engineer) that you could simply apply the same technique to scramble the identifying fields in a sensitive dataset, so that data analysts can use that dataset? Read on to see how it's done!
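
To give a flavour before the full implementation, here is a minimal sketch of keyed hashing in Python's standard hmac and hashlib modules (the key and message below are placeholders, not values from the method that follows):

import hashlib
import hmac

# Same key + same message always produce the same digest;
# without the key, the digest cannot be reproduced.
digest = hmac.new(
    b'placeholder-secret-key',   # in practice, a long random key
    b'some identifying text',    # the message being mixed into the 'soup'
    digestmod=hashlib.sha512
).hexdigest()
print(digest)  # 128 hexadecimal characters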

I use Python to run this method. Specifically, I use the SHA-512 cryptographic algorithm, via the hashlib and hmac modules. We start off by defining the function, as follows. Note that you should store a unique, long string of characters in an environment variable, for the secret key.

''' Hashing sensitive data '''
import os
import time
import hashlib
import hmac
import pandas as pd

_path = os.path.dirname(__file__)
def sign_ID(
    timestamp,  # Unix milliseconds
    username,   # Username of data scientist who is executing the code (to create an additional layer of security)
    key_secret = os.environ.get('secret_key'),
    body = ""   # ID-number and location information
    ):
    """
    Hashes personally-identifiable information.
    """
    payload = "{}{}{}".format( timestamp, username.lower(), body )
    message = bytearray(payload, 'utf-8')
    # soup:
    signature = hmac.new(
        bytearray(key_secret, 'utf-8'),
        message,
        digestmod=hashlib.sha512
    ).hexdigest()
    return signature
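
As a quick sanity check, you can call the function on made-up values (the username and body below are hypothetical, and this assumes the 'secret_key' environment variable has been set):

# Hypothetical example values:
test_signature = sign_ID(
    timestamp = 1600000000000,
    username = "JBloggs",
    body = "8001015009087-CapeTown-0042"
)
print(len(test_signature))  # 128 (hexadecimal characters)
print(test_signature[:10])  # a shortened version is used below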

The algorithm produces a sort-of 'soup', or hash, of all your identifying fields put together, which can be used as a new field to differentiate the individual subjects and their observation-level groupings. We save the system time as a variable (so that the correspondence between the hash and the identifying data can be preserved), and apply the function:

# Putting it into practice:
initial_timestamp = int(time.time()*1000)
dataset = pd.read_csv('/work/project/data-raw/test-data.csv')
dataset['id_concat'] = dataset[
        ['Location', 'SerialNo', 'ID']
    ].astype(str).apply("-".join, axis=1)
dataset['hash_address'] = [  # list comprehension
    sign_ID(
        timestamp = initial_timestamp,
        username = os.getlogin(),
        body = x
    )[:10] for x in dataset['id_concat']
]
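
Since the hashes can only be regenerated from the raw data if the same timestamp (and key) are used, one option is to record the timestamp somewhere secure alongside the restricted raw data. A hypothetical sketch (the path is illustrative):

# Hypothetical: store the timestamp with the raw (restricted) data,
# so that the same hashes can be regenerated later if ever needed.
with open('/work/project/data-raw/hash-timestamp.txt', 'w') as f:
    f.write(str(initial_timestamp))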

If a recipient of your data knows your secret key, they could attempt to reverse-engineer your hashes with a dictionary attack, re-hashing candidate inputs until they find a match. But, in the implementation above, they would also need to know the exact timestamp used. Unless they receive the data within milliseconds of you generating it, no-one is likely to guess it. I consider the timestamp local variable one of the most critical security features in this function, as it acts like a salt: it is nearly impossible for anyone to find out at exactly which millisecond the computer ran the script.
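
To see why the timestamp matters, note that shifting it by even a single millisecond changes the digest completely (hypothetical values again):

# A one-millisecond difference yields an entirely different hash:
print(sign_ID(timestamp = 1600000000000, username = "JBloggs", body = "x")[:10])
print(sign_ID(timestamp = 1600000000001, username = "JBloggs", body = "x")[:10])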

If you were securely authenticating a request to an API, you would send the entire hash signature, but to save space in your dataset you only need to keep the first 10 or so characters. (Ten hexadecimal characters give roughly 10^12 possible values, so the chance of two different subjects colliding on the same prefix is tiny for datasets of up to tens of thousands of rows; keep more characters for very large datasets.) Now we can drop the fields that were hashed, and save the data as an anonymised dataset:

dataset = dataset.drop(['Location', 'SerialNo', 'ID', 'id_concat'], axis=1)
dataset.to_csv('/work/project/data-anon/test-data.csv')
dataset.to_parquet('/work/project/data-anon/test-data.gzip')

Ready for Analysis

Now your data are ready to be distributed to data analysts. Just for clarity, here are examples of what random hashes of 10 characters look like:

f3d9bb2a27
422532b526
4e1e1e2201
0ef9d71f60
4ca5de574a
42a3ec07c2
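
For example, an analyst could aggregate the anonymised dataset by the hashed field. A hypothetical sketch, assuming the CSV saved above contains at least one numeric column:

import pandas as pd

# Group observations by the anonymised key and summarise:
anon = pd.read_csv('/work/project/data-anon/test-data.csv')
summary = anon.groupby('hash_address').mean(numeric_only=True)
print(summary.head())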

The beauty of this method is that the hash can still be used to group observations according to the fields that were concatenated, as the mapping is one-to-one, even though there is no way of knowing who the subjects are! I hope that you have learnt something new today about this cryptographic method, and that you will consider applying it to your management of sensitive datasets! I have shown a best-practice method of anonymising data: the algorithm is strong, and the technique is nearly impossible to reverse-engineer or hack.

Aidan Horn