1

I tried a simple algorithm to anonymize data using a de-identification technique, but the code doesn't work for me. I want to anonymize the data by slightly changing the values of the strings and integers. The data sample is available here

This is what I have tried.

import pandas as pd 
import uuid as u 
import datetime as dt 
 # generate a pseudo-identifier sequence using Python's uuid library.

    def uudi_generator(length): 

    uudi_list= list() 
    i=0 
    while i < length: 
        uudi_list.append(u.uuid4()) 
    i+=1 
    return uudi_list 

#import original dataset 
dataset = pd.read_csv('bankcredit-data.csv') 

# pseudo identifier
sLength = len(dataset['housing']) 
dataset.insert(0, 'uuid', pd.Series(uudi_generator(sLength), index=dataset.index)) 

# Transaction record attached to the original
dataset.insert(0, 'transaction_date', pd.Series([dt.datetime.now]*sLength, index=dataset.index)) 

#transaction record is attached to original data file 
dataset.to_csv('bankcredit-data.csv') 

#delete identifiable record from dataset 
del dataset['firstnamme'] 
del dataset['lastname'] 

# export  de-identified dataset as csv to be shared with the user
dataset.to_csv('deidentified-data.csv')
Muhammad Ali

3 Answers

1

I don't have access to the input dataset, so I created a sample of my own and tried your code with a few modifications, and it worked.

input dataset:
housing     lastname    firstname
64403818    AA          AB
30893205    AC          AD
89883627    AE          AF
90302087    AG          AH

After I ran the code, the input dataset gained the uuid and transaction_date columns:

transaction_date    uuid                                    housing     lastname    firstname
10/31/2019 20:35    809b4505-2269-48b0-8833-e7502fc2738a    64403818    AA          AB
10/31/2019 20:35    7de91a91-0b58-4703-b62b-4278efe22b05    30893205    AC          AD
10/31/2019 20:35    d6b8cfbd-a9c2-4ffd-b336-0a23547445ea    89883627    AE          AF
10/31/2019 20:35    11db6b3a-9679-4422-b754-4c1b23aa4801    90302087    AG          AH

and the output dataset becomes
transaction_date    uuid                                    housing
10/31/2019 20:35    809b4505-2269-48b0-8833-e7502fc2738a    64403818
10/31/2019 20:35    7de91a91-0b58-4703-b62b-4278efe22b05    30893205
10/31/2019 20:35    d6b8cfbd-a9c2-4ffd-b336-0a23547445ea    89883627
10/31/2019 20:35    11db6b3a-9679-4422-b754-4c1b23aa4801    90302087


import pandas as pd
import uuid as u
from datetime import datetime

# Generate a sequence of pseudo-identifiers using Python's uuid library.
def uudi_generator(length):
    uudi_list = list()
    i = 0
    while i < length:
        uudi_list.append(u.uuid4())
        i += 1
    return uudi_list

# Import the original dataset
dataset = pd.read_csv('C:\\mylocation\\input_credit_data.csv', index_col=False)

# Pseudo-identifier: one UUID per row
sLength = len(dataset['housing'])
dataset.insert(0, 'uuid', pd.Series(uudi_generator(sLength), index=dataset.index))

# Transaction date: note the parentheses, datetime.now() is called here
dataset.insert(0, 'transaction_date', pd.Series([datetime.now()]*sLength, index=dataset.index))

# Write the transaction record back to the original data file
dataset.to_csv('C:\\mylocation\\input_credit_data.csv', index=False)

# Delete the identifiable columns from the dataset
del dataset['firstname']
del dataset['lastname']

# Export the de-identified dataset as CSV to be shared with the user
dataset.to_csv('C:\\mylocation\\output_bankcredit-data.csv', index=False)
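
For comparison, here is a shorter sketch of the same steps on an in-memory frame. The sample rows are made up to mirror the table above, and storing the UUIDs as strings is my own choice rather than something the original code does.

```python
import uuid
from datetime import datetime

import pandas as pd

# Hypothetical in-memory stand-in for the CSV used above.
dataset = pd.DataFrame({
    "housing": [64403818, 30893205],
    "lastname": ["AA", "AC"],
    "firstname": ["AB", "AD"],
})

# One random UUID per row, stored as strings (an assumption of this sketch).
dataset.insert(0, "uuid", [str(uuid.uuid4()) for _ in range(len(dataset))])

# A single timestamp, broadcast to every row.
dataset.insert(0, "transaction_date", datetime.now())

# Drop the identifying columns into a new frame; `dataset` keeps them.
deidentified = dataset.drop(columns=["firstname", "lastname"])
print(list(deidentified.columns))
```

Using drop(columns=...) returns a new frame, so the version with names and the de-identified version can be written out separately without reloading the file.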
Peter
1

Everything is working fine. A few observations:

  • The indentation of your uudi_generator seems incorrect, though that may have happened when pasting it here
  • dt.datetime.now should be changed to dt.datetime.now()
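
A small sketch of why the parentheses matter (the variable names are only for illustration): without them the list holds the function itself, so the column would contain function objects instead of timestamps.

```python
import datetime as dt

# Without parentheses the list holds the function object, not a timestamp.
not_called = [dt.datetime.now] * 3
print(callable(not_called[0]))

# With parentheses the function runs once and that one datetime is repeated.
called = [dt.datetime.now()] * 3
print(isinstance(called[0], dt.datetime))
```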

---- Edit, after seeing your comment above

Your code does not achieve the goal you are looking for, though.

You could try something like the following, which shifts the ASCII value of each character. This is just an example; the code is very inefficient for a big list.

def shift_ascii(name_string):
    # Show the input before it is shifted
    print("Hello "+name_string)
    # Move every character two code points forward
    newname_list = [chr(ord(ch)+2) for ch in name_string]
    newname_string = ''.join(newname_list)
    return newname_string

print("Hello "+shift_ascii("Roshan"))
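
If the per-character loop is too slow on a large column, one possible alternative is a translation table applied with pandas' Series.str.translate. This is only a sketch: the table covers ASCII code points 0 to 127 and leaves other characters unchanged, which differs from the function above, and the sample names are made up.

```python
import pandas as pd

SHIFT = 2  # same offset as the example above

# Translation table: each ASCII code point maps to the one SHIFT places later.
# Characters outside this range are left unchanged (an assumption of this sketch).
table = {code: code + SHIFT for code in range(128)}

names = pd.Series(["Roshan", "Ali", None])

# str.translate processes the whole column and passes missing values through.
shifted = names.str.translate(table)
print(shifted.tolist())
```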
10xAI
0

I have used this file as input - [https://filebin.net/p2k4lqbxfh209zd2/bankcredit-data.csv?t=zcathulf]

import pandas as pd

# Shift the code point of every character by 2
def shift_ascii(name_string):
    newname_list = [chr(ord(ch) + 2) for ch in name_string]
    newname_string = ''.join(newname_list)
    return newname_string

# Import the original dataset
dataset = pd.read_csv('bankcredit-data.csv', encoding="utf-8")
for col in dataset.columns:
    col_type = dataset[col].dtype
    if str(col_type) == "object":  # check whether the column holds strings
        dataset[col] = dataset[col].apply(shift_ascii)

# Export the de-identified dataset
# index=False drops the index column that pandas writes by default
dataset.to_csv('deidentified-data.csv', index=False)
# To get the original data back, run the output CSV through the same code with +2 changed to -2 in shift_ascii
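
To illustrate the last comment, here is a minimal round trip. shift_chars is a hypothetical helper that takes the offset as a parameter; it is not part of the code above.

```python
def shift_chars(text, offset):
    """Shift every character's code point by `offset` (hypothetical helper)."""
    return "".join(chr(ord(ch) + offset) for ch in text)

original = "Roshan"
masked = shift_chars(original, 2)    # what ends up in the output CSV
restored = shift_chars(masked, -2)   # the same shift in the other direction
print(masked, restored)
```

Because the shift can be undone this easily, it hides values from casual reading but does not make them unrecoverable.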
10xAI