Updated December 2018
Earlier this year, California passed the California Consumer Privacy Act of 2018, or CCPA for short. Beginning in January 2020, companies will be required to comply with this new law. It places new restrictions on how companies handle personal data, including minimum damages for class action suits in response to data breaches.
Dattell is a data architecture and machine learning consulting company that provides strategy, engineering, and perspective about data collection, data storage, automation, data security, machine learning, and data visualizations.
In Part 1, we reviewed the new law and outlined why it is important for technology teams to be familiar with it. In Part 2, we will describe specific technology and data architecture optimizations for minimizing your company’s risk of a data breach.*
Why should technology teams care about the CCPA?
Civil class action suits in response to data breaches or theft of sensitive information (such as social security numbers or bank accounts) can include either actual damages or damages ranging from $100 to $750 per consumer per incident.
For instance, if unauthorized parties gain access to a database containing sensitive information for 100,000 California residents, damages could range from a minimum of $10,000,000 (100,000 × $100) up to $75,000,000 (100,000 × $750).
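The arithmetic behind that range can be sketched in a few lines of Python. The function name and the incidents parameter are illustrative assumptions; only the $100 and $750 per-consumer figures come from the text above, and nothing here is legal guidance.

```python
from typing import Tuple

MIN_DAMAGES_PER_CONSUMER = 100  # dollars per consumer per incident (lower bound cited above)
MAX_DAMAGES_PER_CONSUMER = 750  # dollars per consumer per incident (upper bound cited above)


def damages_range(consumers: int, incidents: int = 1) -> Tuple[int, int]:
    """Return the (minimum, maximum) damages in dollars for a breach."""
    if consumers < 0 or incidents < 0:
        raise ValueError("consumers and incidents must be non-negative")
    exposure = consumers * incidents
    return exposure * MIN_DAMAGES_PER_CONSUMER, exposure * MAX_DAMAGES_PER_CONSUMER


low, high = damages_range(100_000)
print(f"${low:,} to ${high:,}")  # $10,000,000 to $75,000,000
```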
Below we outline how to avoid data breaches. We will also provide best practices for storing data such that other technology requirements in the law can be met.
Avoid Data Breaches and CCPA Penalties
Even before the CCPA was passed, separating out your company’s and users’ sensitive information from the most trafficked and highly accessible databases has made good sense.
The idea is that you want to keep the most sensitive information in a database that sits behind firewalls (port-based and web access). Access to that database should require the use of a specific machine that is tightly monitored (see below).
Reconfiguring your data processing to keep complete user data in a firewall-protected database, and extracting only redacted or anonymized data for broader consumption, will allow you to comply more easily. Of course, this won’t be possible for all business models, but for those that can take this approach it is the most robust.
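As a rough illustration of extracting a redacted view, the sketch below drops sensitive fields and replaces the user identifier with a keyed hash before a record leaves the protected database. The field names (ssn, bank_account, user_id, videos_watched) and the key are hypothetical. Note that a keyed hash is pseudonymization rather than full anonymization, so treat this as a starting point only.

```python
import hashlib
import hmac

# Hypothetical names of fields that must never leave the protected database.
SENSITIVE_FIELDS = {"ssn", "bank_account"}


def redact_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record that is safer for broader consumption.

    Sensitive fields are dropped, and user_id is replaced by a keyed hash
    so rows can still be grouped per user without exposing the identifier.
    """
    safe = {
        key: value
        for key, value in record.items()
        if key not in SENSITIVE_FIELDS and key != "user_id"
    }
    user_id_bytes = str(record["user_id"]).encode("utf-8")
    safe["user_token"] = hmac.new(secret_key, user_id_bytes, hashlib.sha256).hexdigest()
    return safe


full_record = {
    "user_id": 42,
    "ssn": "000-00-0000",
    "bank_account": "0000",
    "videos_watched": 12,
}
public_view = redact_record(full_record, secret_key=b"example-key-not-for-production")
print(sorted(public_view))  # ['user_token', 'videos_watched']
```

The key would live only alongside the protected database, so that the token cannot be recomputed from outside it.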
Database separation is an important component of data security, and it supports efforts towards data privacy for customers. We will discuss two types of separation: Physical and Network.
Physical database separation.
For on-premises servers, physical separation is attained by splitting databases over multiple servers. For cloud-based infrastructure, physical separation is attained by splitting databases over multiple instances.
The reason databases are split over multiple servers or instances is that if one server/instance is hacked, the hack is isolated to that specific server/instance.
Additionally, when databases are physically separated, no individual server or instance contains all of the database’s information. This also allows web access firewalls to be more specific (see below for web-access firewalls).
Every computer has an IP address, which identifies the network and subnet it belongs to. A machine can communicate freely with other IP addresses inside its own subnet. Network separation means placing servers or instances in separate subnets, so that traffic between them must pass through a firewall.
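To make the subnet idea concrete, here is a minimal sketch using Python's standard ipaddress module. The addresses and the two /24 subnets are made-up examples, and the helper names are illustrative.

```python
import ipaddress
from typing import Iterable


def same_subnet(ip_a: str, ip_b: str, subnet: str) -> bool:
    """True when both addresses fall inside the given subnet."""
    network = ipaddress.ip_network(subnet)
    return ipaddress.ip_address(ip_a) in network and ipaddress.ip_address(ip_b) in network


def needs_firewall_hop(ip_a: str, ip_b: str, subnets: Iterable[str]) -> bool:
    """True when the two addresses share none of the listed subnets,
    meaning traffic between them has to cross a subnet boundary."""
    return not any(same_subnet(ip_a, ip_b, subnet) for subnet in subnets)


# Hypothetical layout: application servers in one subnet, the sensitive database in another.
subnets = ["10.0.1.0/24", "10.0.2.0/24"]
print(needs_firewall_hop("10.0.1.15", "10.0.1.40", subnets))  # False
print(needs_firewall_hop("10.0.1.15", "10.0.2.20", subnets))  # True
```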
For a secure data architecture, you want to include both kinds of separation. The figure below depicts how the databases are separated.
A web-access firewall only lets employees or applications talk to a database in specific, predetermined ways. For example, if a database hosts customer, payment, and operations data, the web-access firewall would allow requests for any of those three types of data.
However, if an account is phished and used to request sales data, that request would be denied before it reaches the database. Additionally, the suspicious behavior would be logged and would trigger an alert (see below for more on alerts).
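The allow-list behavior described here can be sketched as follows. This is an application-level illustration rather than a real firewall product: the request type labels mirror the example in the text, while the function name, logger name, and account names are assumptions.

```python
import logging

logger = logging.getLogger("web_access_firewall")

# Request types this database is expected to serve, per the example in the text.
ALLOWED_REQUEST_TYPES = {"customer", "payment", "operations"}


def filter_request(account: str, request_type: str) -> bool:
    """Allow only predetermined request types; log everything else."""
    if request_type in ALLOWED_REQUEST_TYPES:
        return True
    # The denied request never reaches the database, and the attempt is recorded
    # so that an alerting system can pick it up.
    logger.warning("Denied '%s' request from account '%s'", request_type, account)
    return False


print(filter_request("jess", "payment"))  # True
print(filter_request("jess", "sales"))    # False
```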
When a program wants to communicate on the network it will listen on its IP address and port number. Port-based firewalls only allow communications to happen over specific ports.
The point of the port-based firewall is to deny access if a program or a compromised machine or account tries to reach the wrong port. This firewall will also log suspicious activity.
As the figure above shows, there is both an inbound and outbound port-based firewall. The inbound port-based firewall limits the channels that can send requests to the database. The outbound port-based firewall limits the channels that the database can respond to.
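A minimal model of the inbound and outbound rules might look like the sketch below. The port numbers are placeholders chosen for illustration, and the class is a teaching aid rather than a substitute for your cloud provider's or operating system's firewall configuration.

```python
import logging
from dataclasses import dataclass
from typing import FrozenSet

logger = logging.getLogger("port_firewall")


@dataclass(frozen=True)
class PortFirewall:
    """Separate allow-lists for traffic entering and leaving the database host."""

    inbound_ports: FrozenSet[int]
    outbound_ports: FrozenSet[int]

    def allows(self, direction: str, port: int) -> bool:
        if direction == "inbound":
            allowed = self.inbound_ports
        elif direction == "outbound":
            allowed = self.outbound_ports
        else:
            raise ValueError("direction must be 'inbound' or 'outbound'")
        if port in allowed:
            return True
        # Blocked traffic is logged so that it can feed the alerting described later.
        logger.warning("Blocked %s traffic on port %d", direction, port)
        return False


# Hypothetical rule set: one port open in each direction.
firewall = PortFirewall(inbound_ports=frozenset({9200}), outbound_ports=frozenset({9200}))
print(firewall.allows("inbound", 9200))  # True
print(firewall.allows("inbound", 22))    # False
```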
When clients are looking to build new databases, we recommend NoSQL (often read as "not only SQL") because it offers the most capability for scaling. There are few things worse in the technology space than putting forward the money and effort to build a new database only to find it is no longer serving your needs after a short time.
The new CCPA changes what is considered sensitive information. Because access to your databases can be obtained through hacking servers or phishing employee accounts, it is ideal to reduce the number of employees that have access to users’ personal information.
Access Control is a security measure that regulates who (employees) or what (applications) has access to view or use information in a computing environment.
When granting access to a database, it is important to set parameters for what information or sets of information that employee or application has access to within the database.
In the figure above, the three users (Jess, Ryan, and Application) each have different permissions determined by what access each requires to complete assignments.
For example, in the entertainment space, one employee or application might have access to the lists of videos watched, and another employee or application might have access to videos watched and the accounts from which they were watched.
By setting tight restrictions, the amount of access is limited for both the employee and any ill-intentioned hacker or phisher that might infiltrate his or her account.
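One way to picture per-user permissions is a simple mapping from each employee or application to the datasets it may read. The specific grants below are invented for illustration, since the text does not spell out what Jess, Ryan, or the application can see; the dataset names follow the entertainment example.

```python
# Hypothetical grants; the real assignments depend on what each role needs.
PERMISSIONS = {
    "jess": {"videos_watched"},
    "ryan": {"videos_watched", "accounts"},
    "application": {"videos_watched"},
}


def can_access(principal: str, dataset: str) -> bool:
    """Unknown users or applications get no access by default."""
    return dataset in PERMISSIONS.get(principal, set())


def visible_fields(principal: str, record: dict) -> dict:
    """Return only the parts of a record this principal is allowed to see."""
    return {key: value for key, value in record.items() if can_access(principal, key)}


record = {"videos_watched": ["video-1", "video-2"], "accounts": "account-123"}
print(visible_fields("jess", record))      # {'videos_watched': ['video-1', 'video-2']}
print(visible_fields("ryan", record))      # both fields
print(visible_fields("stranger", record))  # {}
```

Denying by default means a compromised account exposes only what that one account was granted.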
Alert Suspicious Activity Before a Data Breach Occurs
When implemented correctly, Machine Learning tools can detect suspicious behavior before systems are compromised. In other words, you get the information you need to take action immediately, before data is breached.
Machine Learning is proficient at detecting patterns and anomalies in data. Once access control is set for users, Machine Learning can identify when database usage for any individual user is outside of the normal range.
For instance, if a user accesses a particular database on average one time per week, and then in a single hour accesses it 38 times, then Machine Learning will detect that anomaly and alert the designated parties.
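To give a feel for how unusual that burst is, the sketch below uses a simple Poisson model as a statistical stand-in for a trained Machine Learning model. It is not the method used in production systems described here; the model choice, the function names, and the cutoff of one in a million are all assumptions made for illustration.

```python
import math

HOURS_PER_WEEK = 7 * 24


def poisson_tail(observed: int, expected: float) -> float:
    """Probability of seeing at least `observed` events when `expected` is the average."""
    if observed <= 0:
        return 1.0
    cdf = sum(
        math.exp(-expected) * expected**i / math.factorial(i)
        for i in range(observed)
    )
    # Floating-point rounding can push the difference slightly below zero.
    return max(0.0, 1.0 - cdf)


def is_anomalous(observed_in_hour: int, weekly_average: float, p_cutoff: float = 1e-6) -> bool:
    """Flag an hour whose access count is wildly unlikely given the weekly average."""
    expected_per_hour = weekly_average / HOURS_PER_WEEK
    return poisson_tail(observed_in_hour, expected_per_hour) < p_cutoff


print(is_anomalous(38, weekly_average=1.0))  # True
print(is_anomalous(1, weekly_average=1.0))   # False
```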
You might be wondering why you need Machine Learning to alert on anomalies. The figure above depicts where threshold based alerting can miss suspicious behavior.
The black line depicts how the number of employee access events for a database changes over time. This employee’s behavior changes depending on the time of year, but it keeps to a semi-regular pattern of increases and decreases depending on the date.
Threshold Based Alerting.
With threshold based alerting, the threshold is typically set and not changed. In the figure above, the threshold was set (purple lines) during a time of mid-range access events.
The problem is that the employee surpasses this threshold whenever workload demands more access events, and stays well below it during low-workload periods.
As a result, threshold based alerting can over-alert during busy times and under-alert during slow times.
With Machine Learning based alerting, the boundaries for alertable events change as the average number of access events for an employee or application changes throughout the year (green lines).
This makes for a more accurate alerting system without the need for direct intervention, because the Machine Learning algorithm updates on its own in real time as new trends develop.
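The difference between a fixed threshold and moving boundaries can be shown with a small sketch. Rolling mean and standard deviation are used here as a simple stand-in for a learned model. The window size, the multiplier k, the minimum spread, and the sample counts are all illustrative assumptions.

```python
from statistics import mean, stdev
from typing import List, Sequence


def fixed_threshold_alerts(counts: Sequence[int], threshold: float) -> List[int]:
    """Indices where the count exceeds a threshold that never changes."""
    return [i for i, count in enumerate(counts) if count > threshold]


def adaptive_alerts(counts: Sequence[int], window: int = 7, k: float = 3.0) -> List[int]:
    """Indices where the count leaves a band that follows recent behavior."""
    alerts = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        center = mean(history)
        # Use a minimum spread of 1.0 so a very steady history does not
        # produce a band that is too narrow to be useful.
        spread = max(stdev(history), 1.0)
        if not (center - k * spread <= counts[i] <= center + k * spread):
            alerts.append(i)
    return alerts


quiet_period = [2, 3, 2, 3, 2, 3, 2, 9]         # the 9 is unusual for this period
busy_period = [20, 22, 21, 23, 20, 22, 21, 22]  # normal for a busy season

print(fixed_threshold_alerts(quiet_period, 10))  # []  (misses the spike)
print(adaptive_alerts(quiet_period))             # [7]
print(fixed_threshold_alerts(busy_period, 10))   # [0, 1, 2, 3, 4, 5, 6, 7]  (over-alerts)
print(adaptive_alerts(busy_period))              # []
```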
Machine Learning for Identifying Suspicious Activity.
Monitoring and automated alerting are important because they notify your technology team of suspicious behavior when it starts, not two hours or four months later.
In the example above, Jess’s account can be immediately disabled so that no further failed access attempts happen. Then an investigation can be initiated into her account activity.
With potential damages for data breaches starting at $100 per user, per incident, blocking an attack has the potential to save your company millions of dollars in damages.
Inappropriate Data Access.
Another way Machine Learning can identify a hacking or phishing event is if a user tries to access parts of the database that he or she does not typically access or does not have permission to access.
This kind of monitoring is also effective for detecting employees who are accessing or downloading data inappropriately.
Here are a few common alerts we routinely set up for our clients:
- Alert when an employee or application fails to access a database an abnormal number of times.
- Alert when an employee’s or application’s access is out-of-range of normal behavior.
- Alert when an employee or application tries to access an index outside of pre-assigned access.
- Alert when other anomalous behavior is identified with Machine Learning.
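Two of the alerts in this list, repeated failed access and access to an index outside pre-assigned permissions, can be sketched with plain rules. The event format, the permitted-index mapping, and the fixed failure limit are assumptions made for illustration; in practice the notion of an abnormal number would come from a learned baseline rather than a constant.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def evaluate_alerts(
    events: List[dict],
    permitted_indices: Dict[str, Set[str]],
    max_failures: int = 5,
) -> List[Tuple[str, str]]:
    """Scan access events in order and return (principal, reason) alerts.

    Each event is a dict with the keys 'principal', 'index', and 'success'.
    """
    alerts = []
    failures = Counter()
    for event in events:
        who = event["principal"]
        if event["index"] not in permitted_indices.get(who, set()):
            alerts.append((who, "index outside pre-assigned access"))
        if not event["success"]:
            failures[who] += 1
            # Alert once, at the moment the limit is first exceeded.
            if failures[who] == max_failures + 1:
                alerts.append((who, "abnormal number of failed accesses"))
    return alerts


permitted = {"jess": {"videos"}, "ryan": {"videos", "accounts"}}
events = [{"principal": "jess", "index": "videos", "success": False}] * 4
events.append({"principal": "ryan", "index": "payments", "success": False})

for alert in evaluate_alerts(events, permitted, max_failures=3):
    print(alert)
# ('jess', 'abnormal number of failed accesses')
# ('ryan', 'index outside pre-assigned access')
```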
* DISCLAIMER – The information provided in this post should serve solely as an overview for readers to understand why certain technology optimizations could be helpful for their companies, and it should not serve in any way as or take the place of legal advice. Companies should consult a legal professional and the law directly for more information.