Protecting against both implicit and explicit bias has always been an important aspect of deploying machine learning models in regulated industries, such as credit scoring models under the Fair Credit Reporting Act (FCRA) and insurance underwriting models under the requirements of state regulators.
Over the last few years, there has been a lot of surprise about how easy it is for bias to become part of large datasets used in transfer learning, such as ImageNet, and for bias to influence machine learning systems that are not carefully built or adequately tested.
In part, this is because machine learning and deep learning frameworks, such as TensorFlow, PyTorch, and Keras, make it quite easy to develop a model without understanding how the model works internally, and they do not provide built-in tools for testing against bias.
For historical context, it is useful to look at the language in the FCRA (Section 1002.2(p)(1)), which was originally passed in 1970, about fifty years ago.
(1) A credit scoring system is a system that evaluates an applicant’s creditworthiness mechanically, based on key attributes of the applicant and aspects of the transaction, and that determines, alone or in conjunction with an evaluation of additional information about the applicant, whether an applicant is deemed creditworthy. To qualify as an empirically derived, demonstrably and statistically sound, credit scoring system, the system must be:
(i) Based on data that are derived from an empirical comparison of sample groups or the population of creditworthy and non-creditworthy applicants who applied for credit within a reasonable preceding period of time;
(ii) Developed for the purpose of evaluating the creditworthiness of applicants with respect to the legitimate business interests of the creditor utilizing the system (including, but not limited to, minimizing bad debt losses and operating expenses in accordance with the creditor’s business judgment);
(iii) Developed and validated using accepted statistical principles and methodology; and
(iv) Periodically revalidated by the use of appropriate statistical principles and methodology and adjusted as necessary to maintain predictive ability.
The FCRA is enforced by the Federal Trade Commission and the Consumer Financial Protection Bureau. (The seal of the Consumer Financial Protection Bureau is at the top of this post.)
The key words, italicized here and in the FCRA itself, are: *empirically derived, demonstrably and statistically sound*. A good way to think about this is that if there are two subpopulations, each with a certain natural incidence level of some variable, say defaulting on credit card payments, and a model (say a default model) predicts this variable from certain features, then in general the predicted incidence level of that variable should approximately match its natural incidence level for each subpopulation.
As a simplified example, suppose males between 25 and 35 years of age who are currently employed, and who have been employed continuously for at least 3 years, have a credit card default rate of 3.25%. If your predictive model adds a zip+4 feature to predict default rates (something that should not be done in a credit or default model), you might inadvertently introduce bias into the model due to correlations between zip+4 and certain factors that should not be used as features in credit or default models. For this reason, usually just the first three digits of a zip code are used as a feature in a credit or default model.
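To make this concrete, here is a minimal sketch (with made-up numbers and hypothetical column names) of how one might compare a subpopulation's observed default rate against a model's average predicted default probability:

```python
import pandas as pd

# Hypothetical data: observed defaults, model-predicted default
# probabilities, and a couple of attributes defining the subpopulation.
df = pd.DataFrame({
    "age":          [28, 31, 26, 34, 29, 33],
    "employed_3yr": [True, True, True, True, True, True],
    "defaulted":    [0, 0, 1, 0, 0, 0],
    "pred_default": [0.02, 0.03, 0.35, 0.04, 0.03, 0.05],
})

# Restrict to the subpopulation of interest (ages 25-35, continuously
# employed for at least 3 years, as in the example above).
group = df[df.age.between(25, 35) & df.employed_3yr]

natural_rate   = group["defaulted"].mean()     # observed incidence
predicted_rate = group["pred_default"].mean()  # model's expected incidence

print(f"observed default rate:  {natural_rate:.2%}")
print(f"predicted default rate: {predicted_rate:.2%}")
# A statistically sound model's predicted incidence should approximately
# match the observed incidence for the subpopulation; a large gap is a
# red flag that deserves investigation.
```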
Bias comes into machine learning systems in various ways:
- Labeled training data can be biased for the simple reason that it is labeled by individuals with either implicit or explicit biases.
- Training data can be biased because it comes from systems that are not used by individuals from all socioeconomic, ethnic, or other groups that the system may be applied to in the future. For example, facial analysis systems often have higher error rates for minorities due to unrepresentative training data (see the sketch after this list).
- Biases may be present in components of an algorithm, for example in a general transfer learning dataset from a third party that is used to pretrain a deep learning model before training on the data for the specific model being developed.
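As a rough illustration of the second point, the sketch below (using invented group labels and target shares) compares how each group is represented in a training set against the population the deployed system is expected to serve:

```python
from collections import Counter

# Hypothetical group labels for the training set, and the share of each
# group in the population the deployed system is expected to serve.
train_groups = ["A"] * 900 + ["B"] * 80 + ["C"] * 20
target_population = {"A": 0.60, "B": 0.25, "C": 0.15}

counts = Counter(train_groups)
n = sum(counts.values())

for grp, target_share in target_population.items():
    train_share = counts.get(grp, 0) / n
    gap = train_share - target_share
    # Flag any group whose training-set share falls well below its
    # expected share in the target population.
    flag = "UNDER-REPRESENTED" if gap < -0.05 else "ok"
    print(f"group {grp}: train {train_share:.1%} vs target {target_share:.1%} ({flag})")
```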
On the other hand, systems that use machine learning have some important advantages compared to manual systems:
- They process data automatically and uniformly and do not reflect the prejudices of individual humans. As an example, credit scoring based on machine learning has provided credit to many who were previously “red-lined” and denied credit.
- Algorithms can be examined for biases.
There is an increasing amount of information about best practices for building machine learning and AI models in ways that identify and work to eliminate implicit and explicit biases. Some standard advice for guarding against bias in machine learning, adapted from Google’s Inclusive ML Guide [https://cloud.google.com/inclusive-ml/], includes:
- Design your model from the start with concrete goals for fairness
- Use representative datasets to train and test any systems
- Check the system for unfair biases (see the sketch after this list)
- Analyze performance
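As one illustration of checking for unfair biases, the sketch below (with made-up labels, predictions, and group assignments) computes the difference in true positive rates across two groups on a held-out test set, a common equal-opportunity style check:

```python
import numpy as np

# Hypothetical arrays: true labels, model decisions, and a protected
# group attribute for a held-out test set.
y_true = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 0, 0, 0, 1])
group  = np.array(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])

def true_positive_rate(y_true, y_pred):
    """Fraction of actual positives that the model accepts."""
    positives = y_true == 1
    return (y_pred[positives] == 1).mean()

tpr = {g: true_positive_rate(y_true[group == g], y_pred[group == g])
       for g in np.unique(group)}

print("TPR by group:", tpr)
print("equal-opportunity gap:", abs(tpr["A"] - tpr["B"]))
# A large gap in true positive rates across groups is one common
# indicator of unfair bias that warrants further investigation.
```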