When we started to work on this overview, we decided we needed to answer one basic question first:
How Do Face Recognition Systems Work?
At its most basic, facial recognition, a technology born of applying mathematical techniques to the study of the human brain, has four core elements:
- The Algorithm, inspired by the human brain.
- A Camera, in order to have vision, inspired by the human eye.
- A Database, needed in order to have a memory of known and unknown humans, inspired by human memory’s ability to recall faces (research suggests a typical person can recall about 5,000 faces across their lifetime).
- Processing Power, also inspired by the innate processes of the human brain.
As a biometric system, a facial recognition system, as generally understood, operates in either or both of two modes when a face is detected:
- Verification, a 1:1 comparison of the detected face against a single claimed identity.
- Identification, a 1:N search of the detected face against a gallery of enrolled faces.
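To make the distinction concrete, here is a minimal sketch of how the two modes differ once a detected face has been reduced to a numeric embedding. The embedding size, the cosine-similarity measure, the 0.6 threshold, and the toy gallery are all illustrative assumptions, not a description of any particular vendor’s pipeline.

```python
# Minimal sketch: 1:1 verification vs. 1:N identification over face embeddings.
# All names, thresholds, and the toy gallery are illustrative assumptions.
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity between two face embeddings, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def verify(probe: np.ndarray, claimed: np.ndarray, threshold: float = 0.6) -> bool:
    """1:1 verification: does the probe match the single claimed identity?"""
    return cosine_similarity(probe, claimed) >= threshold


def identify(probe: np.ndarray, gallery: dict[str, np.ndarray],
             threshold: float = 0.6) -> list[tuple[str, float]]:
    """1:N identification: rank every enrolled identity that clears the threshold."""
    scores = [(name, cosine_similarity(probe, emb)) for name, emb in gallery.items()]
    return sorted((s for s in scores if s[1] >= threshold),
                  key=lambda s: s[1], reverse=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gallery = {f"record-{i}": rng.normal(size=128) for i in range(5)}
    probe = gallery["record-3"] + rng.normal(scale=0.1, size=128)  # noisy re-capture
    print("1:1 verified:", verify(probe, gallery["record-3"]))
    print("1:N candidates:", identify(probe, gallery))
```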
Controversy: The Whys and Whats
Facial recognition technology (FRT), though, has long been at the center of heated debates about privacy, fairness, and accuracy. Much of this scrutiny emerged following early academic and governmental studies, including a widely cited 2018 paper by MIT Media Lab’s Dr. Joy Buolamwini, co-authored by Dr. Timnit Gebru, which found that commercial facial analysis algorithms disproportionately misclassified women and people of color: error rates were as high as 34.7% for darker-skinned women, compared with a maximum of 0.8% for lighter-skinned men.
This paper was later reinforced by a NIST 2019 study that empirically demonstrated significant demographic differentials across algorithms. NIST evaluated 189 mostly commercial algorithms from 99 developers.
That NIST study used four large datasets of photographs collected in U.S. governmental applications that were then in operation:
- Domestic mugshots collected in the United States.
- Application photographs from a global population of applicants for immigration benefits.
- Visa photographs submitted in support of visa applications.
- Border crossing photographs of travelers entering the United States.
Together, these comprised about 18.27 million images from about 8.49 million people.
There were two inherent issues with these studies. The first was dataset size.
Even MIT Media Lab’s oft-quoted landmark study, which tested three commercially available systems and was reportedly the first of its kind to have gender parity in its dataset, used just 1,270 publicly available faces: those of lawmakers from a diverse set of countries (from Africa to the Nordic nations) with a high proportion of women holding office.
While fairly large by U.S. or western European standards, that dataset remained small by the standards of algorithmic testing done at scale in Asia. Second, NIST itself admitted that while the first three sets (mugshots, immigration application photos, and visa photos) had “good compliance with image capture standards,” the last set, photos of persons crossing the border, “did not, given constraints on capture duration and environment.”
What did they mean by that?
FR applications may be divided into two broad categories in terms of a subject’s cooperation:
- Cooperative Subject Scenarios
- Noncooperative Subject Scenarios
The cooperative case arises when the subject willingly presents his or her face in a proper way (for example, in a frontal pose with a neutral expression and open eyes). In the noncooperative case, which is more typical of security and surveillance applications, the subject is generally unaware of being identified.
According to the Handbook of Face Recognition, “In terms of distance between the face and the camera, near field face recognition (less than 1 meter) for cooperative applications is the least difficult problem, whereas far field noncooperative applications is the most challenging.”
Applications that fall between these two categories can also be expected: for example, face-based access control at a distance, where the subject is willing to cooperate but cannot present his or her face in a favorable position relative to the camera. Such cases may still challenge the system, even though they are easier than identifying a noncooperative subject. However, according to the Handbook’s editors, in almost all cases ambient illumination is the foremost challenge for most FR systems.
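As a rough illustration of how these categories can translate into engineering choices, the sketch below maps capture conditions to a decision threshold. The scenario names, the threshold values, and the always-on human-review flag are assumptions made for illustration; they are not parameters taken from the Handbook or from any deployed system.

```python
# Illustrative only: one way a system might tighten its matching policy as capture
# conditions degrade from near-field cooperative to far-field noncooperative.
# Scenario names and threshold values are assumptions, not published parameters.
from dataclasses import dataclass
from enum import Enum


class CaptureScenario(Enum):
    NEAR_FIELD_COOPERATIVE = "near_field_cooperative"      # < 1 m, frontal pose, good light
    MID_FIELD_COOPERATIVE = "mid_field_cooperative"        # willing subject, poor angle or light
    FAR_FIELD_NONCOOPERATIVE = "far_field_noncooperative"  # surveillance-style capture


@dataclass
class MatchPolicy:
    similarity_threshold: float   # higher = fewer false positives, more false negatives
    require_human_review: bool


POLICIES = {
    CaptureScenario.NEAR_FIELD_COOPERATIVE: MatchPolicy(0.60, require_human_review=True),
    CaptureScenario.MID_FIELD_COOPERATIVE: MatchPolicy(0.70, require_human_review=True),
    CaptureScenario.FAR_FIELD_NONCOOPERATIVE: MatchPolicy(0.80, require_human_review=True),
}


def policy_for(scenario: CaptureScenario) -> MatchPolicy:
    """Harder capture conditions get a stricter threshold; human review is always required."""
    return POLICIES[scenario]
```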
Despite all this, the concerns raised in 2018 and 2019 were valid at the time. However, the FRT landscape has changed significantly since then. Advances in machine learning, the adoption of global and more balanced training datasets, and the development of stronger algorithmic architectures have greatly reduced the demographic bias problem.
Biometrica is not a facial recognition company. We do use NIST-approved third-party providers that supply us with results from an FRT search query run against our UMbRA database, which is 100% law enforcement-sourced. We have no access at any point to the biometric templates or biometric identifiers from that query, which is conducted in a secure, isolated black-box environment by the NIST-approved third party. But we have gone a step further by intentionally designing our systems and solutions around privacy, human oversight, and the prevention of mass surveillance. This explainer outlines both the historical challenges and how we, and the broader industry, have adapted.
The Early Problems with FRT: Why Bias Existed in Early Algorithms
What does this mean? In simple terms, if we humans have inherent biases and a predisposition to believe ill of people of a certain race or ethnicity, then, without effective algorithms and algorithmic accountability, appropriate training, or a system of checks and balances in place, those biases will be reflected in the data we collect, in the algorithms trained on that data, and ultimately in our decision-making process.
But things did change, and algorithms did get better as they were exposed to more and more diverse data. In addition, FR algorithms grew stronger as developers came to understand that a face captured as part of a time series (a series of photos of an individual taken over a period of time, rather than many photos taken at the same time) made for better training material. Again, algorithms could be trained.
So, what did the NIST 2019 Study on Demographic Differentials find?
- False positive rates were significantly higher for Asian and African American faces than for white faces in U.S.-developed algorithms (see the illustrative sketch below).
- Native American and Indigenous faces showed some of the highest error rates.
- Women, children, and elderly individuals were also more likely to be misidentified.
Critically: NIST found that algorithms developed in Asian countries did not show the same biases against Asian faces, directly tying diversity of training data to performance.
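To see why a gap in false positive rates matters operationally, consider the back-of-the-envelope arithmetic below (the sketch referenced in the findings above). The per-group search volume and the two rates are invented for illustration; they are not NIST’s published figures.

```python
# Hypothetical numbers only: how a demographic gap in false positive rate (FPR)
# compounds with search volume. These are not NIST's published figures.
searches_per_group = 100_000          # assumed identical search volume per group
false_positive_rates = {
    "group_a": 1e-4,                  # hypothetical lower-FPR group
    "group_b": 1e-3,                  # hypothetical group with a 10x higher FPR
}

for group, fpr in false_positive_rates.items():
    expected_false_matches = searches_per_group * fpr
    print(f"{group}: FPR={fpr:.0e} -> ~{expected_false_matches:.0f} expected "
          f"false matches per {searches_per_group:,} searches")

# A 10x gap in FPR becomes a 10x gap in people wrongly flagged, which is why
# demographic parity in error rates matters even before any human review step.
```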
The Turning Point: Advances Driving Change
Since 2019, the industry has made remarkable progress:
- Global and Diverse Training Datasets: Modern algorithms are trained on datasets representing millions of diverse individuals across gender, age, and ethnicity.
- Algorithmic Improvements: There has been significant progress in convolutional neural networks (CNNs) and deep learning.
- NIST FRVT Benchmarks: Ongoing independent testing has shown that leading algorithms today outperform older models by orders of magnitude.
NIST FRVT 1:N (2020–2024)
The most recent NIST results demonstrate:
- Drastic reductions in bias across demographic groups.
- Near-parity between demographic groups for several leading algorithms.
- Significant reductions in false positives and false negatives.
- Continuing improvements year-on-year.
Biometrica’s Approach Against This Backdrop:
Purposeful Design Choices
- We chose not to develop any FRT algorithms, in order to prevent our own access to facial templates and any biometric identifiers and to preserve a separation of functionality. Instead, we decided to focus on the data itself: amalgamating, cleaning up, and standardizing it.
- We only license NIST-tested, FedRAMP Moderate/High approved third-party algorithms.
- All FRT operations are performed in a secure, isolated black box environment to which Biometrica has no access.
- We made a conscious decision to require data provenance and a legally permissible reason for ingesting data. That meant we could not ingest any data from social media, and we also chose not to ingest data from credit reports, property records, licenses, etc. To be in our dataset, an individual had to have a verified law enforcement history and, because of that, be reasonably believed to be a threat to public safety at a particular point in time, or be a victim.
- Our systems neither ingest nor retain any juvenile data, unless a juvenile offender has been charged as an adult or our sensors are searching for a minor who is in our database because an agency has classified them as a missing person. This data is non-searchable.
- We separate all data silos at the back end to prevent reassembly of a full individual profile, thereby protecting PII in case of any attack. If the systems are ever breached (which has not happened in more than 25 years, but it is best to be prepared), any hypothetical digital intruder would have access only to one piece of a person’s record, not the whole, and would have no way to assemble all the pieces without access to a unique key (a conceptual sketch of this siloing follows this list).
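The conceptual sketch below illustrates that siloing under stated assumptions: hypothetical silo names, a hypothetical key-derivation step, and placeholder record fragments. It is not Biometrica’s actual schema or key-management design.

```python
# Conceptual sketch only: the pieces of a record live in separate silos, so no
# single silo yields a full profile, and reassembly requires a linking key that
# is held outside the data stores. Names and fields are hypothetical.
import hashlib
import secrets


def derive_silo_key(linking_key: str, silo_name: str) -> str:
    """Per-silo lookup key; without the linking key the silos cannot be joined."""
    return hashlib.sha256(f"{linking_key}:{silo_name}".encode()).hexdigest()


linking_key = secrets.token_hex(16)   # held separately from the silos themselves

silo_identity = {derive_silo_key(linking_key, "identity"): {"identity_fragment": "..."}}
silo_history = {derive_silo_key(linking_key, "history"): {"history_fragment": "..."}}


def reassemble(key: str) -> dict:
    """Only a holder of the linking key can join the fragments back together."""
    return {
        **silo_identity[derive_silo_key(key, "identity")],
        **silo_history[derive_silo_key(key, "history")],
    }


# An intruder who dumps one silo sees only one fragment and cannot compute the
# other silo's lookup key without the linking key.
print(reassemble(linking_key))
```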
The Human-in-the-Loop Safeguard
Facial recognition is most effective when used as a combination of human intelligence and machine intelligence. Why? Because the two balance each other out. Algorithmic intelligence allows us to take huge amounts of data, sift through it, and process it at great speed to provide a handful of potentially life-saving results in real time, while eliminating some inherent human biases in decision-making processes.
Humans can then sift through that handful of results, validate them, and make discerning choices that take other factors into consideration, ensuring both algorithmic accountability and humaneness and reducing the chance of a false positive, or a false negative, that would negatively impact an individual’s life and disproportionately harm disadvantaged communities. With that in mind, Biometrica implemented the following protocols:
- Every algorithmic result received via our third-party, NIST-approved algorithm for our RTIS/RVIS system is reviewed by a trained human analyst before being considered actionable, across every single Biometrica solution.
- Analysts are trained in anti-bias protocols.
- Relevance, not mere identity, is the trigger for alerts (a minimal workflow sketch follows this list).
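The minimal workflow sketch below shows, in schematic form, the rule that no algorithmic candidate becomes actionable without analyst confirmation and a relevance determination. The class names, fields, and example record ID are hypothetical; the logic is a simplified stand-in for the protocols described above.

```python
# Schematic sketch: an algorithmic candidate match is never actionable until a
# trained analyst confirms both the match and its relevance. Names are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateMatch:
    record_id: str
    similarity: float


@dataclass
class AnalystReview:
    confirmed_match: bool     # analyst agrees the faces correspond
    relevant: bool            # the record actually matters in this context


def is_actionable(candidate: CandidateMatch, review: Optional[AnalystReview]) -> bool:
    """No review, no alert; a confirmed match that is not relevant is still no alert."""
    if review is None:
        return False
    return review.confirmed_match and review.relevant


if __name__ == "__main__":
    match = CandidateMatch(record_id="record-000123", similarity=0.91)
    print(is_actionable(match, None))                                               # False: unreviewed
    print(is_actionable(match, AnalystReview(confirmed_match=True, relevant=False)))  # False: not relevant
    print(is_actionable(match, AnalystReview(confirmed_match=True, relevant=True)))   # True
```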
Data Minimization and Guardrails
- We have no biometric gallery.
- We do not transmit, access, or retain biometric identifiers, or biometric templates.
- No one — not Biometrica, and not law enforcement — has live access to any sensor.
- All events are logged for audit purposes (see the logging sketch after this list).
- Relevance-based alerts only.
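The logging sketch below illustrates one way an append-only, per-event audit record could be written without ever including a biometric template or identifier. The event types, field names, and JSON-lines format are assumptions for illustration, not a description of Biometrica’s actual audit infrastructure.

```python
# Illustrative sketch: one JSON line appended per event, with no biometric
# template or identifier in the payload. Field names are hypothetical.
import io
import json
import uuid
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    event_id: str
    event_type: str        # e.g., "search_submitted", "analyst_review", "alert_issued"
    actor: str             # which system component or analyst role acted
    relevant: bool         # whether the event cleared the relevance check
    timestamp: str


def log_event(event: AuditEvent, sink) -> None:
    """Append one JSON line per event; the log is written to, never edited."""
    sink.write(json.dumps(asdict(event)) + "\n")


if __name__ == "__main__":
    sink = io.StringIO()   # stands in for an append-only log store
    log_event(AuditEvent(
        event_id=str(uuid.uuid4()),
        event_type="alert_issued",
        actor="analyst_review_queue",
        relevant=True,
        timestamp=datetime.now(timezone.utc).isoformat(),
    ), sink)
    print(sink.getvalue().strip())
```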
Why Acknowledging the Past Should Not Impede Progress
The legitimate criticism leveled against early FRT systems changed the trajectory of this technology. However, it is a mistake to assume that those challenges persist at the same scale today. Algorithms have improved, datasets have become global, and safeguards — including mandatory human review and verification — have matured.
Biometrica believes that dismissing responsible, privacy-protective solutions out of a fear of outdated shortcomings denies communities (especially communities that are economically disadvantaged and historically underrepresented in access to affordable technology solutions), and the law enforcement agencies in those communities and beyond, access to powerful tools for finding missing persons, preventing violence, protecting vulnerable populations, balancing out human bias, stopping human trafficking, and ensuring accountability.
Our system does not enable mass surveillance. It enables focused, lawful, and rights-respecting alerts — nothing more, nothing less.
How Biometrica Built It Better
- Implemented privacy by design in the systems architecture.
- Ensured data provenance at scale.
- Followed data minimization principles.
- Put in human-in-the-loop verification by design and oversight.
- Focused on relevance and context post-verification: we do not want to help paint a scarlet letter on someone for past transgressions that have no relevance to their present or to anyone else’s.
- Purpose-built for public safety, not for mass surveillance; our systems and solutions are in fact built to prevent mass surveillance.
- Compliance with all U.S. data privacy and biometric privacy laws, GDPR, DPA, the AI Act, Quebec Law 25, PIPEDA, and other frameworks.
- Continuous training of personnel on responsible and fair usage.
- Continuous analysis of legal, judicial, and regulatory frameworks, and adaptation to those frameworks.
- Transparent and auditable solutions, built for community stakeholders to participate in the decision-making process.
- Implemented a comprehensive Recommended Use of FRT policy for the law enforcement agencies that use our data and solutions.