As artificial intelligence spreads into more areas of public and private life, one thing has become abundantly clear: It can be just as biased as we are.
AI systems have been shown to be less accurate at identifying the faces of dark-skinned women, to give women lower credit-card limits than their husbands, and to be more likely to wrongly flag Black defendants than white ones as future criminals. Racial and gender bias has been found in job-search ads, in software for predicting health risks and in searches for images of CEOs.
How could this be? How could software designed to take the bias out of decision making, to be as objective as possible, produce these kinds of outcomes? After all, the purpose of artificial intelligence is to take millions of pieces of data and from them make predictions that are as error-free as possible.
But as AI has become more pervasive, with companies and government agencies using it to decide who gets loans, who needs more health care, how to deploy police officers and more, investigators have discovered that focusing only on making the final predictions as error-free as possible can mean those errors aren’t distributed equally. Instead, the predictions can often reflect and exaggerate the effects of past discrimination and prejudice.
In other words, the more AI focused on getting only the big picture right, the more prone it was to being less accurate for certain segments of the population, in particular women and minorities. And the impact of that bias can be devastating for those groups, for instance when creditworthy women are denied loans far more often than creditworthy men.
In response, the AI world is making a strong push to root out this bias. Academic researchers have devised techniques for identifying when AI makes unfair judgments, and the designers of AI systems are trying to improve their models to deliver more-equitable results. Large tech companies have introduced tools to identify and remove bias as part of their AI offerings.
As the tech industry tries to make AI fairer, though, it faces some significant obstacles. For one, there is little agreement about what exactly “fairness” looks like. Do we want an algorithm that makes loans without regard to race or gender? Or one that approves loans at equal rates for men and women, or for whites and Blacks? Or one that takes some different approach to fairness?
What’s more, making AI fairer can sometimes make it less accurate. Skeptics might argue that this means the predictions, however biased, are the correct ones. But in fact, the algorithm is already making wrong decisions about disadvantaged groups. Reducing those errors—and the unfair bias—can mean accepting a certain loss of overall statistical accuracy. So the argument ends up being a question of balance.
In AI as in the rest of life, less-biased results for one group might look less fair for another.
“Algorithmic fairness just raises a lot of these really fundamental thorny justice and fairness questions that as a society we haven’t really quite figured out how to think about,” says Alice Xiang, head of fairness, transparency and accountability research at the Partnership on AI, a nonprofit that researches and advances responsible uses of AI.
Here’s a closer look at the work being done to reduce bias in AI—and why it’s so hard.
Identifying bias
Before bias can be rooted out of AI algorithms, it first has to be found. International Business Machines Corp.’s AI Fairness 360 and the What-If Tool from Alphabet Inc.’s Google are some of the many open-source packages that companies, researchers and the public can use to audit their models for biased results.
One of the newest offerings is the LinkedIn Fairness Toolkit, or LiFT, introduced in August by Microsoft Corp.’s professional social-network unit. The software tests for biases in the data used to train the AI, the model and its performance once deployed.
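To make the idea concrete, here is a minimal Python sketch of the kind of check these toolkits automate: comparing a model’s approval rates across groups and taking the ratio between them. The data, column names and the 80% rule of thumb mentioned in the comments are illustrative assumptions, not any vendor’s actual API.

```python
# Illustrative fairness audit, in the spirit of what toolkits like AI Fairness 360
# or LiFT automate. All data and column names here are hypothetical.
import pandas as pd

def selection_rates(df, prediction_col, group_col):
    """Share of positive predictions (e.g., loan approvals) per group."""
    return df.groupby(group_col)[prediction_col].mean()

def disparate_impact(df, prediction_col, group_col, privileged, unprivileged):
    """Ratio of the unprivileged group's selection rate to the privileged
    group's. A common rule of thumb flags ratios below 0.8."""
    rates = selection_rates(df, prediction_col, group_col)
    return rates[unprivileged] / rates[privileged]

# Hypothetical audit of a lending model's decisions
decisions = pd.DataFrame({
    "approved": [1, 0, 1, 1, 0, 1, 0, 0],
    "gender":   ["m", "f", "m", "m", "f", "f", "f", "m"],
})
print(selection_rates(decisions, "approved", "gender"))
print(disparate_impact(decisions, "approved", "gender",
                       privileged="m", unprivileged="f"))
```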
Fixing the data
Once bias is identified, the next step is removing or reducing it. And the place to start is the data used to develop and train the AI model. “This is the biggest culprit,” says James Manyika, a McKinsey senior partner and chairman of the McKinsey Global Institute.
There are several ways that problems with data can introduce bias. Certain groups can be underrepresented, so that predictions for those groups are less accurate. For instance, for a facial-recognition system to identify a “face,” it needs to be trained on a lot of photos to learn what to look for. If the training data contains mostly faces of white men and few Black people, a Black woman re-entering the country might not get an accurate match in the passport database, or a Black man could be incorrectly matched with photos in a criminal database. A system designed to detect the faces of pedestrians for an autonomous vehicle might not even “see” a dark-skinned face at all.
Gender Shades, a 2018 study of three commercial facial-recognition systems, found they were much more likely to fail to recognize the faces of darker-skinned women than lighter-skinned men. IBM’s Watson Visual Recognition performed the worst, with a nearly 35% error rate for dark-skinned women compared with less than 1% for light-skinned males. One reason was that databases used to test the accuracy of facial-recognition systems were unrepresentative; one common benchmark contained more than 77% male faces and nearly 84% white ones, according to the study, conducted by Joy Buolamwini, a researcher at the MIT Media Lab, and Timnit Gebru, currently a senior research scientist at Google.
Most researchers agree that the best way to tackle this problem is with bigger and more representative training sets. Apple Inc., for instance, was able to develop a more accurate facial-recognition system for its Face ID, used to unlock iPhones, in part by training it on a data set of more than two billion faces, a spokeswoman says.
Shortly after the Gender Shades paper, IBM released an updated version of its visual-recognition system using broader data sets for training and an improved ability to recognize images. The updated system reduced error rates by 50%, although it was still far less accurate for darker-skinned females than for light-skinned males.
Since then, several large tech companies have decided facial recognition carries too many risks to support—no matter how low the error rate. IBM in June said it no longer intends to offer general-purpose facial-recognition software. The company was concerned about the technology’s use by governments and police for mass surveillance and racial profiling.
“Even if there was less bias, [the technology] has ramifications, it has an impact on somebody’s life,” says Ruchir Puri, chief scientist of IBM Research. “For us, that is more important than saying the technology is 95% accurate.”
Reworking algorithms
When the training data isn’t accessible or can’t be changed, other techniques can be used to change machine-learning algorithms so that results are fairer.
One way that bias enters into AI models is that in their quest for accuracy, the models can base their results on factors that can effectively serve as proxies for race or gender even if they aren’t explicitly labeled in the training data.
It’s well known, for instance, that in credit scoring, ZIP Codes can serve as a proxy for race. AI, which draws on millions of correlations to make its predictions, can base its decisions on all sorts of hidden relationships in the data. If a correlation that effectively encodes race improves predictive accuracy by even 0.1%, the model will use that inferred race in its risk predictions, and the result won’t truly be “race blind.”
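One rough way practitioners probe for such proxies, sketched below with hypothetical data and column names, is to test whether a simple model can predict the protected attribute from the supposedly neutral features; if it can, those features can leak race or gender into the credit model even when the attribute itself is withheld.

```python
# Hypothetical proxy check: if ZIP code and other "neutral" features predict a
# protected attribute well above chance, they can act as proxies for it.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

applicants = pd.read_csv("applicants.csv")   # hypothetical data set
features = pd.get_dummies(
    applicants[["zip_code", "income", "employment_years"]], columns=["zip_code"])
protected = applicants["race"]               # never given to the credit model

# High cross-validated accuracy means the features encode race anyway.
proxy_score = cross_val_score(
    LogisticRegression(max_iter=1000), features, protected, cv=5).mean()
print(f"Protected attribute recoverable from features: {proxy_score:.2f}")
```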
Because past discriminatory lending practices often unfairly denied loans to creditworthy minority and women borrowers, some lenders are turning to AI to help them to expand loans to those groups without significantly increasing default risk. But first the effects of the past bias have to be stripped from the algorithms.
IBM’s Watson OpenScale, a tool for managing AI systems, uses a variety of techniques for lenders and others to correct their models so they don’t produce biased results.
One of the early users of OpenScale was a lender that wanted to make sure its credit-risk model didn’t unfairly deny loans to women. The model was trained on 50 years of historical lending data, which, reflecting past biases, meant that women were more likely than men to be considered credit risks even though they weren’t.
Using a technique called counterfactual modeling, the bank could flip the gender associated with possibly biased variables from “female” to “male” and leave all the others unchanged. If that changed the prediction from “risk” to “no risk,” the bank could adjust the importance of the variables or simply ignore them to make an unbiased loan decision. In other words, the bank could change how the model viewed that biased data, much like glasses can correct nearsightedness.
If flipping gender doesn’t change the prediction, the variable—insufficient income, perhaps—is probably a fair measure of loan risk, even though it might also reflect more deeply entrenched societal biases, like lower pay for women.
“You’re debiasing the model by changing its perspective on the data,” says Seth Dobrin, vice president of data and AI and chief data officer of IBM Cloud and Cognitive Software. “We’re not fixing the underlying data. We’re tuning the model.”
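In rough terms, a counterfactual test of this kind might look like the Python sketch below. The model, column names and labels are hypothetical; this is an illustration of the flip-and-compare idea, not OpenScale’s code.

```python
# Minimal counterfactual check: flip one variable, hold everything else fixed,
# and see whether the model's decision changes. Assumes a trained model (e.g.,
# a scikit-learn pipeline) that accepts a one-row pandas DataFrame of features.
import pandas as pd

def counterfactual_flip(model, applicant, col="gender", a="female", b="male"):
    original = model.predict(applicant)[0]
    flipped = applicant.copy()
    flipped[col] = flipped[col].replace({a: b, b: a})
    counterfactual = model.predict(flipped)[0]
    # A changed decision flags the variable as a possible source of bias.
    return original, counterfactual, original != counterfactual

# Hypothetical usage:
# orig, alt, suspect = counterfactual_flip(credit_model, applicant_row)
# if suspect:
#     ...down-weight or ignore the variable when making the final decision
```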
Zest AI, a Los Angeles-based company that provides AI software for lenders, uses a technique called adversarial debiasing to mitigate biases from its credit models. It pits a model trained on historical loan data against an algorithm trained to look for bias, forcing the original model to reduce or adjust the factors that lead to biased results.
For example, people with shorter credit histories are statistically more likely to default, but credit history can often be a proxy for race—unfairly reflecting the difficulties Blacks and Hispanics have historically faced in getting loans. So, without a long credit history, people of color are more likely to be denied loans, whether they’re likely to repay or not.
The standard approach for such a factor might be to remove it from the calculation, but that can significantly hurt the accuracy of the prediction.
Zest’s fairness model doesn’t eliminate credit history as a factor; instead it will automatically reduce its significance in the credit model, offsetting it with the hundreds of other credit factors.
The result is a lending model that has two goals—to make its best prediction of credit risk but with the restriction that the outcome is fairer across racial groups. “It’s moving from a single objective to a dual objective,” says Sean Kamkar, Zest’s head of data science.
Some accuracy is sacrificed in the process. In one test, an auto lender saw a 4% increase in loan approvals for Black borrowers, while the model showed a 0.2% decline in accuracy at predicting borrowers’ likelihood of repaying. “It’s staggering how cheap that trade-off is,” Mr. Kamkar says.
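A stripped-down version of this dual-objective idea, written here as a hypothetical PyTorch sketch rather than Zest’s actual code, pits a credit-risk predictor against an adversary that tries to recover a protected attribute from the predictor’s scores; the predictor is penalized whenever the adversary succeeds.

```python
# Sketch of adversarial debiasing. Layer sizes, labels and the fairness weight
# are illustrative assumptions.
import torch
import torch.nn as nn

predictor = nn.Sequential(nn.Linear(200, 64), nn.ReLU(), nn.Linear(64, 1))
adversary = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

opt_p = torch.optim.Adam(predictor.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
fairness_weight = 0.5   # dials how much accuracy to trade for fairness

def train_step(features, default_label, protected_label):
    score = predictor(features)

    # 1) The adversary learns to guess the protected attribute from the score.
    opt_a.zero_grad()
    adv_loss = bce(adversary(score.detach()), protected_label)
    adv_loss.backward()
    opt_a.step()

    # 2) The predictor learns to predict defaults *and* to fool the adversary.
    opt_p.zero_grad()
    task_loss = bce(score, default_label)
    leak_loss = bce(adversary(score), protected_label)
    (task_loss - fairness_weight * leak_loss).backward()
    opt_p.step()
    return task_loss.item(), leak_loss.item()
```

The single fairness weight is where the balance described above gets set: raise it and the model gives up a little predictive accuracy in exchange for scores the adversary can no longer decode.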
Over time, AI experts say, the models will become more accurate without the adjustments, as data from new successful loans to women and minorities get incorporated in future algorithms.
Adjusting results
When the data or the model can’t be fixed, there are ways to make predictions less biased.
LinkedIn’s Recruiter tool is used by hiring managers to identify potential job candidates by scouring millions of LinkedIn profiles. The results of a search are scored and ranked based on sought-after qualifications such as experience and location, among other factors.
But the rankings can reflect longstanding racial and gender discrimination. Women are underrepresented in science, technical and engineering jobs, and as a result might show up far down in the rankings of a traditional candidate search, so that an HR manager might have to scroll through page after page of results before seeing the first qualified women candidates.
In 2018, LinkedIn revised the Recruiter tool to ensure that search results on each page reflect the gender mix of the entire pool of qualified candidates, so that women aren’t penalized for low representation in a field. For example, a recent LinkedIn search for a senior AI software engineer turned up more than 500 candidates across the U.S. Because 15% of them were women, four women appeared on the first page of 25 results.
“Seeing women appear in the first few pages can be crucial to hiring female talent,” says Megan Crane, the LinkedIn technical recruiter performing the search. “If they were a few pages back without this AI to bring them to the top, you might not see them or might not see as many of them.”
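The page-level reshuffling can be sketched roughly as follows. The function and candidate records are hypothetical, not LinkedIn’s LiFT code, but they show how a page can keep its relevance ordering within each group while mirroring the gender mix of the full pool.

```python
# Illustrative re-ranking: fill each page so that group proportions track the
# overall candidate pool, picking whichever group is furthest below its target.
from collections import deque

def rerank_page(candidates, page_size=25, group_key="gender"):
    """candidates: list of dicts sorted by relevance, best first."""
    n = len(candidates)
    target = {}                                   # each group's share of the pool
    for c in candidates:
        target[c[group_key]] = target.get(c[group_key], 0) + 1 / n

    queues = {}                                   # per-group queues, relevance kept
    for c in candidates:
        queues.setdefault(c[group_key], deque()).append(c)

    page, counts = [], {g: 0 for g in queues}
    while len(page) < page_size and any(queues.values()):
        # pick the group currently furthest below its target share on this page
        group = min((g for g in queues if queues[g]),
                    key=lambda g: counts[g] - target[g] * (len(page) + 1))
        page.append(queues[group].popleft())
        counts[group] += 1
    return page
```

With a 500-candidate pool that is 15% women, a 25-result page built this way surfaces roughly four women, as in the example above.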
Other tools give users the ability to arrange the output of AI models to suit their own needs.
Pinterest Inc.’s search engine is widely used by people hunting for style and beauty ideas, but until recently users complained that it was frequently difficult to find beauty ideas for specific skin colors. A search for “eye shadow” might require adding other keywords, such as “dark skin,” to see images that didn’t depict only white faces. “People shouldn’t have to work extra hard by adding additional search terms to feel represented,” says Nadia Fawaz, Pinterest’s technical lead for inclusive AI.
Improving the search results required labeling a more diverse set of image data and training the model to distinguish skin tones in the images. The software engineers then added a search feature that lets users refine their results by skin tones ranging from light beige to dark brown.
When searchers select one of 16 skin tones in four different palettes, results are updated to show only faces within the desired range.
Since an improved version was released this summer, Pinterest says, the model is three times as likely to correctly identify multiple skin tones in search results.
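In outline, the refinement step works like a filter over results that already carry a predicted skin-tone class from the trained model. The palette groupings and field names below are illustrative assumptions, not Pinterest’s code.

```python
# Hypothetical refinement: keep only results whose detected skin-tone class
# falls in the palette the user selected (16 classes grouped into 4 palettes).
PALETTES = {
    "light":       {1, 2, 3, 4},
    "medium":      {5, 6, 7, 8},
    "medium-deep": {9, 10, 11, 12},
    "deep":        {13, 14, 15, 16},
}

def refine_by_skin_tone(results, selected_palette):
    """results: list of dicts with a 'skin_tone_class' field (1-16)."""
    allowed = PALETTES[selected_palette]
    return [r for r in results if r.get("skin_tone_class") in allowed]

# refine_by_skin_tone(search_results, "deep") keeps only images whose detected
# faces fall in the darkest quarter of the range.
```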
Struggling with pervasive issues
Despite the progress, some problems of AI bias resist technological fixes.
For instance, just as groups can be underrepresented in training data, they can also be overrepresented. This, critics say, is a problem with many criminal-justice AI systems, such as “predictive policing” programs used to anticipate where criminal activity might occur and prevent crime by deploying police resources to patrol those areas.
Blacks are frequently overrepresented in the arrest data used in these programs, the critics say, because of discriminatory policing practices. Because Blacks are more likely to be arrested than whites, that can reinforce existing biases in law enforcement by increasing patrols in predominantly Black neighborhoods, leading to more arrests and runaway feedback loops.
“If your data contains that sort of human bias already, we shouldn’t expect an algorithm to somehow magically eradicate that bias in the models that it builds,” says Michael Kearns, a professor of computer and information science at the University of Pennsylvania and the co-author of “The Ethical Algorithm.”
(It may be possible to rely on different data. PredPol Inc., a Santa Cruz, Calif., maker of predictive-policing systems, bases its risk assessments on reports by crime victims, not on arrests or on crimes like drug busts or gang activity. Arrests, says Brian MacDonald, PredPol’s chief executive, are poor predictors of actual criminal activity, and with them “there’s always the possibility of bias, whether conscious or subconscious.”)
Then there’s the lack of agreement about what is fair and unbiased. To many, fairness means ignoring race or gender and treating everyone the same. Others argue that extra protections—such as affirmative action in university admissions—are needed to overcome centuries of systemic racism and sexism.
In the AI world, scientists have identified many different ways to define and measure fairness, and AI can’t be made fair on all of them. For example, a “group unaware” model would satisfy those who believe it should be blind to race or gender, while an equal-opportunity model might require taking those characteristics into account to produce a fair outcome. Some proposed changes could be legally questionable.
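A small worked example with made-up numbers shows why the definitions collide: the same set of decisions can violate demographic parity (equal approval rates) and equal opportunity (equal approval rates among qualified applicants) in opposite directions, so equalizing one measure generally unbalances the other.

```python
# Two fairness metrics computed on the same hypothetical decisions.
import pandas as pd

results = pd.DataFrame({
    "group":     ["a"] * 6 + ["b"] * 6,
    "qualified": [1, 1, 1, 1, 0, 0,  1, 1, 0, 0, 0, 0],
    "approved":  [1, 1, 1, 0, 0, 0,  1, 1, 0, 0, 0, 0],
})

# "Group unaware" demographic parity: compare raw approval rates.
approval_rates = results.groupby("group")["approved"].mean()

# Equal opportunity: compare approval rates among the qualified only.
qualified = results[results["qualified"] == 1]
true_positive_rates = qualified.groupby("group")["approved"].mean()

print(approval_rates)        # a: 0.50, b: 0.33 -> parity says b is disadvantaged
print(true_positive_rates)   # a: 0.75, b: 1.00 -> opportunity says a is disadvantaged
```

Here, equalizing the raw approval rates would mean approving unqualified applicants in one group or turning away qualified ones in the other, which is exactly the kind of trade-off designers and regulators have to weigh.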
Meanwhile, some people question how much we should be relying on AI to make critical decisions in the first place. In fact, many of the fixes require having a human in the loop to ultimately make a decision about what’s fair. Pinterest, for instance, relied on a diverse group of designers to evaluate the performance of its skin-tone search tool.
Many technologists remain optimistic that AI could be less biased than its makers. They say that AI, if done correctly, can replace the racist judge or sexist hiring manager, treat everyone equitably and make decisions that don’t unfairly discriminate.
“Even when AI is impacting our civil liberties, the fact is that it’s actually better than people,” says Ayanna Howard, a roboticist and chair of the School of Interactive Computing at the Georgia Institute of Technology.
AI “can even get better and better and better and be less biased,” she says. “But we also have to make sure that we have our freedom to also question its output if we do think it’s wrong.”
Mr. Totty is a writer in San Francisco. He can be reached at reports@wsj.com.