A couple of months ago, there were reports on Amazon’s failed attempt to build an AI engine that would scan resumes and recommend candidates. The problem? The engine “taught itself that male candidates were preferable.” Reuters noted of Amazon’s engine:

It penalized resumes that included the word “women’s,” as in “women’s chess club captain.” And it downgraded graduates of two all-women’s colleges, according to people familiar with the matter.

Was this a case of a bad algorithm? Well, no. It turns out the issue was with the data that had been used to train the system:

The idea was for this AI-powered system to be able to look at a collection of resumes and name the top candidates. To achieve this, Amazon fed the system a decade’s worth of resumes from people applying for jobs at Amazon.

The tech industry is famously male-dominated and, accordingly, most of those resumes came from men. So, trained on that selection of information, the recruitment system began to favor men over women. (Source: Fortune)
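To see how skewed training data produces this effect, here is a deliberately tiny, hypothetical sketch. The resumes, terms and scoring rule are all invented for illustration (Amazon's actual system was far more sophisticated), but the mechanism is the same: terms that are rare in the historical sample get low weight, regardless of whether they have anything to do with job performance.

```python
from collections import Counter

# Hypothetical historical hiring data: "successful" resumes, overwhelmingly
# from one group, so some terms are rare purely because of the sample's skew.
hired = [
    "software engineer chess club captain",
    "software engineer robotics team lead",
    "software engineer chess club member",
    "software engineer women's chess club captain",  # the only such resume
]

# "Train" by counting how often each term appears among past hires.
weights = Counter(term for resume in hired for term in resume.split())

def score(resume):
    # Average learned weight per term: rare terms drag the score down.
    terms = resume.split()
    return sum(weights[term] for term in terms) / len(terms)

# The term "women's" is rare in the skewed sample, so an otherwise
# identical resume containing it scores lower.
print(score("software engineer chess club captain"))
print(score("software engineer women's chess club captain"))
```

Nothing in the scoring rule mentions gender; the bias comes entirely from the composition of the training data.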

When thinking about AI, it’s not uncommon for vendors, customers and the public to focus on the algorithms, but the fact is, data often plays a more important role. An algorithm can only be tuned so far; after a certain point, the returns diminish. Beyond the algorithms, one must curate, understand and classify the data being fed into the system. And that presents a number of challenges.

First question: whose data needs to be fed into an AI engine? As a customer, are you expected to feed some data into the engine? What data do you control, and what do your suppliers control? What expectations does your AI vendor, or security vendor with an AI engine, set for you? How are you expected to classify or label your data? What is the impact on the AI engine if you cannot provide the requisite amount of data, or if the data is not normalized or well classified?

Sometimes it’s a challenge to find the requisite data at all. If the data is incomplete (perhaps you can only provide a subset of the recommended data), the results are less certain. Data may be in multiple locations, and it may take time for you, your vendors or your partners to find everything. The data may contain sensitive information that needs to be sanitized before it is fed into an engine (perhaps account numbers or certain types of PII need to be scrubbed to avoid regulatory consequences). And, as we saw in the Amazon example, the data itself can be biased, leading to incorrect decisions or biased learning.
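As a hypothetical illustration of that sanitization step, a simple pattern-based scrubber might look like the sketch below. The patterns and redaction labels are assumptions for the example, not a complete PII solution; real sanitization pipelines are considerably more involved.

```python
import re

# Hypothetical sanitization pass: scrub SSN-like patterns and account
# numbers from free-text records before they are used for training.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "ACCOUNT": re.compile(r"\b\d{10,16}\b"),  # assumed account-number format
}

def sanitize(text):
    # Replace each matched pattern with a labeled placeholder.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

record = "Customer 555-12-3456 moved funds from account 1234567890."
print(sanitize(record))
# Customer [SSN REDACTED] moved funds from account [ACCOUNT REDACTED].
```

Scrubbing before ingestion also means the AI vendor never holds the sensitive values at all, which simplifies the regulatory picture.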

Once you have the data, it’s likely you or your vendor will need to do work to make it usable. Data often needs to be classified and labeled so that systems understand what they are looking at and how to analyze it. There is an entire discipline in Artificial Intelligence called feature engineering, which is about understanding how to cut the data so that the features the algorithms analyze are meaningful and relevant. Feature engineering is a less appreciated but critical element of AI; just like algorithm development, it is difficult and requires specific expertise.
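As a toy sketch of what feature engineering can look like in practice, consider turning a raw security-relevant record (here, a hypothetical transaction; the field names and feature choices are invented for the example) into features an algorithm can actually learn from:

```python
from datetime import datetime

# Hypothetical feature engineering: a raw timestamp or amount is nearly
# useless to most algorithms; derived features capture usable patterns.
def engineer_features(txn):
    ts = datetime.fromisoformat(txn["timestamp"])
    return {
        # Hour-of-day and weekday reveal behavioral patterns
        # that a raw timestamp string hides.
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,
        # Rough order of magnitude, so a few huge amounts don't dominate.
        "amount_digits": len(str(int(txn["amount"]))),
        "is_foreign": txn["country"] != txn["home_country"],
    }

txn = {"timestamp": "2019-03-02T23:45:00", "amount": 2500.0,
       "country": "DE", "home_country": "US"}
print(engineer_features(txn))
```

Deciding *which* cuts to make (hour-of-day versus raw timestamp, order of magnitude versus exact amount) is exactly the judgment call that requires domain expertise.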

Beyond incomplete, biased or poorly classified data, data poisoning can also doom your AI implementation. While the Amazon example focused on the poor results an algorithm can produce if the organization is unable to provide a full and unbiased data sample, the intent was not malicious. Data poisoning occurs when attackers intentionally change the data being fed into an AI engine in order to manipulate its response. A classic example is Tay, an AI-driven chatbot introduced by Microsoft on Twitter in early 2016. The idea was that Tay would learn and interact with users based upon the messages shared with it; users quickly inundated Tay with racist and misogynist messages, so it “learned” to tweet out similarly hateful messages. Microsoft removed Tay from Twitter within a day. AI, like many technologies, can be used to attack as well as to protect, and manipulating the data is a popular attack approach.
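To make the poisoning mechanism concrete, here is a deliberately simple, hypothetical sketch: a toy classifier that learns word-label associations from user feedback, and what happens when an attacker floods it with mislabeled examples. All data and logic are invented for illustration; Tay's actual learning system was of course far more complex.

```python
from collections import Counter, defaultdict

# Toy model: learn, for each word, how often users labeled it each way.
def train(samples):
    votes = defaultdict(Counter)
    for text, label in samples:
        for word in text.split():
            votes[word][label] += 1
    return votes

# Classify by tallying the learned label votes for each word.
def classify(votes, text):
    tally = Counter()
    for word in text.split():
        tally.update(votes[word])
    return tally.most_common(1)[0][0]

clean = [("have a nice day", "ok"), ("you are awful", "bad")]
print(classify(train(clean), "nice day"))  # "ok"

# An attacker floods the feedback channel with mislabeled examples,
# and the same input now gets the opposite label.
poisoned = clean + [("nice day", "bad")] * 10
print(classify(train(poisoned), "nice day"))  # "bad"
```

The attacker never touched the algorithm; controlling a fraction of the training input was enough to flip the output, which is why data provenance matters as much as model design.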

So what does all this mean for you? When considering AI engines and vendors, we can’t ignore the data questions. In Part I, we provided some questions to ask about the type of machine learning, expertise required and supervision in place. Here are some questions to ask about the data:

  • What data do you provide?
  • What data do I need to provide? What is the impact to the system if I cannot provide certain data?
  • What classification or labeling is needed before the data is usable? Who is expected to do that?
  • How do I know if the data I have will introduce bias into the system?

Asking about data requirements and implications is something we all can do. In Part I of the blog series, we discussed how supervision might represent a hidden cost to you. Finding and classifying the required data is another potential cost, so it’s wise to understand your degree of responsibility, and the impact on the machine learning engine if you struggle to find and clean the data.

Up next: AI systems are not just about coding. We’ll discuss why AI systems require deep expertise from data scientists and subject matter experts, and how to find it.

Sandra Carielli
Sandy Carielli has spent over a dozen years in the cyber security industry, with particular focus on identity, PKI, key management, cryptography and security management. As Director of Security Technologies for Entrust Datacard, Sandy guides the organization’s next generation security and technology strategy. Prior to Entrust Datacard, Sandy was Director of Product Management at RSA, where she was responsible for SecurID and data protection. She has also held positions at @stake and BBN. Sandy has been a speaker at RSA Conference, SOURCE Boston, the NYSE Cyber Risk Board Forum and BSides Boston. She has a Sc.B. in Mathematics from Brown University and an M.B.A. from the MIT Sloan School of Management.