Plunging Head-on Into Language AI

Danger!

We all talk about the immensely increased complexity we now face as a result of increased network density, micro-services, layered segmentation, shadow IT proliferation, misconfigurations, hybrid-cloud instances, Docker and Kubernetes, the absence of sufficient security by design in development, poor hygiene, over-worked analysts, deep tool stacks and understaffed teams.

And the upcoming challenges of 5G speed.

The Road that Led Us Here

We don’t talk about how we got here.

It wasn’t just the combination of the COVID-19 pandemic and the push for digitization. The complexity trend was well under way three years ago, when threat actors began to organize into attack teams and discovered new paths to exploitable vulnerabilities, aided by cybersecurity companies that boasted publicly about their vulnerability discoveries. Our response was to find new point solutions, then hurriedly install them and learn how to operate them, all while trying to hold off attackers with the other hand.

We paid lip service to the preparedness and skills required and soldiered on, driven by a host of good reasons.

Now the chickens have come home to roost: first in December with the SolarWinds and Accellion breaches, followed by the Microsoft reveal and the 100 other technology vendors who admitted infections, and now the critical infrastructure attack on Colonial Pipeline. How did the threat actors know that Colonial was pivotal in the supply of fuel to East Coast retailers? Or was this a NotPetya in miniature?

Or, something else entirely.

Rushing into the AI Abyss

While all of this is happening and should be schooling us about the perils of unfettered technological progress, we are instead rushing into the AI abyss as though our lives depend on it.

Which they probably do.

The Magic of Machine Learning

On May 18, Google CEO Sundar Pichai announced an amazing new tool and a breakthrough in language AI: a system called LaMDA that can learn how to respond to any question or comment in any conversation, about any subject, in any context, and do it so accurately and with such nuance that humans will never be able to tell whether the responses are coming from a machine or from another human.

LaMDA will be integrated with Google’s main search portal, its voice assistant and Workspace, its collection of cloud-based work software that includes Gmail, Docs and Drive. But the eventual goal, said Pichai, is to create a conversational interface that allows people to retrieve any kind of information (text, visual, audio) across all of Google’s products, without discrimination as to what is real and what is fake.

Google says LaMDA stands for “Language Model for Dialogue Applications,” but the MDA may as well stand for Model Driven Architecture, Missile Defense Act, Mobile Digital Assistant, Machine Data Acquisition, or Methylenedioxyamphetamine. If it’s Ecstasy, it sure won’t be for any of us out here.

Why?

Because large language models (LLMs) are built from deep-learning algorithms that train on enormous amounts of unfiltered text data. By themselves. Unfiltered and unsupervised. It is the magic of machine learning.
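
To make that concrete, here is a minimal sketch, in PyTorch, of the self-supervised recipe these models follow. The toy model and training loop below are illustrative assumptions, not Google’s or OpenAI’s actual code, and real systems use transformer architectures at vastly larger scale; the point is simply that the “labels” come from the raw text itself, so nothing in the loop ever asks whether that text should have been learned from.

```python
# Minimal sketch of self-supervised next-token training on raw text.
# Illustrative only: a toy GRU model, not a production transformer.
import torch
import torch.nn as nn

# "Training data" is just scraped text; note there is no filtering,
# labeling or review step anywhere in this pipeline.
raw_text = "scraped web text goes here completely unfiltered and unreviewed"
words = raw_text.split()
vocab = sorted(set(words))
stoi = {w: i for i, w in enumerate(vocab)}
tokens = torch.tensor([stoi[w] for w in words])

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))
        return self.head(h)

model = TinyLM(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Self-supervision: inputs are tokens[:-1], targets are tokens[1:].
# Whatever patterns the text contains are exactly what the model absorbs.
for step in range(200):
    logits = model(tokens[:-1].unsqueeze(0))       # (1, T-1, vocab)
    loss = loss_fn(logits.squeeze(0), tokens[1:])  # next-token prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```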

Dark Magic

Studies have shown that a myriad of racist, sexist and abusive associations are embedded in these models. They come not just from the historical data one would expect to be prejudiced, but from contemporary conversational streams that lead these algorithms to conclude that doctors are men, nurses are women, firefighters are brave and Hillary Clinton is an alien force from an evil planet.

Probed with certain prompts, they will also begin to encourage genocide, self-harm, tribal fear, re-imagining and child sexual abuse. Their insanely great fluency will enable the mass production of misinformation. Not that we need any more help with that after what we saw in the 2016 election cycle.

Confirmation Bias? Confirmed.

These models are so fluent and so ripe for confirmation bias that two of Google’s AI ethics leads, Timnit Gebru and Margaret Mitchell, were summarily dismissed after publicizing their concerns about the biases embedded in these machines, in a breathtaking abdication of the company’s fabled mantra.

No, I think we stopped ‘doing no harm’ over a decade ago.

But it’s not just Google. We have OpenAI’s GPT-2 and GPT-3, which produce remarkably convincing passages of text not just in response to a query but simply on a topic (like Mayhem), and can even finish off music compositions and computer programming code. Microsoft now exclusively licenses GPT-3 to incorporate into some of its mysterious future products. Facebook uses its own LLMs for translation and content moderation. And there are a dozen startups creating products and services based on the tech giants’ models.

Very soon, all of our digital interactions, whether email, search or social media posts, will be filtered through LLMs.

Out Over Our Skis

On the flip side, very little research is being done to understand how the flaws of this technology could affect people in real-world applications, or to figure out how to design better LLMs that mitigate these challenges. Just as we got out over our skis with cloud computing, micro-services and segmentation, we are doing exactly the same thing in this domain.

Google made it clear in its treatment of Gebru and Mitchell that the few companies rich enough to train and maintain LLMs have a serious financial interest in declining to examine them carefully. In other words, LLMs are being integrated into the linguistic infrastructure of the internet on virtually no scientific foundations.

Which is fine in a capitalist economic system, and I applaud everyone who figures out how to make more out of their resources than a cursory glance at the surface would suggest. It’s how we all roll.

Nudging Things in the Right Direction

But some are fighting back.

More than 500 researchers around the world are now racing to learn more about the capabilities and limitations of these advanced linguistic models. A research project known as BigScience is taking an “open science” approach to understanding natural language processing (NLP), and its participants are building an open-source LLM that will serve as a shared resource for the scientific community. The goal is to generate as much scholarship as possible as quickly as possible. “We can’t really stop this craziness around large language models, where everybody wants to train them,” says Thomas Wolf, the chief science officer at Hugging Face, who is co-leading the initiative. “But what we can do is try to nudge this in a direction that is in the end more beneficial.”

One promising startup, Cohere, founded by former Google researchers, promises to bring LLMs to any business that wants one with a single line of code. Among its early clients is another startup called Ada Support, which provides a platform for building no-code customer-support chatbots and already counts Facebook and Zoom among its own customers. More impressively, Cohere’s investor list includes computer vision pioneer Fei-Fei Li, Turing Award winner Geoffrey Hinton and Apple’s head of AI, Ian Goodfellow.
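
For a sense of what that one-line pitch means in practice, here is a hypothetical sketch of the integration from the customer’s side. The endpoint URL, JSON fields and helper function are placeholders of my own, not Cohere’s documented API; the point is how thin the wrapper around someone else’s hosted model really is.

```python
# Hypothetical sketch of a "one line of code" hosted-LLM integration.
# The URL and field names are placeholders, not any vendor's real API.
import requests

API_URL = "https://api.example-llm-provider.com/v1/generate"  # placeholder
API_KEY = "YOUR_API_KEY"

def generate(prompt: str) -> str:
    """Send a prompt to the hosted model and return its generated text."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "max_tokens": 100},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["text"]  # response field assumed for illustration

# The "single line" a no-code chatbot platform might call on your behalf:
print(generate("Draft a polite reply to a customer asking for a refund."))
```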

So, there is hope.

Stochastic Parrots

But, as Gebru noted in the paper that got her fired, the one in which she refers to LLMs as “stochastic parrots,” if fake news, hate speech and even death threats aren’t moderated out, they get scraped as training data for the next generation of LLMs. And it is those models, parroting back what they were trained on, that end up regurgitating these toxic linguistic patterns across the internet.

Which advancing approaches to LLMs do you think will prevail?

Will it be the open science that holds out hope of producing enough scholarship to mitigate the blind absorption of misinformation, or will the technology that needs to race to market win in the end?
