Generative AI’s Role in Cybersecurity Data Analysis

Wade Baker

In this episode of Cybersecurity (Marketing) Unplugged, Wade also discusses:

  • The rapid adoption of generative AI and its expanded use cases through OpenAI’s GPT marketplace and plugins;
  • The effectiveness and limitations of generative AI in analyzing large data sets, including the necessity for human oversight;
  • The potential of generative AI in specific applications, such as quantitative risk analysis;
  • Future speculation on the impact of generative AI on data analysis over the next decade.

Cybersecurity data analysis can mean different things to different people. For CISOs, it could mean analyzing and auditing data feeds and logs for anomalies. For vendors, it could mean the same thing, but specific to their own products, allowing them to provide accurate, actionable information to their clients. And for marketers, it could mean analyzing data to produce research-oriented content meant for thought leadership, awareness, and ultimately lead generation purposes. One thing is certain: data analysis, in some way, shape, or form, is critical to all aspects of the cybersecurity ecosystem.

But where does Artificial Intelligence fit in?  Does it fit?  Can it be used effectively to assist with data analysis?

To help us unpack it all we’ve invited Wade Baker, Co-Founder of Cyentia Institute – a leading firm focused on cybersecurity data-driven research.  If any listeners are familiar with our CISO Engagement Driver studies, Cyentia was responsible for crunching the massive set of intent data that was analyzed to find all those juicy correlations.

Let's talk about competitive research analysis. You're in a space and you want to publish content, and you want to publish content that doesn't just mimic what your competitors say, and tread over the same data points. At the same time, you need to know what their data points and main findings and other things are. I think this is an excellent use case.

Full Transcript

This episode has been automatically transcribed by AI. Please excuse any typos or grammatical errors.

Mike D’Agostino: [00:26]

Welcome, everybody, to another episode of Cybersecurity Marketing Unplugged. I’m your host, Mike D’Agostino. Cybersecurity data analysis can mean different things to different people. For CISOs, it could mean analyzing and auditing data feeds and logs for anomalies. For vendors, it could mean the same thing, but specific to their own products, allowing them to provide accurate, actionable information to their clients. And for marketers, most of our listeners, it could mean analyzing data to produce research-oriented content meant for thought leadership, awareness, and ultimately lead generation purposes. One thing is certain, though: data analysis, in some way, shape or form, is critical to all aspects of the cybersecurity ecosystem. But where does artificial intelligence fit in? Does it fit? Can it be used effectively to assist with data analysis? To help us unpack it all, we’ve invited Wade Baker, co-founder of Cyentia Institute, a leading firm focused on cybersecurity data-driven research. If any listeners are familiar with our CISO Engagement Driver studies, Cyentia was responsible for crunching the massive set of intent data that was analyzed to find all those juicy correlations. Welcome to the show, Wade.

Wade Baker: [01:48]

Hey, thanks, Mike. I appreciate the invite, and it’s a very interesting topic for sure.

Mike D’Agostino: [01:53]

Yeah, no doubt. You were the first who came to mind when we thought of this. I didn’t do justice at all introducing you and Cyentia to our listeners. Why don’t you take a minute or two just to expand on yourself and your background, and exactly what Cyentia Institute focuses on?

Wade Baker: [02:10]

I appreciate that. Cyentia Institute, like you said, does data science and research. Everything we do is focused on cybersecurity. Generally, we are working with security vendors that have interesting datasets collected from the products or services that they offer. They come to us to analyze those datasets, extract insights and develop content for their audience: long-form reports, short-form infographics, interactive pieces, whatever. This is a thing we’ve been doing for years under the guise of Cyentia. But for me, personally, and for my co-founder, it goes way back before that. A lot of people are familiar with the Data Breach Investigations Report at Verizon. I started that after I came across a bunch of forensic reports. I thought, we should do something with this data besides just giving it to the client. I think studying a lot of this data across hundreds of clients and hundreds of breaches would be something that the world wants to read about. And so we ended up publishing that, and I led the DBIR team there for eight years, I think, and then just loved that kind of data-driven research so much that me and my chief data scientist there started Cyentia, so we get to do the same kind of thing on a larger variety of datasets.

Mike D’Agostino: [03:39]

Fantastic. Appreciate the background. That report is one of the most widely renowned studies even to this day. One of the groundbreaking, I would say, data analysis studies in the cybersecurity space. Kudos to you and your team there. You’ve got a great audience today of marketers that hit that intersection of content marketing and data analysis research, and that use this type of content for lots of different purposes. Why don’t we dive in? I don’t think anyone can argue with how AI, and more specifically generative AI, has rushed onto the scene over the past year. It’s rare that I find a professional who has not created a ChatGPT, Bard or other generative AI application account, and specifically with OpenAI’s introduction of a GPT marketplace, along with some of the plugins they’d already been offering, the use cases have gone up exponentially. One of those use cases revolves around data analysis. Full disclosure: I’ve experimented with a couple of GPTs and plugins on ChatGPT that attempt to parse structured data and, quite honestly, I was pretty impressed with the results. But that was working with a simple Excel file with maybe a hundred rows and five columns of dummy data. Easy enough. But what about when parsing through larger datasets, which you typically get your hands on? When and how can generative AI assist? Those are some of the things we want to get into today. Wade, let me start off by asking: what, in your opinion, is the current state of generative AI usage when it comes to analyzing larger datasets?

Wade Baker: [05:36]

I will caveat what I’m saying with this: I don’t think there’s any way anybody can test everything, because this stuff keeps coming out. Mike, like you said, it’s unbelievable. I mean, in my history in tech, I can’t think of anything that has grown so big, so fast and generated as much energy since the dotcom boom, as I remember it. It feels like websites coming and the World Wide Web is here, and boom. It feels very much like that. It’s impossible to keep up. However, we have looked at using LLMs, generative AI, ChatGPT and a lot of the others, testing them for various things across what we do. And I have to be honest and just say that my reaction has ranged all the way from “this whole thing is a farce, I don’t want to waste any more time on it” to almost stammering, “wow, that’s impressive, it’s hard to believe that it’s doing that well.” It’s hard to pin down the state of it because, at least in my experience, it’s been so wide-ranging. Maybe that’s partially due to the wide range of tasks that we’ve tried to apply these to, and maybe it’s even that, over the last year, when I got terrible results, it was early on, and they’ve improved now. So there’s just a lot of moving pieces. But it’s a pretty wide range. But it’s fascinating. I’d describe the state as fascinating, for absolute sure.

Mike D’Agostino: [07:19]

Yeah, you hit the nail on the head. That is the state: the state is in flux. It’s rapidly changing. There is no definitive current state – good, bad or otherwise. It’s definitely in flux. I saw a post on LinkedIn the other day that resonated with me. Somebody said that ChatGPT or generative AI is like having an army of interns: they can do a ton of work very quickly, but you have to have oversight over everything that they do. I thought that was a great description.

Wade Baker: [07:49]

It really is.

Mike D’Agostino: [07:51]

Yeah, well, let’s talk about limitations. Because obviously, we’re not at the point yet where you could feed in gigabytes and gigabytes of data and have it spit out anything near what you and your team are producing. Where do you see the boundaries at this point of how generative AI can be useful in data analysis? Or I guess, another way to put it is, how do humans influence the usage of generative AI with regards to data analysis? You can’t just load in a dataset and say, go to town and find me correlations. Humans still need to drive it to a certain degree.

Wade Baker: [08:31]

Yeah, they do. And in fact, currently, we have not replaced our traditional data analysis with LLMs or AI or anything yet, for several reasons. One, it’s difficult to know whether the answer that you get out of it is reliable without doing a lot of the work to verify that. So say I’ve got a large dataset, and we usually have deadlines, and I need to extract some insights and I start asking questions. I personally haven’t gotten to the point where I feel like I can just bank on the answers that I’m getting and start writing conclusions and drawing inferences from them, because when I have looked, some of them have been off. Then again, some of them have been great. That’s a real limitation on what we do, because I don’t want to publish some amazing statistic that gets a lot of press and then come to find out, you know, ChatGPT was hallucinating. That’s a fear of mine. Now, a different part of what we do, which I still consider under the data analysis banner, is data synthesizing and summarizing, and that’s a place where we have started to see some really good results. So say I’m doing research on a new topic. There’s existing research out there and a lot of findings. The historical way of doing that is I go and read all those reports and try to extract the interesting findings and understand what research has been done before and what data exists out there. I’ve seen really useful results from asking, “I’m doing research on third-party risk management, go look at the latest reports and summarize those key findings, and tell me what is interesting,” that kind of thing. I’m seeing very promising results, and that’s a form of data analysis. It’s not structured data, per se, but it’s arguably, for us, an even larger challenge, because it’s hard to get your hands around all of that stuff and make it succinct, and in a timely manner. Very useful for things like that. Even summarizing existing analysis is a place I’ve found some benefits.
Take one of our reports, for instance, that has lots of very detailed charts. We were just doing this today, in fact: it will explain the charts and the findings behind those datasets and draw out some conclusions in, like, a real quick summary. It’s super useful. It’s amazingly efficient at things like that.

Mike D’Agostino: [11:31]

Yeah, no doubt, and we’ll come back to some of that, and generating summaries and that sort of thing. But that first part was so on point, because even with my small trial of 100 rows, I was looking at dummy data, it was sort of like demographics information. It was pretty interesting, because it seemed like you could load in a file and say out of the 100 people represented here, how many had a title level of manager plus, and it knew that a director is higher than a manager, or VP is higher than a director, and then you have the C-suite. I did the exact same thing, because after it gave me the responses, I said, no way, that can’t be correct. I actually went back and counted every single row to make sure that it was on point, and I probably ended up spending more time validating the results than I did sort of playing around to get those results. I guess there’s a level of comfort that once you go through that, and you see that it’s producing the correct results that you get more confident that the next iteration is going to be correct. But at the same time, it takes a little bit of a leap of faith to put all of that into a generative AI to make sure that they’re going to get everything correct.
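The manual recount Mike describes can also be scripted, so that a chatbot's answer is checked against a deterministic count rather than by hand. What follows is only a minimal sketch under stated assumptions: the `SENIORITY` ranking, the sample titles, and the `llm_answer` value are all hypothetical and are not taken from the episode.

```python
# Hypothetical seniority ranking; the ordering is an assumption for illustration.
SENIORITY = {"analyst": 0, "manager": 1, "director": 2, "vp": 3, "c-suite": 4}


def count_at_or_above(titles, minimum="manager"):
    """Count titles whose rank is at or above `minimum`.

    Titles missing from SENIORITY get rank -1 and are never counted.
    """
    floor = SENIORITY[minimum]
    return sum(
        1 for title in titles
        if SENIORITY.get(title.strip().lower(), -1) >= floor
    )


# Made-up sample standing in for the "title" column of a small spreadsheet.
titles = ["Analyst", "Manager", "Director", "VP", "C-Suite", "Analyst"]

llm_answer = 4  # hypothetical number reported by the chatbot
recount = count_at_or_above(titles)
print(f"recount={recount}, matches LLM answer: {recount == llm_answer}")
```

A check like this only confirms counts that can be defined up front; it does not validate open-ended interpretations the model offers.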

Wade Baker: [12:48]

Yep, definitely.

Mike D’Agostino: [12:50]

How about some specific use cases? Like we briefly discussed, leading up to this, you mentioned quantitative risk analysis; sounds like a scary term there. Can you describe to our audience exactly what quantitative risk analysis is? How might generative AI be of use?

Wade Baker: [13:10]

Absolutely. What we’re trying to do here is take data on security incidents: how often they happen, how much they cost when they do occur, and what types of threats most commonly contribute to those incidents. Things that cybersecurity defenders or risk managers would want to know. I’m defending against all of these threats, so which ones should I pay attention to most? Which ones represent the highest risk? Usually, that boils down to some form of frequency and losses. There are datasets out there, public and private, that are lists of incidents. Recently, in the press a lot is the SEC requirement that material events need to be reported. I think that’s going to be a goldmine for these generative AI tools and LLMs, because that’s text-based information where the company is saying, hey, this happened to me, here’s what happened. That is data that would drive risk analysis if you’re a company wanting to know how many of these events happen to organizations like yours, or in your sector, and what kinds of incident patterns or threats contributed to those events. Very useful for something like that. Some of the public datasets out there that collect incidents and report on them are also a use case that I think has a ton of merit, because it’s just hard for humans to go and filter through all of that information and bring it back. That’s super useful, I think, in terms of finding cost information. I have seen pretty good results from ChatGPT and others recognizing that what is being described here is a financial statement of loss. That’s a concept that it seems to understand. For things like ransomware or data breaches, it can, within reasonable accuracy, bubble up things and find data points and other information that fit that description. From a risk quantification standpoint, having all of those data points, that’s what it is. It’s trying to collect all of that there.
Some people call them priors, if they’re certain types of statisticians, but the name of the game is collecting all of that information, sifting through it, understanding it, making sense of it, and then usually applying it in some kind of model to drive decisions.
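The "frequency and losses" framing Wade mentions can be illustrated with a small calculation. This is a rough sketch, not Cyentia's method: the incident records are invented, and the frequency-times-mean-loss estimate is a deliberately simple stand-in for whatever model an analyst would actually apply.

```python
from statistics import mean, median

# Invented incident records; real inputs might come from public incident
# datasets or disclosure filings, as discussed above.
incidents = [
    {"type": "ransomware", "loss": 200_000},
    {"type": "ransomware", "loss": 400_000},
    {"type": "data breach", "loss": 100_000},
    {"type": "data breach", "loss": 50_000},
    {"type": "data breach", "loss": 150_000},
    {"type": "data breach", "loss": 100_000},
]


def summarize(records, years_observed):
    """Per incident type: frequency per year, median loss, and a simple
    expected annual loss (frequency times mean loss)."""
    losses_by_type = {}
    for record in records:
        losses_by_type.setdefault(record["type"], []).append(record["loss"])

    summary = {}
    for kind, losses in losses_by_type.items():
        frequency = len(losses) / years_observed
        summary[kind] = {
            "frequency_per_year": frequency,
            "median_loss": median(losses),
            "expected_annual_loss": frequency * mean(losses),
        }
    return summary


# Assumes the records above were observed over a two-year window.
result = summarize(incidents, years_observed=2)
for kind, stats in result.items():
    print(kind, stats)
```

Loss data tends to be heavily skewed, so a mean-based figure like this is better read as a starting point than as a risk estimate.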

Mike D’Agostino: [16:06]

It seems like we’re nowhere near the point where we can simply ask human-centric questions and have it find those correlations. I’m even thinking of some of the exercises we went through with you and Cyentia when we were crunching, not machine data, but more marketing data, quite honestly, intent data, and things like making associations between content engagement and current events, for example. That seems like a big stretch to ask ChatGPT to come up with.

Wade Baker: [16:47]

I would agree. Currently, I think it’s a stretch, but getting better from what I’ve seen. Sometimes, if you’re looking for it to collect all the data, analyze it all, and make the decision for you, that’s a pretty hard ask, but be my assistant in conducting this risk assessment, and gathering data and sort of interpreting it and those kinds of things. I think we’re there for that use case.

Mike D’Agostino: [17:23]

What about survey data? That’s a little bit of a different use case. I know you’re very focused on product and machine types of data. But I also know that you do run some more traditional, human-answered survey types of analysis. And the vast majority of our listeners, marketers that work for cybersecurity, IT and other related vendors, they all do it. We all do; we even run our own surveys here at ISMG and CyberTheory. I tend to think, and I don’t know if the right term is that it’s a little bit more simplistic or contained, that you have one question and you get a number of responses. All the responses are related to just that one question. Have you experimented at all with that? Are you finding some use cases with that type of data?

Wade Baker: [17:23]

I’ll be honest and say I haven’t. We haven’t done a survey analysis in a while, but I do have it on the menu to try, because I agree with you that that type of data is pretty well structured, and it’s usually smaller. You have certain fields that have a question that I think it can understand, and usually some kind of pick list of options. I have high hopes for that, and from some of the reading that I’ve done, the Advanced Data Analysis feature in ChatGPT, for instance, at least looks like it has the capabilities for doing that. I haven’t personally experimented with it. But that is definitely an area where, once we get another survey, we’re going to give it a shot, and maybe I’ll be able to report back.

Mike D’Agostino: [18:24]

That would be great. We did our own generative AI study/survey towards the tail end of last year, and some of these custom GPTs and plugins etc., weren’t available at the time. Otherwise, I was encouraging our team to do a human analysis and then plug it into it.

Wade Baker: [19:25]

Just analyze the data and write the report.

Mike D’Agostino: [19:28]

And see how it matched up. But actually, funnily enough, we have somebody on our team that’s very, very far advanced in using generative AI, at least in my opinion, and we do some of these sorts of survey-based programs on a continual basis. He was experimenting with it and actually created personas that were based on the various target audiences that we were looking to accrue results from, and then ran some of the survey questions by that ChatGPT-generated persona. Wouldn’t you know that the responses were nearly identical to what we got from actual humans? It was actually a little scary. I said, we’re going to have to start adding a caveat to all of our survey reports: all responses were from actual humans, not GPT personas.

Wade Baker: [20:44]

I think that’s pretty cool. I’m experimenting to see if it can write with snark and ’80s pop culture references and things, so I can just quit and retire and let it do that work for me.

Mike D’Agostino: [21:00]

Have an army of GPTs for you? You mentioned this earlier, and I bring it up because we have a lot of content marketers that listen in. Like I mentioned before, many of them, as you know, are using research and survey reports as part of their buyers’ journey content marketing plans. Talk a little bit more about that. You touched on it, but it seems like one of the more obvious use cases is generating content, writing summaries, and building narratives around found correlations and data points. It seems like you’ve had some success in that department.

Wade Baker: [21:45]

I’m not a marketer, but I work with many and at least generally have an understanding there. I’ll give a use case that seems interesting to me; hopefully, it will be to you as well. Let’s talk about competitive research analysis. You’re in a space and you want to publish content, and you want to publish content that doesn’t just mimic what your competitors say and tread over the same data points, and things like that. And at the same time, you need to know what their data points and main findings and other things are. I think this is an excellent use case. You can pull a body of reports, and you can feed it in. I mean, we’ve experimented with lots of different tools out there that can take a PDF document that you download, you know, just like everybody downloads and reads it, and you can kind of take that corpus of information and begin asking questions: hey, what are the main findings of this? Do they do anything about this thing? Tell me what the trends are in this space based on this set of documents. It could be for one competitor, or it can be for all the other competitors in that space. But I think that’s a really cool use case for quickly getting up to speed on data analysis and research that’s been published by others that you kind of want to get a bead on.

Mike D’Agostino: [23:06]

Yeah, right on the mark. I think one of the great use cases is taking a meaty survey report that could be, you know, 20, 30-plus pages, and creating variations on that in smaller, bite-sized chunks. So taking a larger report, and then breaking it out. We’ve tried some of the services that have popped up that attempt to do that, to take one comprehensive piece and produce five or six content marketing assets from it. We haven’t really found any so far that are just completely plug and play and produce results that we thought were worthy, but experimenting on our own and experimenting with the prompts that we’re asking, we’ve been able to produce some decent, smaller, bite-sized chunks based on a larger compendium. So we have seen some success there. Just a last question for you here. I know this is kind of on the edge, but just to play the speculation game, any thoughts, being knee-deep in data analysis, on how this might pan out in the next year, five years, 10+ years? Anything that you see coming down the horizon that you feel would make it all kind of come together?

Wade Baker: [24:37]

There’s some places where we spend a lot of time doing what we do. Data cleaning is one of them. I think that might be a place where it’s “do X, Y and Z to this dataset,” and that could be automated, and AI could be pretty good at that. Basic exploratory analysis is another. That takes a lot of time, but it’s kind of repetitive, and it’s a lot of, okay, I’m going to take this column of data, and I’m going to do these functions, and then I’m going to do it to the next column of data and do these functions, and you’re just trying to get a sense of what the data looks like. Things like that, I think, are not too far out. That’s not too far a speculation; I think it will become very useful in that domain. And then, you know, to sort of flip the script, I also think that guiding data analysts is a possibility. We have a challenge in what we do, in that I need data analysts that are not only good at analyzing data, but also understand the security industry well enough to know what to look for in the data. I kind of wonder if maybe I can sort of outsource, if you will, the security contextual understanding to the AI, and say, hey, we’re trying to do this, what are some key findings or insights or whatever that we should look for in the data? And there’s some leads given, and then analysts can go find that. So it almost sounds like the inverse of what many people think about, but I think there’s some possibilities there.
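The repetitive column-by-column pass Wade describes is the sort of routine that is already easy to script. Below is a minimal sketch using only the Python standard library; the `profile_columns` helper and the tiny CSV sample are hypothetical, and real exploratory analysis would go well beyond these few statistics.

```python
import csv
import io
from statistics import mean


def profile_columns(rows):
    """Apply the same basic checks to every column: missing values,
    distinct values, and min/max/mean when the column is numeric."""
    profile = {}
    columns = list(rows[0].keys()) if rows else []
    for column in columns:
        values = [row[column] for row in rows if row[column] not in ("", None)]
        info = {
            "missing": len(rows) - len(values),
            "distinct": len(set(values)),
        }
        try:
            numbers = [float(value) for value in values]
        except ValueError:
            numbers = None  # at least one value is not numeric
        if numbers:
            info.update(min=min(numbers), max=max(numbers), mean=mean(numbers))
        profile[column] = info
    return profile


# Made-up sample; the third row has a missing loss value.
sample = io.StringIO("sector,loss\nfinance,100\nhealth,300\nfinance,\n")
rows = list(csv.DictReader(sample))
report = profile_columns(rows)
for column, info in report.items():
    print(column, info)
```

The judgment about which of these numbers matters for a security audience is the part that still needs the analyst.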

Mike D’Agostino: [26:36]

Wade, you stole my thunder. I was going to bring up that exact fact. I think you can start to rely more and more on these systems for doing some of the data analysis and crunching the numbers. Like I alluded to, for shorter-form survey data that is very contained, yes, you can get some of the answers that you’re looking for. The one thing that just seems so difficult to bake in is context. And that, to me, is the true differentiator, and why I make the statement that I don’t think you should fear ChatGPT taking over your job anytime soon, because you have that cybersecurity context. You’re an insider. You understand, beyond the data, the different types of correlations and different things that can bubble up through that data analysis. Even right now, if you’re training some of these GPTs and you’re imparting some persona attributes to them, I don’t think it’s at the point where it’s going to have that 20-30+ years of context that you can apply to the analysis, and that, to me, is the key differentiating factor, at least right now.

Wade Baker: [27:56]

Yep, I’m right there with you.

Mike D’Agostino: [27:59]

That’s where you come into play. This has been a great conversation. Hopefully, it was insightful for our listeners here. Like I said, I know we have lots of content marketers, most of the vendors out there. They’re using research and survey data. And I’m sure they’re all looking for shortcuts. Unfortunately, we don’t have too many answers for them right now. You still need that contextual insight to understand how to draw out some of those data points and correlations. But who knows? As we started off with, everything is in flux and moving so fast these days that a month from now, six months, a year, two years from now, maybe we’ll be in a different place. But I appreciate your insights. Thanks for joining the show. Hopefully we can get an update from you sometime soon. Thanks, Wade. Thanks, everyone, for listening to another episode of Cybersecurity Marketing Unplugged. I’m your host, Mike D’Agostino, and I’ll catch you next time. Thank you.