How Will ChatGPT Change The World?
Let’s talk about the tech/internet world’s favorite craze: ChatGPT. OpenAI’s ChatGPT API spurred the development of a number of apps – from Google Sheets extensions to Slack integrations to…just about everything. Candidly, it seems like one can’t go anywhere without hearing about ChatGPT, so let’s dig in.
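Part of why those apps appeared so quickly: integrating the API is trivially easy. Here’s a minimal sketch of the kind of integration being described – not any particular product’s code; the prompt and use case are illustrative – assuming the pre-1.0 `openai` Python package that was current when this was written:

```python
# a minimal sketch, not any particular app's code: the prompt, system
# message and use case here are all illustrative
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the model behind ChatGPT at the time
    messages=[
        {"role": "system", "content": "You are a helpful spreadsheet assistant."},
        {"role": "user", "content": "Write a formula that sums column B where column A equals 'Q4'."},
    ],
)
print(response.choices[0].message.content)
```

A few lines like these are the entire technical core of many of the apps flooding the market – worth keeping in mind when evaluating them.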
Everyone in tech (and quite a few people outside of it) has gotten *very* excited that Large Language Models (LLMs) represent an inflection point in AI. Microsoft (smartly) threw jet fuel onto that fire by investing $10B into OpenAI, promising to integrate ChatGPT-esque functionality into search, Office, Outlook and more.
It’s pretty clear that the next iteration of AI/ML – more step change than paradigm shift – has started with a bang. But three critical considerations seem to have been lost:
1. This isn’t new or unique
Generative LLMs are simply the reverse of how most machine learning models operate. Models like this have been around for a few years, in different forms + contexts, and with varying levels of complexity – from chatbots to ImageNet, and more recently, Jasper, Stable Diffusion, MidJourney & DALL-E / DALL-E 2.
The short, simplified version: ML models take “easy for people, hard to describe to machines” logic problems and turn them into statistics problems. There’s no need to “teach” a computer what’s an arm vs. a leg vs. a face – that’s a wildly, mind-bogglingly difficult thing to do. Instead, we give a computer a million pictures, categorized as arms, legs and faces, and it trains itself (similar, in some ways, to how we learn). We then show the trained model a photo of one of the three, and it predicts with a relatively high degree of accuracy which label to apply (or if it’s none of the above). The key point here is the model doesn’t know what an arm or a face is; it just recognizes the pattern and delivers a result.
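As a minimal illustration of that “statistics, not logic” point – using scikit-learn’s bundled handwritten-digit images in place of arms/legs/faces, since those ship with the library:

```python
# a minimal sketch: train a classifier on labeled examples, then predict.
# the model never learns what a "7" *is* -- it fits a statistical boundary
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # 8x8 pixel images, labeled 0-9
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)

print("predicted:", clf.predict(X_test[:5]))  # pattern-matched guesses
print("actual:   ", y_test[:5])
print("accuracy: ", round(clf.score(X_test, y_test), 3))  # high -- but not understanding
```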
Generative LLMs are the reverse: we provide it a question, and it “generates” content that matches the pattern of what other things within its corpus of knowledge tend to exhibit. It doesn’t “know” the answer any more than a calculator knows the concept of 2 + 2 or ImageNet “knows” what a cat is. An LLM simply crafts a response within the parameters specified that follows the pattern it has been trained on. That’s exactly what happened with Stable Diffusion, where it copied “Getty” watermarks into millions of generated images – because it doesn’t understand what it is making. It just knows that the training data had a specific pattern in it (a pattern we’d call a “watermark”), so it created a novel image responsive to the prompt that fit the pattern – which happened to include the watermark.
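A toy sketch makes the point. This is a bigram model – vastly simpler than a transformer, and not how real LLMs are built – but the “follow the pattern, watermark and all” failure mode is the same:

```python
# a toy pattern-follower, not how real LLMs work: it emits whatever tends
# to come next in its training corpus -- including the "watermark"
import random
from collections import defaultdict

corpus = ("stock photo of a cat GETTY "
          "stock photo of a dog GETTY "
          "stock photo of a bird GETTY").split()

follows = defaultdict(list)  # word -> words observed after it
for a, b in zip(corpus, corpus[1:]):
    follows[a].append(b)

def generate(start: str, length: int = 8) -> str:
    words = [start]
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:
            break
        words.append(random.choice(options))  # sample the pattern; no "understanding"
    return " ".join(words)

# every training example carried the watermark, so the output does too
print(generate("stock"))  # e.g. "stock photo of a dog GETTY stock photo of"
```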
The end result: the veracity of an LLM-generated response is still very much in question (and very much subject to the quality + comprehensiveness + nuance of the training set). To use an analogy, ChatGPT is an (over)confident high school sophomore, not Jarvis or the Supreme Intelligence.
In a vacuum, that’s all well and good – there’s nothing inherently wrong with being a high school sophomore who is confidently wrong in some domains. In other domains, a trusted resource being confidently wrong can have disastrous implications for a person’s health, wealth & well-being (which is why Google introduced its “Your Money or Your Life” (YMYL) standards years ago).
The question is: how do you teach an LLM which is which – which queries are subjective, which are objective, and which are a bit of both? Then, how do you simultaneously build an LLM’s authority with a user base to the point where they’ll switch to your platform (Bing) while mitigating the risk inherent in providing answers vs. resources?
So, why is ChatGPT different? There are three ways to answer this: the first is that ChatGPT is one of the first public models to include a “human” feedback loop, where actual people ranked responses on a scale of more to less natural – the outcome being that ChatGPT, unlike other LLMs, actually sounds human. That’s no small feat – producing responses that are decently accurate, appear compelling / plausible and are responsive to a wide range of query types (from summarizing content to writing code to solving equations) is an incredible thing. It is not my intention to discount that in the slightest – it’s a truly remarkable achievement.
The second is that ChatGPT has incorporated a significant number of guardrails to “raise the floor” of result quality and prevent it from assimilating false, racist, xenophobic and other toxic content into its knowledge base – something that both Microsoft (remember Tay?) and Meta (hi, BlenderBot 3) failed to do.
The third – and perhaps most critical – component is that ChatGPT gained rapid adoption through a wonderfully simple, intuitive interface. For most of the past 3-5 years, LLMs have been restricted to fringe corners of the internet or obscure platforms that most people simply didn’t know about. ChatGPT put the functionality front-and-center, opened it up at a perfect time (right after Thanksgiving) and has been rewarded with the fastest-ever growth to 100M total visitors[1]. But despite the success and adoration of the innovation-happy masses, ChatGPT is not perfect – far from it. In the ~100 days since its launch, there’s been a near-endless barrage of articles, takes and prognostications on the implications of ChatGPT on everything from marketing departments to higher education to newsrooms. Some of that is real. Much of that is overblown.
2. There are significant limits that aren’t obvious
One of the most important is that LLMs have two types of optimization events – the query and the response – and scaling those feedback loops is deceptively tricky. This dovetails nicely with search – specifically: (1) how many searches are ones for which an LLM-generated response is (a) appropriate and (b) sufficiently accurate for the domain of the initial query; (2) of the responses produced, to what degree are they original vs. derivative (remixes of the existing corpus); and (3) what is the feedback loop that enables the system to learn while preventing the system from accelerating corruption in the corpus/index?
To give an example: AlphaGo (later AlphaZero) was able to make remarkable progress + eventually develop original moves – but only after playing millions (~4.9M) of games against itself, with a built-in feedback loop (win/loss). In search, the feedback loop is not nearly as clean or apparent as it is in Go or Chess or even Poker – what constitutes a successful result? Is it query acceptance (and how do you know if the user goes to Google?) What does response rejection look like (and how is that different from query refinement?) Does the user matter in crafting a response? How do you tell the difference?
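To make the ambiguity concrete, here’s a hypothetical sketch – none of these event names or fields come from any real system – of what a generative search engine might try to log:

```python
# hypothetical feedback events for a generative search engine -- the point
# is that every signal below is ambiguous on its own
from dataclasses import dataclass
from enum import Enum, auto

class Signal(Enum):
    ACCEPTED = auto()    # user stopped searching... or gave up?
    REFINED = auto()     # user re-queried: refinement, or rejection?
    ABANDONED = auto()   # user left: satisfied, or off to Google?

@dataclass
class FeedbackEvent:
    query: str
    response_id: str
    signal: Signal
    dwell_seconds: float  # long dwell: engrossed reading, or struggling?

# the hard part isn't logging these; it's that the *same* event is positive
# for one query/user/domain and negative for another
```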
These are maddeningly difficult questions, many of which are domain + query + user specific – being mostly correct about a summary of the 76ers game last night is far more acceptable to me than it would be to an avid sports bettor (user-specific). And being wrong on either of those 76ers examples is far less problematic than being mostly correct about how a particular clause in a contract functions, or how to execute a Long Straddle, or which drugs are indicated and safe for a given patient profile, or any other of a mind-bogglingly large set of high-impact queries. In the same vein, there are some domains where a ChatGPT article being 95% or 97% correct is wonderful and the 3-5% error doesn’t matter. There are others where the 3-5% that’s wrong is all that matters, and being wrong there renders the 95-97% that was correct absolutely irrelevant.
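A toy bit of arithmetic (all numbers made up) shows why an identical error rate can be trivial in one domain and disqualifying in another:

```python
# illustrative only: the same 3% error rate, weighted by the cost of a miss
error_rate = 0.03

domains = {
    "sports recap": 1,            # a wrong stat: mild annoyance
    "contract clause": 10_000,    # a wrong reading: real money
    "drug guidance": 1_000_000,   # a wrong indication: real harm
}

for name, cost_of_error in domains.items():
    print(f"{name}: expected cost per answer = {error_rate * cost_of_error:,.2f}")
```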
Google has (mostly) solved this using their SERPs as a point of leverage – the search engine creates the index + algorithmically returns 10 blue links (+ features like the knowledge panel), and the aggregated data from billions of users interacting with trillions of SERPs (yes, with a T) is used[2] to incrementally refine those results over time. And even with trillions of data points, the world’s most sophisticated filtering + curation systems and a data moat that dwarfs any previously seen in human history, Google still struggles to deliver excellent results with consistency in certain domains.
Now, if beating that seems daunting, remember that what generative search is trying to do is exponentially more complex – it isn’t just trying to find the optimal result, it’s trying to create it. That’s not only massively complex, but it ignores much of the nuance + possibility around queries – something that not even MidJourney (which returns four options) or Google Search (10 blue links + features) attempts to solve.
The second challenge – and arguably the more impactful one – is the fact that LLMs are statistical engines that do not understand underlying concepts. ChatGPT doesn’t understand what marketing is, any more than it understands what an arm is or that the Flying Spaghetti Monster is a satirical dig at organized religion. It simply recognizes the pattern, and it generates content that fits the pattern.
A related challenge: the pattern is created based on the data fed into it[3] – and when that corpus is the internet, well, yikes. Again, Google circumvents this problem by providing multiple options + observing behavior – allowing people to make the ultimate decision. ChatGPT doesn’t have that luxury (at least, not yet) – and that creates situations where it crafts responses that are obviously wrong to a certain type of person (a subject-matter expert, say, or someone not in a cult), but not to a machine or to another type of person (a novice, or the cult member). That only gets more problematic when LLM-generated content begins to be incorporated within the index itself – errors multiply and truth is lost in the Library of Babel.
3. Speaking of creation
There is very much a Library of Babel problem[4] with generative LLMs: ChatGPT (and its successors) could create every masterpiece the world has ever known – but how would it know when it has created something remarkable vs. something mundane vs. something nonsensical? (This is a variation on the problem with the Infinite Monkey Theorem.)
This ties into a broader point: LLMs are ill-equipped to make things that are truly “creative” or “disruptive” – at least, as we conceive of those concepts. The reason is that true creativity tends to come not from following a pattern, but from breaking the pattern – and not just breaking it, but breaking it in one of a few specific, often-counterintuitive ways. Part of that can be solved by asking the right questions of the LLM (prompt engineering) – but part of it is something far more complex, intuitive and (to a degree most marketers + creatives don’t want to admit) lucky. It’s creating a magical moment of resonance with a latent need or a not-clearly-expressed frustration in the collective Zeitgeist.
The Upshot: This is all well, good and potentially interesting – but what are the likely impacts of ChatGPT (and the iterations / LLMs that are likely to follow it)? In my view, most of the ChatGPT craze has been mis-focused on search – which is (ironically) the domain *least* likely to be impacted by ChatGPT in the short-term (for the reasons highlighted below). Instead, I think it’s helpful to view ChatGPT as an enabling layer, similar to databases and mobile networks – the unseen infrastructure that makes our much-more-productive reality possible.
- LLMs As Productivity Multipliers – Not People Replacers. The most likely near-term use case may be as a productivity enhancer – replacing some of the time-consuming tasks inherent in creating original content (like idea generation, historical/background research, or summarizing detailed papers). In the right hands, and with the right prompts, ChatGPT (and other LLM technology, like Google’s Bard) is a productivity multiplier. This is another classic case, as with automation in financial markets + advertising, where smart people and smart machines together create an amalgamation – a result greater than the sum of its parts. An example of this is GitHub’s Copilot – a productivity enhancer that integrates naturally into workflows + allows coders to be exponentially more effective.
It’s easy to imagine ChatGPT as a co-pilot of sorts that works with you across devices, automating away mundane + rote tasks and enabling you (the user) to spend more of your time doing highly-productive, zone-of-genius work.
- Content Goes Commodity – One thing LLMs do particularly well is produce perfectly fine, mostly acceptable content. The implications of that are significant in some areas of marketing – if for no other reason than ChatGPT raises the floor for what “acceptable” content looks like. To illustrate this point, imagine if we plotted all of the internet’s indexed content on a bell curve – with truly horrible at the left end, masterpieces on the right and the vast majority falling in the middle. Today, the bulk of that curve sits well left of center (there’s LOTS of bad content on the internet).
It’s likely that ChatGPT shifts the distribution in two ways: (1) it moves the mean a little closer to the right & (2) it makes the center substantially higher (roughly visualized in the sketch below). For some brands (especially those with limited ability to produce content), that will be wildly helpful and enable them to produce significantly more content of acceptable quality. For others with a vast original content production engine, ChatGPT will have a minimal impact (largely confined to research + rote tasks) – because generative content is limited in its ability to create original content. And for those who are currently making a living creating mediocre content, well, time’s up.
- Content Quantity Does Not Drive Quality – The second implication of the commoditization of content is that the quantity of content production is likely to increase exponentially. Currently, the limiting factor on content creation is (almost always) time – we have plenty of ideas for articles and posts to write, we just don’t have the time to write them (much less, write them well). ChatGPT solves that. The issue is that quantity does not drive quality – remixing a ton of otherwise mediocre content using somewhat novel prompts is about as likely to result in a masterpiece as me throwing paint on a canvas is to produce the next Mona Lisa. What it will do (as noted above) is exponentially increase the amount of “meh” content out there.
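For the visually inclined, here’s the rough sketch of the distribution shift promised above – the curves and numbers are purely illustrative, not data:

```python
# purely illustrative curves: mean nudged right, center substantially taller
import numpy as np
import matplotlib.pyplot as plt

quality = np.linspace(0, 10, 500)

def bell(mean: float, spread: float, height: float) -> np.ndarray:
    return height * np.exp(-((quality - mean) ** 2) / (2 * spread ** 2))

plt.plot(quality, bell(3.5, 2.0, 1.0), label="today: lots of bad content")
plt.plot(quality, bell(4.5, 1.2, 1.8), label="post-LLM: higher, tighter middle")
plt.xlabel("content quality")
plt.ylabel("volume of content")
plt.legend()
plt.show()
```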
The second-order effect of the exponential increase in content created is a corresponding increase in content filtering – Stanford & Google (among others) have already developed sophisticated models to detect + filter LLM-generated content. There’s strong evidence that Google integrated AI-generated-content filters into its December 2022 helpful content update. If Google (with its ~93% market share in search) is serious about culling artificially-generated results, there’s a real limit to how helpful ChatGPT can be to marketers (including content marketers + SEOs). There are certainly ways to fool online detectors, but Google has a massive financial incentive to get this right – crawling + indexing the internet is a staggeringly expensive proposition as it is – and if Google can’t accurately filter out the types of content it doesn’t want, the most likely outcome is increased costs + lower marginal utility (and thus, lower revenues) from search.
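To give a flavor of how detection can work – this is emphatically not Stanford’s or Google’s method, just the classic perplexity heuristic, and it assumes the Hugging Face `transformers` package and GPT-2:

```python
# a crude sketch: model-generated text tends to be *less surprising* (lower
# perplexity) to a language model than idiosyncratic human writing
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()

THRESHOLD = 20.0  # illustrative only; a real detector tunes this on labeled data
sample = "The quarterly results exceeded expectations across all segments."
print("likely machine-generated" if perplexity(sample) < THRESHOLD else "likely human")
```

Real detectors are far more sophisticated, but the cat-and-mouse dynamic – generators optimizing for naturalness, detectors hunting for statistical fingerprints – is exactly this.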
- Monetization – The most obvious question is monetization. Search is a cash cow for Google (to the tune of ~$200B/yr) – and they are rightly treating any threat to it as an existential, do-or-die matter. Microsoft is clear that they view ChatGPT (and LLMs like it) as a lever to capture incremental search market share. That being said, it isn’t obvious how monetizing an LLM-generated response (for instance, by including content provided by an advertiser or a mention of a specific product in the response text, where appropriate) is materially different from monetizing a SERP.
In the short-term, even the above is far-fetched – with the more likely solution being an LLM-generated response replacing the Featured Snippet at the top of a traditional SERP.
Yes, the creative may change – but that’s been a constant in the search ecosystem for 20+ years, ranging from ad location to ad types (DSAs & RSAs) to ad extensions, etc. All the while, the core concept – paying for the ability to reach a specific searcher at a given point – remains unchanged. The functionality underpinning that will certainly change (for instance, how do self-serve ads work in an LLM-driven search engine?), but all of this raises the question of why Google can’t do the same thing (they can) and why one would think they haven’t war-gamed this for years.
- Search Augmentation vs. Replacement – The idea that ChatGPT is the “Google Killer” or “end of search” seems to rest on an incomplete (and flawed) understanding of how search works. ChatGPT is a staggeringly expensive, but predominantly static, LLM trained on an outdated snapshot of a segment of the internet. What does it cost to train ChatGPT in real-time on a continuously updating index of the web? I don’t know – but I’m quite confident it’s orders of magnitude more than the estimated $5M/day (it was ~$100K/day in early December, at just ~1M users). How do you teach a continuously updated model of the internet to weight content source vs. recency? There are certainly some queries where recency doesn’t matter – and there are certainly some where it absolutely does.
Finally, personalization is a critical component of search that gets lost in LLMs. If I ask ChatGPT where the best pizza is, it has no idea of my preferences, order history, location, etc. – all of which are required to return a relevant result or list of results. There’s no way for it to know if it is recommending pizza shops that offer the kinds of pizza I like (brick oven, please), or if the pizza shops are even good (Domino’s is not). Likewise, if I’m a subject-matter expert, I will expect a far different level of complexity and sophistication in a response than someone who is just learning about a field. In search, my browsing history can be used as a signal (among many others) for what results to surface; in an LLM, it’s less obvious how – or whether – that will work.
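To make that concrete, here’s a toy scorer – every field, weight and shop is hypothetical – showing the kinds of signals a personalized ranker can use that a bare prompt cannot:

```python
# hypothetical personalization signals a prompt alone doesn't carry
def score(shop: dict, user: dict) -> float:
    s = shop["rating"]                                          # generic quality
    s += 2.0 if shop["style"] in user["liked_styles"] else 0.0  # order history
    s -= 0.5 * shop["km_away"]                                  # location
    return s

shops = [
    {"name": "Brick & Fire", "rating": 4.2, "style": "brick oven", "km_away": 3.0},
    {"name": "MegaChain Pizza", "rating": 3.9, "style": "chain", "km_away": 1.0},
]
user = {"liked_styles": {"brick oven"}}

print(max(shops, key=lambda shop: score(shop, user))["name"])  # -> "Brick & Fire"
```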
- Where does ChatGPT fit in search? There are certain types of searches which are appropriate for an LLM-generated response – and quite a few that aren’t. Part of this is a question about query types – the ones where ChatGPT can be useful are the “known-knowns” and (to a lesser extent) the “known unknowns” – the things we’re aware of and understand or don’t, respectively. It’s less obvious how helpful ChatGPT is for the “unknown knowns” and the “unknown unknowns” – the things we understand but aren’t aware of (so we search), or the things we neither understand nor are aware of.
What we can be relatively confident in is that one outcome of ChatGPT is that a lot of sites that no longer deserve to win traffic from search, but continue to do so for lack of competition, will find that their traffic dries up rapidly (and by extension, so too does their revenue from Display placements). This will likely be concentrated in the “known-known” queries – but could bleed into other types as well. This has been happening for years with Google’s Featured Snippet (aka the “Answer Box”), and it will simply accelerate with the shift from “excerpted response” to “generated response.” It is not new. Yes, that means the companies and individuals behind these terrible experiences will lose. No, I’m not sad to see them go – and I’m sure, neither are you.
- The Opportunities in Search for ChatGPT – There’s a certain class of search – like comparisons (“Does LA or SF have a larger population?”) or compound-conditional informational searches (“which teams have won more than 3 World Series since 1990?”) – where ChatGPT is uniquely well-positioned to shine, and where the current solution (read: Google) is bad to horrid. What’s less clear is the marginal utility of winning that specific domain vs. the marginal utility of consistently high-quality results (defined as timely, relevant, responsive, authoritative, accurate & comprehensive) – which Google does a surprisingly decent job of achieving for the vast majority of queries.
We know that search admits of network effects (something Google has used as a point of leverage for 20+ years) – what is less clear is whether the marginal increase in utility for some queries is sufficient to dislodge Google’s (absolutely massive) advantage in the category.
- Parallels with Voice Search: there are a number of interesting parallels between today’s ChatGPT craze and the 2018-19 “Voice Search Mania” – when it seemed that connected devices were poised to take over the world and keyboards would be a thing of the past. Of course, that never materialized (or, at least, it didn’t materialize in the way experts thought it would) – despite the confident predictions of many in tech. Ironically, one of the primary reasons voice search failed to displace traditional search (it’s stupendously difficult to consistently parse precisely what people want from what they say in a single domain, let alone a generalized context) is eerily similar to the ChatGPT limitations discussed above. I’m not saying history will repeat itself here, but it has been known to rhyme.
- Legal & Regulatory Considerations – All types of Generative LLM models are opening entirely new legal + regulatory questions – from what it means for a model to “access” content (and it’s pretty obvious that this is a fundamentally different question from what it means for a search engine to crawl your website + include snippets of it in SERPs), what this means for copyright (can a LLM copyright results? Can a searcher who incorporates LLM content into their original works copyright them?), and what the regulatory regime around generative ML becomes – are there fees + compensation paid to content creators whose work is fed into LLMs (and if so, how does that skew the LLM model?).
This sets aside the liability for LLM-generated results – but ultimately, that’s a massive set of questions that we (collectively) will need to reckon with. Who is responsible if an LLM creates a response that leads to real-world harm? Can companies be held liable if an LLM included on their site points someone in the wrong direction or plagiarizes work? Are companies able to protect content generated by an LLM? To what extent? Do the creators of “seed” content used in LLMs have rights to future profits generated by the LLM?
More recently, China announced a massive crackdown on computer-generated media – and it certainly will not be the last country to do so. These are not simple questions, and every answer comes with tradeoffs that will need to be weighed by serious, well-informed and highly knowledgeable people.
The bottom line is this: we’re still figuring out the domains where ChatGPT will disrupt – which ones have the right combination of specificity, error tolerance (and tolerance for the types of errors that LLMs are likely to make) and depth – to enable LLMs to create something remarkable or find a latent pattern that normal people (or even experts) would miss.
ChatGPT is certainly enchanting + exciting – both as a platform and as a precursor to the things we hope AI/ML can one day accomplish. Like every platform, it has very real applications (some of which I mentioned above) and very real limitations.
The key – for marketers, creatives, business owners, executives & regular people – is discovering (through trial + error) the right points of leverage where LLMs can have an outsized impact on our productivity & happiness. I don’t know what those are yet, but I’m excited to find out.
[1] According to a SimilarWeb study, which found ChatGPT reached ~100M visitors in January. DAU + MAU counts are still unclear.
[2] In conjunction with other factors – we’re not getting into a detailed discussion of ranking factors here.
[3] Bias in AI is a staggeringly large + thorny issue that receives relatively little coverage – but one that will undoubtedly become more prominent as LLMs (and similar ML-driven content engines) become more popular.
[4] If you’re unfamiliar: the Library of Babel (from Borges’s short story of the same name) is a conceived universe in the form of a near-endless library, filled with every possible permutation of a certain character set – meaning it contains every book ever written, a perfectly accurate prediction of every future event, etc. – and a whole lot of nonsense…and thus is completely useless.