Unless you’ve been living under a rock, you’ve probably noticed that AI is becoming increasingly integrated into all aspects of life, including online search.
When you search for something on Google, more likely than not, an AI Overview will be the first thing to pop up. Beyond AI Overviews, people are turning to large language models (LLMs) like ChatGPT for their search needs, as they start to prefer direct, synthesized answers over poring through a search engine results page (SERP).
With this information in mind, you’re probably wondering: where does AI get its data from, and how can I be part of it?
This isn’t just top of mind for you, but for everyone who wants to use these technological advancements to drive growth for their brand.
In this piece, we’ll lay out exactly how to be included in AI answers. And trust us: we’re the experts here.

Before we tell you how to be included, you need to know what’s included.
Publicly available web data is the most common source for large-scale, general-purpose AI models. It’s collected by web crawlers and scrapers that automatically browse and extract data from a huge variety of websites.
This includes text from sources like Wikipedia, news articles, and even conversations users are having on social media platforms and forums like Reddit. The sheer volume of this data provides a broad foundation for LLMs, but its unstructured nature means it can be messy, and the AI model can perpetuate the biases present in public discourse.
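To make the collection step concrete, here is a minimal sketch of how a scraper might reduce a page to its readable text. It uses only Python's standard library; the class name `VisibleTextExtractor`, the helper `extract_text`, and the sample page are illustrative assumptions, not a description of how any particular AI crawler works.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects human-readable text and skips script and style blocks."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html: str) -> str:
    """Return the page's visible text as one space-separated string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page used only for illustration.
sample_page = """
<html><head><title>What is AEO?</title><style>p {color: red}</style></head>
<body><h1>What is AEO?</h1>
<p>AEO is the practice of optimizing content for AI answers.</p>
<script>trackVisit();</script></body></html>
"""

print(extract_text(sample_page))
```

Even in this toy version, clearly written text in ordinary headings and paragraphs is what survives extraction, while scripts and styling are discarded.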
Open datasets are intentionally created and released by research institutions, governments, or organizations for public use, and are often structured and labeled.
Notable examples include ImageNet, a massive dataset of over 14 million images used to train computer vision models, and the C4 dataset, a text collection derived from the Common Crawl web scrapes. The availability of these datasets is a cornerstone of the open-source AI community, allowing researchers and developers worldwide to build upon each other's work without having to collect data from scratch.
Companies often use their own internal data to build specialized AI models that give them a competitive edge. This can include valuable information like customer service logs and purchase history. Additionally, some companies, like Getty Images, license their copyrighted content to AI developers for a fee.
The value of this data lies in its exclusivity, but its use brings up legal and ethical challenges, particularly concerning data privacy and ownership.
Every time you interact with AI, you’re potentially providing new data that it can be trained on. The prompts you use in an image generator, the questions you ask a chatbot, and the corrections you provide to its output are all forms of data. This process, often called a "feedback loop," helps refine the model's performance and align it with users’ expectations.
Answer Engine Optimization (AEO) is the practice of optimizing your content to be cited in an answer by LLMs like ChatGPT, Gemini, and Perplexity, and SERP features like AI Overviews. Instead of competing to rank on a SERP, the goal is to be the authoritative source that AI uses to formulate a response, requiring a shift in how you structure and present information.
To make your content a top choice for AI models, focus on clarity, authority, and structure.
To help with this, you can use Goodie, which offers AI visibility and optimization recommendations for your AEO efforts. Goodie can even help you write content if you’re not sure where to start.

We previously outlined the types of data that AI models are trained on and draw from to generate answers. You can influence each of these sources by creating targeted content that establishes you as an authority in a specific niche.
Public web data is the foundation of generative AI’s outputs. To be visible here, you need to create comprehensive, well-structured content that covers broad topics.
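To do this, you can use a hub-and-spoke model: a central hub page that covers the broad topic, linked to and from spoke pages that each cover one subtopic in depth.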
By building this interconnected web of content, you’re helping AI understand the relationships between your content, recognize your site as an authoritative entity on the topic, and ultimately, choose your content to cite in its answers.
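One way to keep that web of content intact is to audit the links between a hub page and its spokes. The sketch below is a rough illustration under stated assumptions: the `Page` class, the `audit_hub_and_spoke` function, and the URLs are all hypothetical, and it assumes you already have each page's internal links collected (for example, from a site crawl).

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    url: str
    # Internal URLs this page links to (assumed to be gathered elsewhere).
    links: set[str] = field(default_factory=set)


def audit_hub_and_spoke(hub: Page, spokes: list[Page]) -> list[str]:
    """Return a description of every missing hub-to-spoke or spoke-to-hub link."""
    problems = []
    for spoke in spokes:
        if spoke.url not in hub.links:
            problems.append(f"hub does not link to {spoke.url}")
        if hub.url not in spoke.links:
            problems.append(f"{spoke.url} does not link back to hub")
    return problems


# Hypothetical site structure used only for illustration.
hub = Page("/guide/aeo", {"/guide/aeo/structured-content", "/guide/aeo/citations"})
spokes = [
    Page("/guide/aeo/structured-content", {"/guide/aeo"}),
    Page("/guide/aeo/citations"),  # forgot to link back to the hub
]

for problem in audit_hub_and_spoke(hub, spokes):
    print(problem)
```

Running a check like this whenever a new spoke is published helps catch orphaned pages before they weaken the topic cluster.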
Although proprietary datasets are usually private and inaccessible to the public, you can create content that mirrors the structured, expert-level data an AI would seek out.
You can do this by embracing the “living document” strategy.
This strategy involves creating unique, data-rich content on your website and updating it continuously, which signals to AI models and search engines that your site is a reliable and up-to-date source of information.
By adopting this strategy, you're gaining a competitive advantage in AEO as you position your brand as a leading authority in your niche, increasing the likelihood of being cited. And if you need some extra support in this effort, Goodie can provide you with analytics to track how your content is performing in AI platforms.
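A living document only works if the updates actually happen, so it can help to flag pages that have gone too long without a refresh. The following is a minimal sketch, not a prescribed workflow: the function `find_stale_pages`, the URLs, and the 180-day threshold are assumptions chosen for illustration, and the right refresh interval depends on your niche.

```python
from datetime import date, timedelta


def find_stale_pages(
    last_updated: dict[str, date],
    today: date,
    max_age_days: int = 180,  # arbitrary example threshold, not a recommendation
) -> list[str]:
    """Return the URLs whose last update is older than the allowed age."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(url for url, updated in last_updated.items() if updated < cutoff)


# Hypothetical pages and dates used only for illustration.
pages = {
    "/reports/ai-search-stats": date(2025, 1, 10),
    "/guide/aeo": date(2024, 3, 2),
}

for url in find_stale_pages(pages, today=date(2025, 3, 1)):
    print(f"needs a refresh: {url}")
```

Passing `today` explicitly keeps the check reproducible; in practice you would likely pass `date.today()`.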
AI crawlers index forums, social media, and review sites to understand human language and intent, which you can target by actively participating in these spaces.
Here’s how to effectively shape conversations on these platforms, increasing your likelihood of being cited:
By strategically engaging on these platforms, you're killing two birds with one stone: you're providing value to users while simultaneously influencing the AI crawlers that scrape these public discussions.

AI models are changing the very nature of search. The days of competing to rank on a SERP are being replaced by the need to be the source that an AI uses in its answers.
By creating content that is clear, authoritative, and well-structured, you are not just building a blog; you’re building a trusted dataset.
The future of AI depends on its data. Your choice to be an intentional, high-quality source of data helps AI generate accurate, high-quality results.
If you're ready to start building a dataset of your own, consider Goodie for these efforts.