How Search Engines Work

Do you wonder how search engines work? In this article, I will explain how Google Search works. There is a short version and a long version. Search engines get information from many different sources, including:

  • Web pages,
  • User-submitted content such as Google My Business and Maps user submissions,
  • Book scanning,
  • Public databases on the Internet,
  • and many other sources.

However, this page focuses on web pages. Let’s understand some keywords here before continuing!

How Do Search Engines Work?

Search engines work through three primary functions:

  1. Crawling: Scour the Internet for content, looking over the code/content for each URL they find.
  2. Indexing: Store and organize the content found during the crawling process. Once a page is in the index, it’s in the running to be displayed as a result of relevant queries.
  3. Ranking: Provide the pieces of content that will best answer a searcher’s query, which means that results are ordered from most relevant to least relevant.

How Search Engines Organize Information

Before you search, web crawlers gather information from across hundreds of billions of webpages and organize it in the Search index.

The Fundamentals Of Search

The crawling process begins with a list of web addresses from past crawls and sitemaps provided by website owners. As search engine crawlers visit these websites, they use links on those sites to discover other pages. The software pays special attention to new sites, changes to existing sites, and dead links. Computer programs determine which sites to crawl, how often, and how many pages to fetch from each site.

Search Console gives site owners granular choices about how Google crawls their site: they can provide detailed instructions about how to process pages on their sites, request a recrawl, or opt out of crawling altogether using a file called “robots.txt”. Google never accepts payment to crawl a site more frequently; it provides the same tools to all websites to ensure the best possible results for users.
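As a rough illustration of how that opt-out works in practice, here is a minimal sketch (using the placeholder domain example.com) that checks a site’s robots.txt with Python’s standard-library parser before fetching a page, the way a polite crawler would:

    # Minimal sketch: consult robots.txt before fetching, as a polite
    # crawler would. example.com is a placeholder domain.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the file

    # can_fetch(user_agent, url) asks: may this crawler visit this URL?
    print(rp.can_fetch("Googlebot", "https://example.com/private/page.html"))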

Finding Information By Crawling

The web is like an ever-growing library with billions of books and no central filing system. Search Engines use software known as web crawlers to discover publicly available web pages. Crawlers look at web pages and follow links on those pages, much like you would if you were browsing content on the web. They go from link to link and bring data about those web pages back to Google’s servers.
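The following is a toy sketch of that link-following process, written with only the Python standard library; a real crawler adds politeness delays, robots.txt checks, rendering, and deduplication at scale, and the seed URL here is a placeholder:

    # Toy breadth-first crawler: fetch a page, extract its links, and
    # follow them to discover more pages. Illustrative only.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed, limit=10):
        frontier, seen, pages = [seed], set(), {}
        while frontier and len(pages) < limit:
            url = frontier.pop(0)
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
            except OSError:
                continue  # dead link: skip it and move on
            pages[url] = html
            extractor = LinkExtractor()
            extractor.feed(html)
            # Resolve relative links and add them to the crawl frontier.
            frontier.extend(urljoin(url, link) for link in extractor.links)
        return pages

    print(len(crawl("https://example.com/")), "pages fetched")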

Organizing Information By Indexing

When crawlers find a webpage, Google’s systems render the content of the page, just as a browser does. They take note of key signals, from keywords to website freshness, and keep track of it all in the Search index.

The Google Search index contains hundreds of billions of web pages and is well over 100,000,000 gigabytes in size. It’s like the index in the back of a book — with an entry for every word seen on every webpage we index. When we index a webpage, we add it to the entries for all of the words it contains.
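Here is a minimal sketch of that “book index” idea, an inverted index mapping each word to the pages that contain it (the URLs are placeholders, and a real search index also stores positions, signals, and far more):

    # Inverted index: indexing a page adds it to the entry for every
    # word the page contains, like the index at the back of a book.
    from collections import defaultdict

    index = defaultdict(set)

    def index_page(url, text):
        for word in text.lower().split():
            index[word].add(url)

    index_page("https://example.com/a", "web crawlers discover pages")
    index_page("https://example.com/b", "crawlers follow links between pages")
    print(sorted(index["crawlers"]))  # both pages appear under "crawlers"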

With the Knowledge Graph, Google is continuing to go beyond keyword matching to better understand the people, places, and things you care about. To do this, we not only organize information about web pages but other types of information too. Today, Google Search can help you search text from millions of books from major libraries, find travel times from your local public transit agency, or help you navigate data from public sources like the World Bank.

How Search Algorithms Work

With the amount of information available on the web, finding what you need would be nearly impossible without some help sorting through it. Google ranking systems are designed to do just that: sort through hundreds of billions of web pages in our Search index to find the most relevant, useful results in a fraction of a second, and present them in a way that helps you find what you’re looking for.

These ranking systems are made up of not one, but a whole series of algorithms. To give you the most useful information, Search algorithms look at many factors, including the words of your query, relevance and usability of pages, expertise of sources, and your location and settings. The weight applied to each factor varies depending on the nature of your query—for example, the freshness of the content plays a bigger role in answering queries about current news topics than it does about dictionary definitions.

To help ensure Search algorithms meet high standards of relevance and quality, Google has a rigorous process that involves both live tests and thousands of trained external Search Quality Raters from around the world. These Quality Raters follow strict guidelines that define Google’s goals for Search algorithms; the guidelines are publicly available for anyone to see.

Learn more below about the key factors that help determine which results are returned for your query:

  • Meaning of your query
  • Relevance of webpages
  • Quality of content
  • Usability of webpages
  • Context and settings

Meaning of your query

To return relevant results for your query, Google first needs to establish what information you’re looking for: the intent behind your query. Understanding intent is fundamentally about understanding language and is a critical aspect of Search. Search engines build language models to try to decipher what strings of words they should look up in the index.

This involves steps as seemingly simple as interpreting spelling mistakes and extends to trying to understand the type of query you’ve entered by applying some of the latest research on natural language understanding. For example, our synonym system helps Search know what you mean by establishing that multiple words mean the same thing. This capability allows Search to match the query “How to change a lightbulb” with pages describing how to replace a lightbulb. This system took over five years to develop and significantly improves results in over 30% of searches across languages.
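As a toy illustration of the mechanics only (the synonym table below is hand-written, whereas Google’s system is learned), query expansion might look like this:

    # Expand each query word into the set of terms treated as equivalent,
    # so "change a lightbulb" also matches pages that say "replace".
    SYNONYMS = {"change": {"change", "replace", "swap"}}

    def expand(query):
        return [SYNONYMS.get(word, {word}) for word in query.lower().split()]

    for term_set in expand("how to change a lightbulb"):
        print(term_set)  # the "change" slot also matches "replace" and "swap"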

Beyond synonyms, Search algorithms also try to understand what category of information you are looking for. Is it a very specific search or a broad query? Are there words such as “review” or “pictures” or “opening hours” that indicate a specific information need behind the search? Is the query written in French, suggesting that you want answers in that language? Or are you searching for a nearby business and want local info?

A particularly important dimension of this query categorization is an analysis of whether your query is seeking out fresh content. If you search for trending keywords, our freshness algorithms will interpret that as a signal that up-to-date information might be more useful than older pages. This means that when you’re searching for the latest “NFL scores”, “dancing with the stars” results, or “exxon earnings”, you’ll see the latest information.
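A toy sketch of such a freshness signal follows; the trigger words and the decay formula are invented for illustration and are far simpler than Google’s actual freshness systems:

    # If the query looks time-sensitive, boost newer documents.
    FRESH_HINTS = {"scores", "earnings", "news", "results"}

    def freshness_boost(query, doc_age_days):
        if not FRESH_HINTS & set(query.lower().split()):
            return 1.0  # query doesn't look time-sensitive: no adjustment
        return 1.0 / (1.0 + doc_age_days)  # newer pages score higher

    print(freshness_boost("nfl scores", doc_age_days=0.1))  # strong boost
    print(freshness_boost("dictionary definition", doc_age_days=0.1))  # 1.0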


What is RankBrain?

RankBrain is the machine learning component of Google’s core algorithm. Machine learning refers to programs that improve their predictions over time through new observations and training data. In other words, RankBrain is always learning, and because it’s always learning, search results should be constantly improving.

For example, if RankBrain notices a lower-ranking URL providing a better result to users than the higher-ranking URLs, you can bet that RankBrain will adjust those results, moving the more relevant result higher and demoting less relevant pages as a byproduct.

Engagement metrics: correlation, causation, or both?

With Google rankings, engagement metrics are most likely part correlation and part causation.

When we say engagement metrics, we mean data that represents how searchers interact with your site from search results. This includes things like:

  • Clicks (visits from search)
  • Time on page (amount of time the visitor spent on a page before leaving it)
  • Bounce rate (the percentage of all website sessions where users viewed only one page)
  • Pogo-sticking (clicking on an organic result and then quickly returning to the SERP to choose another result)
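As a rough sketch (the session-log format here is invented for illustration), two of these metrics could be computed like so:

    # Compute bounce rate and pogo-sticking rate from toy session logs.
    sessions = [
        {"pages_viewed": 1, "returned_to_serp_secs": 4},     # pogo-stick
        {"pages_viewed": 3, "returned_to_serp_secs": None},
        {"pages_viewed": 1, "returned_to_serp_secs": None},  # bounce only
    ]

    bounce_rate = sum(s["pages_viewed"] == 1 for s in sessions) / len(sessions)
    pogo_rate = sum(
        s["returned_to_serp_secs"] is not None and s["returned_to_serp_secs"] < 10
        for s in sessions
    ) / len(sessions)

    print(f"bounce rate: {bounce_rate:.0%}, pogo-sticking: {pogo_rate:.0%}")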

Many tests, including Moz’s own ranking factor survey, have indicated that engagement metrics correlate with higher ranking, but causation has been hotly debated. Are good engagement metrics just indicative of highly ranked sites? Or are sites ranked highly because they possess good engagement metrics?

Useful Responses Take Many Forms

Larry Page once described the perfect search engine as understanding exactly what you mean and giving you back exactly what you want. Over time, testing has consistently shown that people want quick answers to their queries. Search engines have made a lot of progress at delivering the most relevant answers, faster, and in formats that are most helpful to the type of information you are seeking.

If you are searching for the weather, you most likely want the weather forecast on the results page, not just links to weather sites. Or directions: if your query is “Directions to San Francisco airport”, you want a map with directions, not just links to other sites. This is especially important on mobile devices where bandwidth is limited and clicking between sites can be slow.

Thousands of engineers and scientists are hard at work refining algorithms and building useful new ways to search. You can find some of these Search innovations below. With some 3,234 improvements to Google Search in 2018 alone, these are just a sample of the ways Google has been making Search better over time.

Deliver The Most Relevant And Reliable Information Available

To help you find what you’re looking for, search engines consider many factors, including the words in your question, the content of pages, the expertise of sources, and your language and location. Every day, 15% of searches are new, so search engines use automated systems to return the most relevant and reliable information they can find.

To measure whether people continue to find results relevant and reliable, Search Engines have a rigorous process that involves extensive testing and thousands of independent people around the world who rate the quality of Search.

Useful responses also include Knowledge Graph panels and directions with live traffic information. A few examples follow.

Direct results

e.g. Sundance Showtimes

Sometimes you want direct results for certain queries, so Google teams up with businesses that can deliver the information and services you are looking for and licenses their content to provide useful responses right on the Search results page. For instance, if you’re looking for showtimes at your local cinema, Google partners with data providers that have up-to-date, reliable information about when films are showing in your area, and with ticketing service providers to help you buy tickets. This is also how search engines bring you the weather forecast and sports scores directly on the Search results page.

Featured snippets

e.g. When was the 21st amendment passed in the U.S.?

When you ask Google a question, its goal is to help you find the answer quickly and easily. Featured snippets provide quick answers by drawing attention to programmatically generated excerpts from websites that Google’s algorithms deem relevant to the specific question being asked. Every featured snippet includes information quoted from a third-party website, plus a link to the page, the page title, and the URL.

Rich Lists

e.g. Famous female astronomers

The best answer to your question is not always a single entity, but a list or group of connected people, places, or things. So when you search for [California lighthouses] or [famous female astronomers], search engines will show you a list of these things across the top of the page. By combining their Knowledge Graph with the collective wisdom of the web, the search engines can even provide lists like [best action movies of 2018] or [things to do in Rome]. If you click on an item, you can then explore the result more deeply on the web.

Explore your interests with Discover

Even when you don’t have a specific query in mind, you still may want to be inspired by the things you care most about. That’s why Google built Discover. Found in the Google app, on Android home screens, and on Google’s mobile homepage, Discover is a personalized feed that helps you explore content tailored to your interests. You can also customize the experience by following topics and indicating when you want to see more or less of a particular topic.

Evolving to meet the ever-changing web

The web is constantly evolving, with hundreds of new web pages published every second. That’s reflected in the results you see in Google Search: Search Engines constantly recrawl the web to index new content. Depending on your query, some results pages change rapidly, while others are more stable. For example, when you’re searching for the latest score of a sports game we have to perform up-to-the-second updates, while results about a historical figure may remain static for years at a time.

Today, Google handles trillions of searches each year. Every day, 15% of the queries Google processes are ones it has never seen before. Building Search algorithms that can serve the most useful results for all these queries is a complex challenge that requires ongoing quality testing and investment.

The Short Version Of How Google Search Works

Google follows three basic steps to generate results from web pages:

Crawling

The first step is finding out what pages exist on the web. There isn’t a central registry of all web pages, so Google must constantly search for new pages and add them to its list of known pages. Some pages are known because Google has already visited them before. Other pages are discovered when Google follows a link from a known page to a new page. Still other pages are discovered when a website owner submits a list of pages (a sitemap) for Google to crawl. If you’re using a managed web host, such as Wix or Blogger, it might tell Google to crawl any updated or new pages that you make.


Once Google discovers a page URL, it visits, or crawls, the page to find out what’s on it. Google renders the page and analyzes both the text and non-text content and the overall visual layout to decide where the page should appear in search results. The better Google can understand your site, the better it can match the site to people who are looking for your content.

To improve your site crawling:

  • Verify that Google can reach the pages on your site and that they look correct. Google accesses the web as an anonymous user (a user with no passwords or information). Google should also be able to see all the images and other elements of the page to be able to understand it correctly. You can do a quick check by typing your page URL in the Mobile-Friendly Test.
  • If you’ve created or updated a single page, you can submit an individual URL to Google. To tell Google about many new or updated pages at once, use a sitemap.
  • If you ask Google to crawl only one page, make it your home page. Your home page is the most important page on your site, as far as Google is concerned. To encourage a complete site crawl, be sure that your home page (and all pages) contain a good site navigation system that links to all the important sections and pages on your site; this helps users (and Google) find their way around your site. For smaller sites (fewer than 1,000 pages), making Google aware of only your homepage is all you need, provided that Google can reach all your other pages by following a path of links that starts from your homepage.
  • Get your page linked to another page that Google already knows about. However, be warned that links in advertisements, links that you pay for in other sites, links in comments, or other links that don’t follow the Google Webmaster Guidelines won’t be followed by Google.
Google doesn't accept payment to crawl a site more frequently, or rank it higher. If anyone tells you otherwise, they're wrong.
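For the sitemap route mentioned above, here is a minimal sketch of generating one with Python’s standard library (the URLs are placeholders; real sitemaps also support optional tags such as <lastmod>):

    # Build a minimal sitemap.xml listing the pages you want crawled.
    import xml.etree.ElementTree as ET

    urls = ["https://example.com/", "https://example.com/about"]

    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    )
    for url in urls:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url

    ET.ElementTree(urlset).write(
        "sitemap.xml", encoding="utf-8", xml_declaration=True
    )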

Indexing

After a page is discovered, Google tries to understand what the page is about. This process is called indexing. Google analyzes the content of the page, catalogs images and video files embedded on the page, and otherwise tries to understand the page. This information is stored in the Google index, a huge database stored in many, many (many!) computers.

To improve your page indexing:

  • Create short, meaningful page titles.
  • Use page headings that convey the subject of the page.
  • Use text rather than images to convey content. Google can understand some image and video content, but not as well as it can understand text. At a minimum, annotate your videos and images with alt text and other attributes as appropriate.

Serving (and ranking)

When a user types a query, Google tries to find the most relevant answer from its index based on many factors. Google tries to determine the highest quality answers, and factor in other considerations that will provide the best user experience and most appropriate answer, by considering things such as the user’s location, language, and device (desktop or phone). For example, searching for “bicycle repair shops” would show different answers to a user in Paris than it would to a user in Hong Kong. Google doesn’t accept payment to rank pages higher, and ranking is done programmatically.

To improve your serving and ranking, see the suggestions under “Improving your serving” later in this article.

The Long Version Of How Google Search Works

Want more information? Here it is: the long version.

Crawling

Crawling is the process by which Googlebot visits new and updated pages to be added to the Google index.

Google uses a huge set of computers to fetch (or “crawl”) billions of pages on the web. The program that does the fetching is called Googlebot (also known as a robot, bot, or spider). Googlebot uses an algorithmic process to determine which sites to crawl, how often, and how many pages to fetch from each site.

Google’s crawl process begins with a list of web page URLs, generated from previous crawl processes, augmented by Sitemap data provided by website owners. When Googlebot visits a page it finds links on the page and adds them to its list of pages to crawl. New sites, changes to existing sites, and dead links are noted and used to update the Google index.

During the crawl, Google renders the page using a recent version of Chrome. As part of the rendering process, it runs any page scripts it finds. If your site uses dynamically-generated content, be sure that you follow the JavaScript SEO basics.

Primary crawl / secondary crawl

Google uses two different crawlers for crawling websites: a mobile crawler and a desktop crawler. Each crawler type simulates a user visiting your page with a device of that type.

Google uses one crawler type (mobile or desktop) as the primary crawler for your site. All pages on your site that are crawled by Google are crawled using the primary crawler. The primary crawler for all new websites is the mobile crawler.

In addition, Google recrawls a few pages on your site with the other crawler type (mobile or desktop). This is called the secondary crawl, and is done to see how well your site works with the other device type.
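Illustratively, a crawler simulates a device type by changing its User-Agent header when fetching; the header strings and URL below are simplified stand-ins, not Googlebot’s actual tokens:

    # Fetch the same page as a "mobile" and a "desktop" client.
    from urllib.request import Request, urlopen

    USER_AGENTS = {
        "mobile": "Mozilla/5.0 (Linux; Android 10) ExampleBot/1.0",
        "desktop": "Mozilla/5.0 (X11; Linux x86_64) ExampleBot/1.0",
    }

    for device, ua in USER_AGENTS.items():
        req = Request("https://example.com/", headers={"User-Agent": ua})
        body = urlopen(req, timeout=5).read()
        print(device, len(body), "bytes")  # sites may serve different HTML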

How does Google know which pages not to crawl?

  • Pages blocked in robots.txt won’t be crawled, but still might be indexed if linked to by another page. (Google can infer the content of the page by a link pointing to it, and index the page without parsing its contents.)
  • Google can’t crawl any pages not accessible by an anonymous user. Thus, any login or other authorization protection will prevent a page from being crawled.
  • Pages that have already been crawled and are considered duplicates of another page are crawled less frequently.

Improve your crawling

Use the techniques listed under “To improve your site crawling” earlier in this article to help Google discover the right pages on your site.


Indexing

Googlebot processes each page it crawls in order to understand the content of the page. This includes processing the textual content, key content tags and attributes such as <title> tags and alt attributes, images, videos, and more. Googlebot can process many, but not all, content types; for example, it cannot process the content of some rich media files.

Somewhere between crawling and indexing, Google determines whether a page is a duplicate of another page or the canonical version. If the page is considered a duplicate, it will be crawled much less frequently. Similar pages are grouped together into a document: a group of one or more pages that includes the canonical page (the most representative of the group) and any duplicates found (which might simply be alternate URLs that reach the same page, or alternate mobile or desktop versions of the same page).

Note that Google doesn’t index pages with a noindex directive (header or tag). However, it must be able to see the directive; if the page is blocked by a robots.txt file, a login page, or some other mechanism, it is possible that the page will be indexed even if Google never visited it!
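Here is a sketch of how a crawler might look for the directive in both places it can appear, the X-Robots-Tag response header and the robots meta tag (example.com is a placeholder):

    # Check a fetched page for a noindex directive in header or markup.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class RobotsMetaParser(HTMLParser):
        noindex = False

        def handle_starttag(self, tag, attrs):
            d = dict(attrs)
            if tag == "meta" and (d.get("name") or "").lower() == "robots":
                self.noindex = "noindex" in (d.get("content") or "").lower()

    resp = urlopen("https://example.com/", timeout=5)
    in_header = "noindex" in (resp.headers.get("X-Robots-Tag") or "").lower()
    parser = RobotsMetaParser()
    parser.feed(resp.read().decode("utf-8", "replace"))
    print("noindex:", in_header or parser.noindex)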

Improve your indexing

There are many techniques to improve Google’s ability to understand the content of your page; see the suggestions under “To improve your page indexing” earlier in this article.

What is a “document”?

Internally, Google represents the web as an (enormous) set of documents. Each document represents one or more web pages. These pages are either identical or very similar, but are essentially the same content, reachable by different URLs. The different URLs in a document can lead to exactly the same page (for instance, example.com/dresses/summer/1234 and example.com?product=1234 might show the same page), or the same page with small variations intended for users on different devices (for example, example.com/mypage for desktop users and m.example.com/mypage for mobile users).

Google chooses one of the URLs in a document and defines it as the document’s canonical URL. The document’s canonical URL is the one that Google crawls and indexes most often; the other URLs are considered duplicates or alternates, and may occasionally be crawled, or served according to the user request: for instance, if a document’s canonical URL is the mobile URL, Google will still probably serve the desktop (alternate) URL for users searching on desktop.
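A hedged sketch of that grouping step follows: the normalization rules below (dropping query strings, folding an m. subdomain) are invented simplifications, not Google’s actual deduplication logic:

    # Group duplicate URLs into one "document" by normalizing them.
    from collections import defaultdict
    from urllib.parse import urlsplit

    def normalize(url):
        parts = urlsplit(url)
        host = parts.netloc.removeprefix("m.")  # fold mobile subdomain
        return f"{parts.scheme}://{host}{parts.path}"  # drop query string

    documents = defaultdict(list)
    for url in [
        "https://example.com/dresses/summer/1234",
        "https://example.com/dresses/summer/1234?ref=home",
        "https://m.example.com/dresses/summer/1234",
    ]:
        documents[normalize(url)].append(url)

    for canonical, members in documents.items():
        print(canonical, "<-", members)  # one document, three URLs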

Most reports in Search Console attribute data to the document’s canonical URL. Some tools (such as the URL Inspection tool) support testing alternate URLs, but inspecting the canonical URL should provide information about the alternate URLs as well.

You can tell Google which URL you prefer to be canonical, but Google may choose a different canonical for various reasons.

Summary of terms, and how they are used in Search Console:

  • Document: A collection of similar pages. Has a canonical URL, and possibly alternate URLs, if your site has duplicate pages. URLs in the document can be from the same or a different organization (the root domain, for example “google” in www.google.com). Google chooses the best URL to show in Search results according to the platform (mobile/desktop), user language or location, and many other variables. Google discovers related pages on your site by organic crawling, or by site-implemented features such as redirects or <link rel=alternate/canonical> tags. Related pages on other organizations can only be marked as alternates if explicitly coded by your site (through redirects or link tags).
  • URL: The URL used to reach a given piece of content on a site. The site might resolve different URLs to the same page.
  • Page: A given web page, reached by one or more URLs. There can be different versions of a page, depending on the user’s platform (mobile, desktop, tablet, and so on).
  • Version: One variation of the page, typically categorized as “mobile”, “desktop”, and “AMP” (although AMP can itself have mobile and desktop versions). Each version can have a different URL (example.com vs m.example.com) or the same URL (if your site uses dynamic serving or responsive web design, the same URL can show different versions of the same page) depending on your site configuration. Language variations are not considered different versions, but different documents.
  • Canonical page or URL: The URL that Google considers as most representative of the document. Google always crawls this URL; duplicate URLs in the document are occasionally crawled as well.
  • Alternate/duplicate page or URL: The document URL that Google might occasionally crawl. Google also serves these URLs if they are appropriate to the user and request (for example, an alternate URL for desktop users will be served for desktop requests rather than a canonical mobile URL).
  • Site: Usually used as a synonym for a website (a conceptually related set of web pages), but sometimes used as a synonym for a Search Console property, although a property can actually be defined as only part of a site. A site can span subdomains (and even domains, for properly linked AMP pages).

Pages with the same content in different languages are stored in different documents that reference each other using hreflang tags; this is why it’s important to use hreflang tags for translated content.

Serving Results

When a user enters a query, Google’s machines search the index for matching pages and return the results believed to be the most relevant to the user. Relevancy is determined by hundreds of factors, and Google continually works to improve its algorithms. Google considers the user experience in choosing and ranking results, so be sure that your page loads fast and is mobile-friendly.
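To make the mechanics concrete, here is a toy serving function over a hand-built inverted index; real ranking weighs hundreds of factors, and the pages and scoring here are invented for illustration:

    # Serve a query: look each word up in the index and order pages by
    # how many query words they match.
    from collections import Counter

    index = {
        "bicycle": {"shop-paris.html", "shop-hk.html"},
        "repair": {"shop-paris.html"},
    }

    def serve(query):
        scores = Counter()
        for word in query.lower().split():
            scores.update(index.get(word, ()))
        return [url for url, _ in scores.most_common()]

    print(serve("bicycle repair"))  # shop-paris.html matches more words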

Improving your serving

  • If your results are aimed at users in specific locations or languages, you can tell Google your preferences.
  • Be sure that your page loads fast and is mobile-friendly.
  • Follow the Webmaster Guidelines to avoid common pitfalls and improve your site’s ranking.
  • Consider implementing Search result features for your site, such as recipe cards or article cards.
  • Implement AMP for faster loading pages on mobile devices. Some AMP pages are also eligible for additional search features, such as the top stories carousel.
  • Google’s algorithm is constantly being improved; rather than trying to guess the algorithm and design your page for that, work on creating good, fresh content that users want, and follow Google’s guidelines.