
Using ‘wget’ to Back Up Your Website

The wget command is a powerful utility in Unix-based systems used to download content from the web. One of its most useful features is the ability to download entire websites recursively, allowing you to create local copies of a site for offline use or archiving purposes. Below, we will break down how the wget -r https://example.com command works and how you can customize it with various options.

What Does the -r Option Do?

The -r option stands for “recursive,” and it tells wget to download the website and all of its content. This includes images, stylesheets, and any other files linked within the site. When you use this option, wget will follow links within the pages to retrieve all associated resources and create a complete local copy.

Basic Syntax

The basic command to download a website recursively is:

wget -r https://example.com

This will download the website starting from the homepage and follow links recursively to gather the files linked from the site. By default, wget recurses up to five levels of links and saves everything into a directory named after the host (here, example.com/) inside your current working directory.

Additional Options for Customizing Your Download

While the -r option is the core feature, there are several other options you can combine with it to further control how wget behaves.

1. Limiting Depth of Recursion

If you don’t want wget to download the entire site, you can limit the depth of recursion. Use the --level option to specify how many levels deep wget should go. For example:

wget -r --level=2 https://example.com

This command will download the site recursively, but follow links no more than two hops from the starting page (the default depth is five).

2. Downloading Only Certain File Types

If you’re interested in downloading specific types of files, you can use the --accept or --reject options. For instance, to download only images (JPG, PNG, GIF), you can run:

wget -r --accept=jpg,png,gif https://example.com

This will keep only the specified image formats on disk. Note that wget typically still downloads HTML pages so it can discover links, deleting them afterwards if they do not match the accept list.

3. Preventing the Download of Certain Files

You can also prevent certain file types or resources from being downloaded using the --reject option. For example, to exclude PDF files:

wget -r --reject=pdf https://example.com

This command will download everything except PDF files.

4. Staying Below the Starting URL

The --no-parent option does not filter by file type; instead, it stops wget from ascending into the parent directory of the URL you give it. This keeps the download confined to the section of the site you asked for, while the structure beneath it remains intact:

wget -r --no-parent https://example.com
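
With the root URL above, --no-parent makes little difference; it matters when the starting URL includes a path. As a sketch, assuming the documentation lives under a hypothetical /docs/ directory, the following keeps wget inside that directory:

wget -r --no-parent https://example.com/docs/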

5. Handling Timeouts and Retries

If the server is slow or temporarily unreachable, you can control how long wget waits for a response and how many times it retries each file. For instance:

wget -r --timeout=30 --tries=5 https://example.com

This sets the network timeout to 30 seconds and allows up to 5 attempts per file (wget’s defaults are a 900-second read timeout and 20 tries, so these values make it give up sooner rather than hang on a dead server).

Common Use Cases

  • Archiving a Website: Download a full website for backup or archiving purposes, useful for offline reading or saving a site before it’s updated or taken down (a combined example follows this list).
  • Offline Browsing: Save an entire site to browse it offline, such as documentation or a personal blog.
  • Website Crawling: Use the recursive download feature to crawl and retrieve resources from a site for further analysis or processing.
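
As a rough sketch for the archiving case, the options covered above can be combined into a single run. The depth, timeout, and retry values here are only placeholders to adjust for the site you are backing up:

wget -r --level=3 --no-parent --timeout=30 --tries=5 https://example.com

Everything ends up under a directory named after the host (example.com/), which you can then compress or copy to wherever your backups live.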

Conclusion

The wget command is a versatile tool that allows you to download entire websites recursively. By customizing the options with different parameters, you can fine-tune the command to suit your needs, from limiting recursion depth to specifying which file types to include or exclude. Whether you’re archiving content, downloading resources for offline use, or scraping data, wget is an invaluable tool for anyone working with the web.