The `wget` command is a powerful utility on Unix-based systems for downloading content from the web. One of its most useful features is the ability to download entire websites recursively, letting you create a local copy of a site for offline use or archiving. Below, we break down how `wget -r https://example.com` works and how you can customize it with various options.
What Does the `-r` Option Do?
The `-r` option stands for “recursive,” and it tells `wget` to follow the links it finds in each downloaded page and fetch the linked documents as well, including images, stylesheets, and other files referenced within the site. When you use this option, `wget` works its way through the pages to retrieve the associated resources and build a complete local copy.
Basic Syntax
The basic command to download a website recursively is:
wget -r https://example.com
This will download the website starting from the homepage and follow links recursively to gather all files linked from the site.
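By default, `wget` mirrors the site’s path structure under a directory named after the host, so a run against example.com ends up with a layout roughly like the following (the individual file names here are purely illustrative):
example.com/index.html
example.com/css/style.css
example.com/images/logo.png
If you prefer a flatter layout, `--no-directories` collects everything into one directory and `--no-host-directories` drops the leading host directory.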
Additional Options for Customizing Your Download
While the `-r` option is the core feature, there are several other options you can combine with it to further control how `wget` behaves.
1. Limiting Depth of Recursion
If you don’t want `wget` to download the entire site, you can limit the depth of recursion. Use the `--level` option to specify how many levels deep `wget` should go. For example:
wget -r --level=2 https://example.com
This command will download the site recursively but only up to two levels deep.
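Keep in mind that plain `wget -r` is not unlimited either: the default maximum depth is 5. If you really do want to follow links without a depth cap, pass `inf` as the level, as in this sketch:
wget -r --level=inf https://example.com
Use this carefully on large sites, ideally together with the timeout, retry, and wait options covered later.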
2. Downloading Only Certain File Types
If you’re interested in downloading specific types of files, you can use the `--accept` or `--reject` options. For instance, to download only images (JPG, PNG, GIF), you can run:
wget -r --accept jpg,png,gif https://example.com
This limits what is kept to the specified image formats. Note that `wget` still has to fetch HTML pages to discover the image links; pages that don’t match the accept list are typically removed after they have been parsed.
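The same filter can be written with the short `-A` flag, and combined with `--no-directories` if you’d rather gather all matching files into a single folder instead of recreating the site’s directory tree. A sketch:
wget -r -A jpg,png,gif --no-directories https://example.com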
3. Preventing the Download of Certain Files
You can also prevent certain file types or resources from being downloaded using the `--reject` option. For example, to exclude PDF files:
wget -r --reject pdf https://example.com
This command will download everything except PDF files.
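`--reject` also accepts a comma-separated list, so several suffixes can be excluded at once, as in this sketch:
wget -r --reject pdf,zip,mp4 https://example.com
Recent versions of wget additionally provide `--reject-regex` for filtering on complete URLs rather than file suffixes.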
4. Staying Below the Starting Directory
If you want to restrict the download to one section of a site, use the `--no-parent` option. It stops `wget` from ascending into the parent directory, so only pages at or below the specified URL are downloaded while the site structure remains intact:
wget -r --no-parent https://example.com
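`--no-parent` is most useful when you start from a subdirectory rather than the site root. As a sketch (the /docs/ path is purely illustrative), the following fetches a documentation subtree without wandering into the rest of the site:
wget -r --no-parent https://example.com/docs/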
5. Handling Timeouts and Retries
If the server is slow or temporarily unreachable, you can configure `wget` to retry automatically or wait longer before timing out. For instance, to increase the timeout and retry limit:
wget -r --timeout=30 --tries=5 https://example.com
This will set the timeout to 30 seconds and allow up to 5 retries in case of failure.
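If the server is flaky rather than merely slow, it can also help to pause between attempts: `--waitretry` backs off between failed retries, and `--wait` inserts a delay between successive downloads, which is friendlier to the server as well. A sketch combining these with the options above:
wget -r --timeout=30 --tries=5 --waitretry=10 --wait=1 https://example.com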
Common Use Cases
- Archiving a Website: Download a full website for backup or archiving, useful for offline reading or for preserving a site before it’s updated or taken down (a commonly used combined command is sketched after this list).
- Offline Browsing: Save an entire site to browse it offline, such as documentation or a personal blog.
- Website Crawling: Use the recursive download feature to crawl and retrieve resources from a site for further analysis or processing.
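For the archiving and offline-browsing cases above, a commonly used combination pairs recursion with a few companion options: `--mirror` (shorthand for recursion with infinite depth and timestamping), `--page-requisites` to pull in the images and CSS each page needs, `--convert-links` to rewrite links so they work locally, and `--adjust-extension` to save pages with .html extensions. One typical form, offered as a sketch rather than a universal recipe:
wget --mirror --page-requisites --convert-links --adjust-extension --no-parent https://example.com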
Conclusion
The `wget` command is a versatile tool that allows you to download entire websites recursively. By combining `-r` with other parameters, you can fine-tune the command to suit your needs, from limiting recursion depth to specifying which file types to include or exclude. Whether you’re archiving content, downloading resources for offline use, or scraping data, `wget` is an invaluable tool for anyone working with the web.