In this guide, you will learn:
- What Gospider is and how it works
- What features it offers
- How to use it for web crawling
- How to integrate it with Colly for web scraping
- Its main limitations and how to bypass them
Let’s dive in!
What Is Gospider?
Gospider is a fast and efficient web crawling CLI tool written in Go. It is built to scan websites and extract URLs in parallel, handling multiple requests and domains at the same time. Additionally, it respects `robots.txt` and can discover links even in JavaScript files.
Gospider offers several customization flags to control crawling depth, request delays, and more. It also supports proxy integration, along with various other options for greater control over the crawling process.
What Makes Gospider Unique for Web Crawling?
To better understand why Gospider is special for web crawling, let’s explore its features in detail and examine the supported flags.
Features
Below are the main features provided by Gospider when it comes to web crawling:
- Fast web crawling: Efficiently crawl single websites at high speed.
- Parallel crawling: Crawls multiple sites concurrently for faster data collection.
- `sitemap.xml` parsing: Automatically handles sitemap files for enhanced crawling.
- `robots.txt` parsing: Complies with `robots.txt` directives for ethical crawling.
- JavaScript link parsing: Extracts links from JavaScript files.
- Customizable crawl options: Adjust crawl depth, concurrency, delay, timeouts, and more with flexible flags.
- `User-Agent` randomization: Randomizes between mobile and web User-Agents for more realistic requests. Discover the best `User-Agent` for web crawling.
- Cookie and header customization: Allows custom cookies and HTTP headers.
- Link finder: Identifies URLs and other resources on a site.
- Find AWS S3 buckets: Detects AWS S3 buckets from response sources.
- Find subdomains: Discovers subdomains from response sources.
- Third-party sources: Extracts URLs from services like the Wayback Machine, Common Crawl, VirusTotal, and Alien Vault.
- Easy output formatting: Outputs results in formats that are easy to `grep` and analyze.
- Burp Suite support: Integrates with Burp Suite for easier testing and crawling.
- Advanced filtering: Blacklists and whitelists URLs, including domain-level filtering.
- Subdomain support: Includes subdomains in crawls from both the target site and third-party sources.
- Debug and verbose modes: Enables debugging and detailed logging for easier troubleshooting.
Command Line Options
This is what a generic Gospider command looks like:
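The exact flags depend on your use case; the following is just an illustrative sketch with placeholder values, using flags documented below:

```bash
gospider -s "https://example.com/" -o output -c 10 -d 1
```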
In particular, the supported flags are:
- `-s, --site`: Site to crawl.
- `-S, --sites`: List of sites to crawl.
- `-p, --proxy`: Proxy URL.
- `-o, --output`: Output folder.
- `-u, --user-agent`: User Agent to use (e.g., `web`, `mobi`, or a custom user-agent).
- `--cookie`: Cookie to use (e.g., `testA=a; testB=b`).
- `-H, --header`: Header(s) to use (you repeat the flag multiple times for multiple headers).
- `--burp string`: Load headers and cookies from a Burp Suite raw HTTP request.
- `--blacklist`: Blacklist URL Regex.
- `--whitelist`: Whitelist URL Regex.
- `--whitelist-domain`: Whitelist Domain.
- `-t, --threads`: Number of threads to run in parallel (default: `1`).
- `-c, --concurrent`: Maximum concurrent requests for matching domains (default: `5`).
- `-d, --depth`: Maximum recursion depth for URLs (set to `0` for infinite recursion, default: `1`).
- `-k, --delay int`: Delay between requests (in seconds).
- `-K, --random-delay int`: Extra randomized delay before making requests (in seconds).
- `-m, --timeout int`: Request timeout (in seconds, default: `10`).
- `-B, --base`: Disable all and only use HTML content.
- `--js`: Enable link finder in JavaScript files (default: `true`).
- `--subs`: Include subdomains.
- `--sitemap`: Crawl `sitemap.xml`.
- `--robots`: Crawl `robots.txt` (default: `true`).
- `-a, --other-source`: Find URLs from 3rd party sources like Archive.org, CommonCrawl, VirusTotal, AlienVault.
- `-w, --include-subs`: Include subdomains crawled from 3rd party sources (default: only main domain).
- `-r, --include-other-source`: Include URLs from 3rd party sources and still crawl them.
- `--debug`: Enable debug mode.
- `--json`: Enable JSON output.
- `-v, --verbose`: Enable verbose output.
- `-l, --length`: Show URL length.
- `-L, --filter-length`: Filter URLs by length.
- `-R, --raw`: Show raw output.
- `-q, --quiet`: Suppress all output and only show URLs.
- `--no-redirect`: Disable redirects.
- `--version`: Check version.
- `-h, --help`: Show help.
Web Crawling with Gospider: Step-by-Step Guide
In this section, you will learn how to use Gospider to crawl links from a multipage site. Specifically, the target site will be Books to Scrape:
The site contains a list of products spread across 50 pages. Each product entry on these listing pages also has its own dedicated product page. The steps below will guide you through the process of using Gospider to retrieve all those product page URLs!
Prerequisites and Project Setup
Before you start, ensure you have the following:
- Go installed on your computer: If you have not installed Go yet, download it from the official website and follow the installation instructions.
- A Go IDE: Visual Studio Code with the Go extension is recommended.
To verify that Go is installed, run:
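```bash
go version
```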
If Go is installed correctly, you should see output similar to this (on Windows):
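```
go version go1.24.1 windows/amd64
```

The exact Go version and platform in the output will match your installation.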
Great! Go is set up and ready to go.
Create a new project folder and navigate to it in the terminal:
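For example, assuming a project folder named `gospider-project` (any name works):

```bash
mkdir gospider-project
cd gospider-project
```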
Now, you are ready to install Gospider and use it for web crawling!
Step #1: Install Gospider
Run the following `go install` command to compile and install Gospider globally:
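At the time of writing, Gospider is distributed from the `jaeles-project` GitHub repository, so the command should look like this:

```bash
go install github.com/jaeles-project/gospider@latest
```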
After installation, verify that Gospider is installed by running:
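```bash
gospider -h
```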
This should print the Gospider usage instructions, as shown below:
Amazing! Gospider has been installed, and you can now use it to crawl one or more websites.
Step #2: Crawl URLs on the Target Page
To crawl all links on the target page, run the following command:
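Based on the flag breakdown that follows, the command should be:

```bash
gospider -s "https://books.toscrape.com/" -o output -d 1
```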
This is a breakdown of the Gospider flags used:
-s "https://books.toscrape.com/"
: Specifies the target URL.-o output
: Saves the crawl results inside theoutput
folder.-d 1
: Sets the crawling depth to1
, meaning that Gospider will only detect URLs on the current page. In other words, it will not follow found URLs for deeper link discovery.
The above command will produce the following structure:
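Assuming Gospider's default output naming (a file named after the target domain, as described next), the structure should resemble:

```
output/
└── books_toscrape_com
```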
Open the `books_toscrape_com` file inside the `output` folder, and you will see output similar to this:
The generated file contains different types of detected links:
- `[url]`: The crawled pages/resources.
- `[href]`: All `<a href>` links found on the page.
- `[javascript]`: URLs to JavaScript files.
- `[linkfinder]`: Extracted links embedded in JavaScript code.
Step #3: Crawl the Entire Site
From the output above, you can see that Gospider stopped at the first pagination page. It detected the link to the second page but did not visit it.
You can verify this because the `books_toscrape_com` file contains:
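The entry in question should look roughly like this (the exact output formatting may differ slightly):

```
[href] - https://books.toscrape.com/catalogue/page-2.html
```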
The `[href]` tag indicates that the link was discovered. However, since there is no corresponding `[url]` entry with the same URL, the link was found but never visited.
If you inspect the target page, you will see that the above URL corresponds to the second pagination page:
To crawl the entire website, you need to follow all pagination links. As shown in the image above, the target site contains 50 pagination pages (note the “Page 1 of 50” text). Set Gospider’s depth to `50` to ensure it reaches every page.
Since this will involve crawling a large number of pages, it is also a good idea to increase the concurrency rate (i.e., the number of simultaneous requests). By default, Gospider uses a concurrency level of `5`, but increasing it to `10` will speed up execution.
The final command to crawl all product pages is:
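Combining the depth and concurrency settings discussed above:

```bash
gospider -s "https://books.toscrape.com/" -o output -d 50 -c 10
```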
This time, Gospider will take longer to execute and produce thousands of URLs. The output will now contain entries like:
The key detail to check in the output is the presence of the URL of the last pagination page:
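That is the URL below (shown here on its own; in the file it will appear inside one of the tagged entries):

```
https://books.toscrape.com/catalogue/page-50.html
```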
Wonderful! This confirms that Gospider successfully followed all pagination links and crawled the entire product catalog as intended.
Step #4: Get Only the Product Page URLs
In just a few seconds, Gospider collected all URLs from an entire site. That could be the end of this tutorial, but let’s take it a step further.
What if you only want to extract product page URLs? To understand how these URLs are structured, inspect a product element on the target page:
From this inspection, you can notice how product page URLs follow this format:
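In general terms, that format is (the placeholder names below are purely descriptive):

```
https://books.toscrape.com/catalogue/<product-slug>_<product-id>/index.html
```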
To filter out only product pages from the raw crawled URLs, you can use a custom Go script.
First, create a Go module inside your Gospider project directory:
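For example (the module name is arbitrary):

```bash
go mod init gospider-project
```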
Next, create a `crawler` folder inside the project directory and add a `crawler.go` file to it. Then, open the project folder in your IDE. Your folder structure should now look like this:
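Assuming the module and folder names used above, the layout should resemble:

```
gospider-project/
├── crawler/
│   └── crawler.go
└── go.mod
```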
The `crawler.go` script should:
- Run the Gospider command from a clean state.
- Read all URLs from the output file.
- Filter only product page URLs using a regex pattern.
- Export the filtered product URLs to a .txt file.
Below is the Go code to accomplish the goal:
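A minimal sketch, consistent with the breakdown that follows, could look like this (the Gospider flag values and the product-URL regex are assumptions based on the earlier steps):

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"os/exec"
	"regexp"
	"slices"
)

func main() {
	// remove the output folder, if it exists, to guarantee a clean start
	if err := os.RemoveAll("output"); err != nil {
		log.Fatalf("failed to remove the output folder: %v", err)
	}

	// build and run the Gospider command to crawl the entire target site
	cmd := exec.Command(
		"gospider",
		"-s", "https://books.toscrape.com/",
		"-o", "output",
		"-d", "50",
		"-c", "10",
	)
	if err := cmd.Run(); err != nil {
		log.Fatalf("gospider execution failed: %v", err)
	}

	// open the output file generated by Gospider
	file, err := os.Open("output/books_toscrape_com")
	if err != nil {
		log.Fatalf("failed to open the Gospider output file: %v", err)
	}
	defer file.Close()

	// regex matching product page URLs (assumed format: .../catalogue/<slug>_<id>/index.html)
	productRegex := regexp.MustCompile(`https://books\.toscrape\.com/catalogue/[^/\s]+_\d+/index\.html`)

	// read the output file line by line, extracting unique product page URLs
	var productURLs []string
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		for _, match := range productRegex.FindAllString(scanner.Text(), -1) {
			if !slices.Contains(productURLs, match) {
				productURLs = append(productURLs, match)
			}
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatalf("error while reading the output file: %v", err)
	}

	// export the filtered product page URLs to a .txt file
	outFile, err := os.Create("product_urls.txt")
	if err != nil {
		log.Fatalf("failed to create product_urls.txt: %v", err)
	}
	defer outFile.Close()

	writer := bufio.NewWriter(outFile)
	for _, url := range productURLs {
		fmt.Fprintln(writer, url)
	}
	writer.Flush()

	fmt.Printf("Exported %d product page URLs to product_urls.txt\n", len(productURLs))
}
```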
The Go program automates web crawling by utilizing:
- `os.RemoveAll()` to delete the output directory (`output/`), if it already exists, to guarantee a clean start.
- `exec.Command()` and `cmd.Run()` to construct and execute a Gospider command-line process to crawl the target website.
- `os.Open()` and `bufio.NewScanner()` to open the output file generated by Gospider (`books_toscrape_com`) and read it line by line.
- `regexp.MustCompile()` and `FindAllString()` to use a regex to extract product page URLs from each line, employing `slices.Contains()` to prevent duplicates.
- `os.Create()` and `bufio.NewWriter()` to write the filtered product page URLs to a `product_urls.txt` file.
Step #5: Crawling Script Execution
Launch the `crawler.go` script with the following command:
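Assuming you run it from the project root so that the relative paths resolve correctly:

```bash
go run crawler/crawler.go
```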
The script will log the following in the terminal:
The Gospider crawling script successfully found 1,000 product page URLs. As you can easily verify on the target site, that is exactly the number of product pages available:
Those URLs will be stored in a `product_urls.txt` file generated in your project folder. Open that file, and you will see:
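Something along these lines (the entries below are illustrative):

```
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
https://books.toscrape.com/catalogue/soumission_998/index.html
...
```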
Congrats! You just built a Gospider script to perform web crawling in Go.
[Extra] Add the Scraping Logic to the Gospider Crawler
Web crawling is generally just one step in a larger web scraping project. Learn more about the difference between these two practices by reading our guide on web crawling vs. web scraping.
To make this tutorial more complete, we will also demonstrate how to use the crawled links for web scraping. The Go scraping script we are about to build will:
- Read the product page URLs from the `product_urls.txt` file, which was generated earlier using Gospider and custom logic.
- Visit each product page and scrape product data.
- Export the scraped product data to a CSV file.
Time to add web scraping logic to your Gospider setup!
Step #1: Install Colly
The library used for web scraping is Colly, an elegant scraper and crawler framework for Golang. If you are not familiar with its API, check out our tutorial on web scraping with Go.
Run the following command to install Colly:
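At the time of writing, the current major version is v2, so the module path should be:

```bash
go get github.com/gocolly/colly/v2
```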
Next, create a `scraper` folder within your project directory and add a `scraper.go` file to it. Your project structure should now look like this:
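With the assumed folder names, the layout becomes:

```
gospider-project/
├── crawler/
│   └── crawler.go
├── scraper/
│   └── scraper.go
├── output/
├── product_urls.txt
└── go.mod
```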
Open `scraper.go` and import Colly:
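Assuming the v2 module path installed above:

```go
import (
	"github.com/gocolly/colly/v2"
)
```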
Fantastic! Follow the steps below to use Colly for scraping data from the crawled product pages.
Step #2: Read the URLs to Scrape
Use the following code to retrieve the URLs of the product pages to scrape from the `product_urls.txt` file, which was generated by `crawler.go`:
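A sketch of that logic, reading the file line by line into a `urls` slice (variable names are assumptions):

```go
// open the file containing the crawled product page URLs
file, err := os.Open("product_urls.txt")
if err != nil {
	log.Fatalf("failed to open product_urls.txt: %v", err)
}
defer file.Close()

// read the file line by line and collect the URLs to scrape
var urls []string
scanner := bufio.NewScanner(file)
for scanner.Scan() {
	if url := scanner.Text(); url != "" {
		urls = append(urls, url)
	}
}
```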
To make the above snippet work, include these imports at the beginning of your file:
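```go
import (
	"bufio"
	"log"
	"os"
)
```

These match the standard-library calls used in the sketch above.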
Great! The `urls` slice will contain all the product page URLs ready for scraping.
Step #3: Implement the Data Extraction Logic
Before implementing the data extraction logic, you must understand the structure of the product page’s HTML.
To do that, visit a product page in your browser in incognito mode—to ensure a new session. Open DevTools and inspect the page elements, starting with the product image HTML node:
Next, inspect the product information section:
From the inspected elements, you can extract:
- The product title from the `<h1>` tag.
- The product price from the first `.price_color` node on the page.
- The product rating (stars) from the `.star-rating` class.
- The product image URL from the `#product_gallery img` element.
Given these attributes, define the following Go struct to represent the scraped data:
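For example (the field names are assumptions):

```go
// Product represents the data scraped from a single product page
type Product struct {
	Title    string
	Price    string
	Rating   string
	ImageURL string
}
```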
Since multiple product pages will be scraped, define a slice to store the extracted products:
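```go
var products []Product
```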
To scrape the data, start by initializing a Colly `Collector`:
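```go
c := colly.NewCollector()
```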
Use the `OnHTML()` callback in Colly to define the scraping logic:
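A possible implementation, based on the selectors identified above (the root `"html"` selector and the rating-to-number mapping are assumptions):

```go
c.OnHTML("html", func(e *colly.HTMLElement) {
	// scrape the product title from the <h1> tag
	title := e.ChildText("h1")

	// scrape the product price from the first .price_color node on the page
	price := e.DOM.Find(".price_color").First().Text()

	// map the .star-rating class attribute to a star rating
	rating := ""
	starClass := e.ChildAttr(".star-rating", "class")
	if strings.Contains(starClass, "One") {
		rating = "1"
	} else if strings.Contains(starClass, "Two") {
		rating = "2"
	} else if strings.Contains(starClass, "Three") {
		rating = "3"
	} else if strings.Contains(starClass, "Four") {
		rating = "4"
	} else if strings.Contains(starClass, "Five") {
		rating = "5"
	}

	// convert the relative image URL to an absolute URL
	imageURL := e.ChildAttr("#product_gallery img", "src")
	imageURL = strings.Replace(imageURL, "../../", "https://books.toscrape.com/", 1)

	// store the scraped product
	products = append(products, Product{
		Title:    title,
		Price:    price,
		Rating:   rating,
		ImageURL: imageURL,
	})
})
```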
Note the `else if` structure used to get the star rating based on the class attribute of `.star-rating`. Also, see how the relative image URL is converted to an absolute URL using `strings.Replace()`.
Add the following required import:
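```go
import (
	"strings"
)
```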
Now your Go scraper is set up to extract product data as desired!
Step #4: Connect to the Target Pages
Colly is a callback-based web scraping framework with a specific callback lifecycle. That means you can define the scraping logic before retrieving the HTML, which is an unusual but powerful approach.
Now that the data extraction logic is in place, instruct Colly to visit each product page:
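A sketch of the visiting loop, with the 50-URL cap mentioned in the note below:

```go
// visit up to the first 50 product pages to avoid overwhelming the target site
limit := 50
if len(urls) < limit {
	limit = len(urls)
}
for _, url := range urls[:limit] {
	c.Visit(url)
}
```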
Note: The number of URLs has been limited to 50 to avoid overwhelming the target website with too many requests. In a production script, you can remove or adjust this limitation based on your needs.
Colly will now:
- Visit each URL in the list.
- Apply the `OnHTML()` callback to extract product data.
- Store the extracted data in the `products` slice.
Amazing! All that is left is to export the scraped data to a human-readable format like CSV.
Step #5: Export the Scraped Data
Export the `products` slice to a CSV file using the following logic:
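A sketch using `encoding/csv` (the column names are assumptions):

```go
// create the output CSV file
csvFile, err := os.Create("products.csv")
if err != nil {
	log.Fatalf("failed to create products.csv: %v", err)
}
defer csvFile.Close()

// initialize the CSV writer
writer := csv.NewWriter(csvFile)
defer writer.Flush()

// write the header row
writer.Write([]string{"title", "price", "rating", "image_url"})

// write one record per scraped product
for _, product := range products {
	writer.Write([]string{
		product.Title,
		product.Price,
		product.Rating,
		product.ImageURL,
	})
}
```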
The above snippet creates a `products.csv` file and populates it with the scraped data.
Do not forget to import the CSV package from Go’s standard library:
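```go
import (
	"encoding/csv"
)
```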
This is it! Your Gospider crawling and scraping project is now fully implemented.
Step #6: Put It All Together
`scraper.go` should now contain:
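Assembling the snippets above, the full script would look roughly like this (selectors, field names, and the 50-page cap are the assumptions already discussed):

```go
package main

import (
	"bufio"
	"encoding/csv"
	"log"
	"os"
	"strings"

	"github.com/gocolly/colly/v2"
)

// Product represents the data scraped from a single product page
type Product struct {
	Title    string
	Price    string
	Rating   string
	ImageURL string
}

func main() {
	// read the product page URLs produced by the Gospider crawler
	file, err := os.Open("product_urls.txt")
	if err != nil {
		log.Fatalf("failed to open product_urls.txt: %v", err)
	}
	defer file.Close()

	var urls []string
	scanner := bufio.NewScanner(file)
	for scanner.Scan() {
		if url := scanner.Text(); url != "" {
			urls = append(urls, url)
		}
	}

	// slice where the scraped products will be stored
	var products []Product

	// initialize the Colly collector
	c := colly.NewCollector()

	// define the data extraction logic
	c.OnHTML("html", func(e *colly.HTMLElement) {
		title := e.ChildText("h1")
		price := e.DOM.Find(".price_color").First().Text()

		rating := ""
		starClass := e.ChildAttr(".star-rating", "class")
		if strings.Contains(starClass, "One") {
			rating = "1"
		} else if strings.Contains(starClass, "Two") {
			rating = "2"
		} else if strings.Contains(starClass, "Three") {
			rating = "3"
		} else if strings.Contains(starClass, "Four") {
			rating = "4"
		} else if strings.Contains(starClass, "Five") {
			rating = "5"
		}

		imageURL := e.ChildAttr("#product_gallery img", "src")
		imageURL = strings.Replace(imageURL, "../../", "https://books.toscrape.com/", 1)

		products = append(products, Product{title, price, rating, imageURL})
	})

	// visit up to 50 product pages to avoid overwhelming the target site
	limit := 50
	if len(urls) < limit {
		limit = len(urls)
	}
	for _, url := range urls[:limit] {
		c.Visit(url)
	}

	// export the scraped products to a CSV file
	csvFile, err := os.Create("products.csv")
	if err != nil {
		log.Fatalf("failed to create products.csv: %v", err)
	}
	defer csvFile.Close()

	writer := csv.NewWriter(csvFile)
	defer writer.Flush()

	writer.Write([]string{"title", "price", "rating", "image_url"})
	for _, p := range products {
		writer.Write([]string{p.Title, p.Price, p.Rating, p.ImageURL})
	}
}
```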
Launch the scraper with the command below:
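Again assuming you launch it from the project root:

```bash
go run scraper/scraper.go
```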
The execution may take some time, so be patient. Once it completes, a `products.csv` file will appear in the project folder. Open it, and you will see the scraped data neatly structured in a tabular format:
Et voilà! Gospider for crawling + Colly for scraping is a winning duo.
Limitations of Gospider’s Approach to Web Crawling
The biggest limitations of Gospider’s crawling approach are:
- IP bans due to making too many requests.
- Anti-crawling technologies used by websites to block crawling bots.
Let’s see how to tackle both!
Avoid IP Bans
The consequence of too many requests from the same machine is that your IP address may get banned by the target server. This is a common issue in web crawling, especially when it is not well-configured or ethically planned.
By default, Gospider respects `robots.txt` to minimize this risk. However, not all websites have a `robots.txt` file. Also, even when they do, it might not specify valid rate-limiting rules for crawlers.
To limit IP bans, you could try using Gospider’s built-in `--delay`, `--random-delay`, and `--timeout` flags to slow down requests. Still, finding the right combination can be time-consuming and may not always be effective.
A more effective solution is to use a rotating proxy, which guarantees that each request from Gospider will originate from a different IP address. That prevents the target site from detecting and blocking your crawling attempts.
To use a rotating proxy with Gospider, pass the proxy URL with the `-p` (or `--proxy`) flag:
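For example (replace the placeholder with your actual proxy URL):

```bash
gospider -s "https://books.toscrape.com/" -o output -p "<YOUR_ROTATING_PROXY_URL>"
```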
If you do not have a rotating proxy URL, retrieve one for free!
Bypass Anti-Crawling Tech
Even with a rotating proxy, some websites implement strict anti-scraping and anti-crawling measures. For example, running this Gospider command against a Cloudflare-protected website:
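The command would be something like the following, with a placeholder standing in for the protected site:

```bash
gospider -s "https://<cloudflare-protected-site>/" -o output
```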
The result will be:
As you can see, the target server responded with a `403 Forbidden` response. This means the server successfully detected and blocked Gospider’s request, preventing it from crawling any URLs on the page.
To avoid such blocks, you need an all-in-one web unlocking API. That service can bypass anti-bot and anti-scraping systems, giving you access to the unblocked HTML of any webpage.
Note: Bright Data’s Web Unlocker not only handles these challenges but can also operate as a proxy. So, once configured, you can use it just like a regular proxy with Gospider using the syntax shown earlier.
Conclusion
In this blog post, you learned what Gospider is, what it offers, and how to use it for web crawling in Go. You also saw how to combine it with Colly for a complete crawling and scraping tutorial.
One of the biggest challenges in web scraping is the risk of being blocked—whether due to IP bans or anti-scraping solutions. The best ways to overcome these challenges are using web proxies or a scraping API like Web Unlocker.
Integration with Gospider is just one of many scenarios that Bright Data’s products and services support. Explore our other web scraping tools:
- Web Scraper APIs: Dedicated endpoints for extracting fresh, structured web data from over 100 popular domains.
- SERP API: An API that handles all the unlocking management for search engine results pages and extracts SERP data.
- Scraping Functions: A complete scraping interface that allows you to run your scrapers as serverless functions.
- Scraping Browser: A Puppeteer-, Selenium-, and Playwright-compatible browser with built-in unlocking capabilities.
Sign up now to Bright Data and test our proxy services and scraping products for free!
No credit card required