🔀 Multithreading Scraping

Engineering Python
Author: Ksenia Legostay
I love playing around with data in my free time

Parallel web scraping

The project is a web scraping script written in Python. It uses multithreading and a list of available proxy servers that is refreshed on every run.

Goal

The script extracts the names and prices of the Top-100 crypto coins and stores the data in a database.
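The storage step could look roughly like the sketch below. The project does not say which database it uses, so SQLite (via the standard-library `sqlite3` module), the `coins` table, its columns, the `save_coins` helper and the sample rows are all assumptions made for illustration.

```python
import sqlite3


def save_coins(conn, coins):
    """Insert or update (name, price) rows; `coins` is an iterable of tuples."""
    # Hypothetical schema: one row per coin, keyed by its name.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS coins (name TEXT PRIMARY KEY, price REAL)"
    )
    # INSERT OR REPLACE keeps a single row per coin across repeated runs.
    conn.executemany(
        "INSERT OR REPLACE INTO coins (name, price) VALUES (?, ?)", coins
    )
    conn.commit()


# Hypothetical sample rows standing in for scraped values.
conn = sqlite3.connect(":memory:")
save_coins(conn, [("Bitcoin", 100.0), ("Ethereum", 10.0)])
rows = conn.execute("SELECT name, price FROM coins ORDER BY name").fetchall()
```

Keying the table on the coin name means that re-running the script updates prices in place rather than piling up duplicate rows. A real run would more likely connect to a file or a database server than to `:memory:`.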

Project features

  • The script scrapes pages one level deep: it collects the internal links from the main page and then visits each linked page.
  • To avoid being banned by the remote server, the script uses a proxy management mechanism.
  • Because free public proxy servers are often unreliable, the proxy list is maintained as follows:
    • A separate scraping script extracts a list of free public proxy servers from a website.
    • On each run, the list of 10 proxy servers is refreshed with those currently available.
    • Proxies can still become unavailable during a run, so each scraping request cycles through the list until it finds an active proxy before proceeding.
  • To speed up scraping of all 101 web pages, the workload is distributed across four threads running concurrently.
  • The extracted data is written directly to a database.
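The proxy fallback and the four-thread split described above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the function names (`fetch_via_proxy`, `fetch_with_fallback`, `scrape_all`), the `host:port` proxy format and the 5-second timeout are assumptions, and only the Python standard library is used.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def fetch_via_proxy(url, proxy, timeout=5):
    """Download `url` through one HTTP proxy given as 'host:port'."""
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_with_fallback(url, proxies, fetch=fetch_via_proxy):
    """Try each proxy in order and return the first successful response."""
    last_error = None
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except OSError as error:  # URLError and socket timeouts subclass OSError
            last_error = error
    raise RuntimeError(f"no working proxy for {url}") from last_error


def scrape_all(urls, proxies, fetch=fetch_via_proxy, workers=4):
    """Fetch every URL concurrently; results keep the order of `urls`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(
            pool.map(lambda u: fetch_with_fallback(u, proxies, fetch), urls)
        )
```

Threads suit this workload because each request spends most of its time waiting on the network, so four workers can overlap that waiting. Passing the `fetch` function as a parameter is a design choice made for this sketch: it lets the fallback logic be exercised with a stub instead of live proxies.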

Project repository

Link