
A step-by-step guide to web scraping

Web scraping, or crawling, is the art of fetching data from a third-party website by downloading and parsing its HTML code to extract the data you want. It is often tricky: you have to cope with bad HTML, heavy JavaScript use and anti-bot techniques. Lots of companies use it to gather information on competitor prices, news aggregation, mass email collect…

This book will teach you how to extract data from any website, how to deal with AJAX- and JavaScript-heavy websites, how to break captchas, how to deploy your scrapers in the cloud, and many other advanced techniques.

This is a pre-sale page: you will get the first 3 chapters with your pre-order. The book is expected to be finished in the first quarter of 2018.

What you will get

150+ page eBook

The finished book should run to 150-200 pages of detailed instructions, code examples, tips and exercises.

Source code included

You will have full access to the source code of each chapter.

Exercises

You will have access to a sandbox website and exercises to test your knowledge and apply the techniques you have learnt.

Table of Contents

  • 1. Introduction to Web Scraping

    In this chapter you will learn what web scraping is, who uses it and for what purpose, and what the legal side looks like.

  • 2. Web fundamentals

    You can't scrape the web before really understanding it. We will go through the important foundations of the web: the HTTP protocol and the DOM.

  • 3. Extracting the data you want

    In this chapter you will learn how to parse simple HTML through lots of different examples.

  • 4. Handling forms

    Dealing with forms can be complicated. In this chapter I will show you how to get through login forms and how to post any form.

  • 5. Dealing with JavaScript

    JavaScript-heavy websites can be quite complicated to deal with. In this chapter we will see how to use Chrome in headless mode to handle this task.

  • 6. Captchas, Image Keypads and other beautiful things

    Learn how to deal with captchas, sign in through login forms protected by an "image keypad", and handle other annoying things.

  • 7. Stay undercover

    In this chapter we will see how to stay undetected, how to use proxies, and how to make our scraping bots look like humans.

  • 8. Cloudy Scraping

    Learn how to run your scrapers in the cloud to perform large-scale web scraping tasks.

About me

Hi there, I'm Kevin Sahin, the author of Java Web Scraping Handbook. I have a personal blog where I write about web scraping and software development. I am also the founder of SaasFactory, a company that operates several Software as a Service tools.

Previously I spent more than four years building large-scale web scrapers in the fintech industry; we're talking about millions of web pages scraped each day. I got my BS in computer science from Paul Sabatier University in Toulouse, France. I wish I had had a book like this when I started my job, to answer all the questions I had. Unfortunately, there weren't a lot of good resources about web scraping back then. But now there is one :)