Web scraping with Python


Introduction

Web scraping is an automated, programmatic process for extracting (‘scraping’) data from webpages. Also known as screen scraping or web harvesting, it can collect data from any publicly accessible webpage. Depending on the website and the jurisdiction, web scraping may be illegal or prohibited by the site’s terms of use.

Remarks

Useful Python packages for web scraping (alphabetical order)

Making requests and collecting data

requests

A simple but powerful package for making HTTP requests.
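
As a minimal sketch of how a page might be fetched with requests: the helper name `fetch_html`, the User-Agent string, and the default timeout below are illustrative assumptions, not part of the library.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    # Use the caller's session if one is supplied, otherwise create one.
    if session is None:
        session = requests.Session()
    # Placeholder User-Agent; replace it with one that identifies your scraper.
    headers = {"User-Agent": "example-scraper/0.1"}
    response = session.get(url, headers=headers, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of parsing error pages.
    response.raise_for_status()
    return response.text
```

Passing the session in as a parameter keeps the function easy to test and lets several requests share one connection pool.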

requests-cache

Caching for requests. Caching responses is very useful. In development, it lets you avoid hitting a site unnecessarily. During a real collection, it means that if your scraper crashes (perhaps you didn’t handle some unusual content on the site, or the site went down), you can repeat the collection very quickly up to the point where it stopped.

scrapy

Useful for building web crawlers, where you need something more powerful than using requests and iterating through pages.

selenium

Python bindings for Selenium WebDriver, for browser automation. Using requests to make HTTP requests directly is often simpler for retrieving webpages. However, this remains a useful tool when it is not possible to replicate the desired behaviour of a site using requests alone, particularly when JavaScript is required to render elements on a page.

HTML parsing

BeautifulSoup

Queries HTML and XML documents using a choice of parsers (Python’s built-in html.parser, html5lib, or lxml’s HTML and XML parsers).
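
As an illustration of querying a document with BeautifulSoup, the sketch below pulls link text and targets out of an inline HTML snippet. The sample markup and the helper name `extract_links` are made up for the example.

```python
from bs4 import BeautifulSoup

# Made-up sample document.
SAMPLE_HTML = """
<html><body>
  <ul>
    <li><a href="/a"> First </a></li>
    <li><a href="/b">Second</a></li>
    <li><a>No target</a></li>
  </ul>
</body></html>
"""


def extract_links(html, parser="html.parser"):
    """Return (text, href) pairs for every <a> element that has an href."""
    # The second argument selects the parser; "html.parser" is the
    # built-in one and needs no extra installation.
    soup = BeautifulSoup(html, parser)
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]
```

Switching the `parser` argument to `"lxml"` or `"html5lib"` changes the underlying parser without changing the query code, provided that package is installed.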

lxml

Processes HTML and XML. Can be used to query and select content from HTML documents via CSS selectors and XPath.
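
A short sketch of selecting content with XPath in lxml follows. The sample markup and function names are illustrative assumptions. The CSS selector route is omitted here because `cssselect()` relies on the separate cssselect package.

```python
from lxml import html

# Made-up sample document.
SAMPLE_PAGE = (
    "<html><body>"
    '<h2 class="title"> One </h2>'
    "<h2>skip</h2>"
    '<h2 class="title">Two</h2>'
    '<a href="/x">x</a>'
    "</body></html>"
)


def extract_headings(page_source):
    """Return the text of every <h2 class="title"> element."""
    tree = html.fromstring(page_source)
    return [
        h.text_content().strip()
        for h in tree.xpath("//h2[@class='title']")
    ]


def extract_hrefs(page_source):
    """Return the href attribute of every <a> element."""
    tree = html.fromstring(page_source)
    # An XPath that ends in an attribute yields plain strings.
    return [str(href) for href in tree.xpath("//a/@href")]
```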
