Scraping using the Scrapy framework

First you have to set up a new Scrapy project. Enter a directory where you’d like to store your code and run:

scrapy startproject projectName

To scrape a site you need a spider. A spider defines how a particular site is crawled and which data is extracted from its pages. Here’s the code for a spider that follows the links to the top-voted questions on Stack Overflow and scrapes some data from each page (source):

import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'  # each spider in a project needs a unique name
    start_urls = ['http://stackoverflow.com/questions?sort=votes']  # crawling starts from these URLs

    def parse(self, response):
        # Called with the response for each start URL. The response to every
        # request yielded here is passed to parse_question.
        # A CSS selector finds the question URLs on the listing page.
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        # Extract the fields of interest from a single question page.
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
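
The parse method relies on response.urljoin to turn the links found on the listing page into absolute URLs. As a rough illustration of that step, the sketch below uses the standard library's urllib.parse.urljoin, which the Scrapy documentation describes response.urljoin as wrapping. The helper name resolve_links and the question paths are made up for this example and do not refer to real pages.

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL).
page_url = 'http://stackoverflow.com/questions?sort=votes'

# Hypothetical href values, as a CSS selector might return them.
hrefs = [
    '/questions/1/example-question',                         # root-relative link
    'http://stackoverflow.com/questions/2/another-example',  # already absolute
]


def resolve_links(base_url, links):
    """Return absolute URLs for links found on the page at base_url."""
    return [urljoin(base_url, link) for link in links]


full_urls = resolve_links(page_url, hrefs)
for url in full_urls:
    print(url)
```

Links that are already absolute pass through unchanged, so the spider can treat every href the same way.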

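Save your spider classes in the project's spiders directory. Because startproject creates a Python package named after the project inside the project folder, that directory is projectName/projectName/spiders. In this case the file would be projectName/projectName/spiders/stackoverflow_spider.py.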

Now you can use your spider. For example, try running (in the project’s directory):

scrapy crawl stackoverflow
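
Each item the spider yields is a plain dict whose values come straight from the page, so the vote count arrives as text, and extract_first() returns None when a selector matches nothing. The following is a minimal post-processing sketch, assuming you want trimmed titles and numeric votes. clean_item is a hypothetical helper rather than part of Scrapy, and the sample item is invented for illustration.

```python
def clean_item(item):
    """Return a copy of a scraped item with a trimmed title and integer votes."""
    cleaned = dict(item)  # leave the original item untouched

    title = item.get('title')
    cleaned['title'] = title.strip() if title is not None else None

    # The vote count is scraped as text; fall back to None when the
    # selector matched nothing or the text is not a plain integer.
    try:
        cleaned['votes'] = int(item.get('votes'))
    except (TypeError, ValueError):
        cleaned['votes'] = None

    return cleaned


# Invented sample shaped like the dict yielded by parse_question.
sample = {
    'title': '  An example question title \n',
    'votes': '1234',
    'tags': ['python', 'scrapy'],
    'link': 'http://stackoverflow.com/questions/1/example-question',
}

print(clean_item(sample))
```

A helper like this could be called in parse_question before yielding the item, or moved into an item pipeline if several spiders share the same clean-up.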




67 Type Hints

68 pip PyPI Package Manager

69 locale module

70 Exceptions

71 Web scraping

72 deque module

73 Distributing self-contained applications

74 Property Objects

75 Overloading

76 Debugging

77 Reading and Writing CSV

78 Dynamic code execution with exec and eval

79 PyInstaller - Distributing Python Code

80 Iterables and Iterators

81 Data Visualization

82 The interpreter command line console

83 args and kwargs

84 functools module

85 Garbage Collection

86 Indentation

87 Security and Cryptograhy

88 Pickle data serialization

89 urllib

90 Binary Data

91 Python and Excel

92 Idioms

93 Method Overriding

94 Difference between a module and a package

95 Data Serialization

96 Python Concurrency

97 RabbitMQ using AMQPStorm

98 PostgreSQL

99 Descriptor

100 Common Pitfalls

101 Multiprocessing

102 Creating temporary files with tempfile

103 Working with ZIP files

104 Stack

105 Profiling

106 User-Defined Methods

107 Working around Global Interpreter Lock

108 Deployment using conda

109 Logging

110 Processes and Threads

111 os module

112 Comments and documentation

113 Database Access

114 Python HTTP Server

115 Alternatives to switch statement from other languages

116 List destructuring

117 Accessing Python source code and bytecode

118 Mixins

119 Attribute Access

120 ArcPy

121 Python Anti-Patterns

122 Plugin and Extension Classes

123 Websockets

124 Immutable data types

125 String representations of class

126 Arrays

127 Operator Precedence

128 Polymorphism

129 Alternative Python implementations

130 List Comprehensions

131 Web Server Gateway Intrerface WSGI

132 2to3 tool

133 Abstract Syntax Tree

134 Abstract Base Classes

135 Unicode

136 ssh in Python

137 Serial Communication with pyserial

138 Neo4j

139 Performance optimization

140 Curses

141 Templates

142 pass statement

143 Testing with py.test

144 Date Formatting

145 heapq

146 tkinter

147 CLI subcommands

148 Defining functions with list arguments

149 SQLite3 module

150 Persistence with pickle

151 Connecting to SQL Server

152 Design Patterns

153 Multidimensional arrays

154 Audio

155 pyglet

156 queue module

157 ijson

158 webbrowser module

159 base64 module

160 Flask

161 Groupby

162 Sockets and Message Encryption / Decryption

163 pygame

164 Input Subset and Output External Data Files using Pandas

165 hashlib

166 Gzip

167 ctypes

168 Creating a Windows Service

169 Mutable vs. Immutable

170 configparser

171 Common Exceptions

172 Optical Character Recognition OCR

173 Python Data Types

174 Partial functions

175 Generating graphs

176 Unzipping Files

177 Functional Programming

178 Python Virtual Environment - virtualenv

179 sys module

180 virtual environment with virtualenvwrapper

181 virtualenvwrapper on Windows

182 Python Requests Post

183 Plotting with Matplotlib

184 Python Lex-Yacc

185 pyaudio

186 shelve

187 pip and PyPI Package Manager

188 Writing to CSV from String or List

189 Raise Custom Errors Exceptions

190 Using loops within functions

191 Contributors