Web Scraping with Python

by Ryan Mitchell

Book Details

Book Title: Web Scraping with Python
Author: Ryan Mitchell
Publisher: O'Reilly Media
Publication Date: 2024
ISBN: 9781098145354
Number of Pages: 469
Language: English
Format: PDF
File Size: 11.9 MB
Subject: Python for Data Extraction

Table of Contents

  • Preface
  • I. Building Scrapers
      • 1. How the Internet Works
          • Networking
          • HTML
          • CSS
          • JavaScript
          • Watching Websites with Developer Tools
      • 2. The Legalities and Ethics of Web Scraping
          • Trademarks, Copyrights, Patents, Oh My!
          • Trespass to Chattels
          • The Computer Fraud and Abuse Act
          • robots.txt and Terms of Service
          • Three Web Scrapers
      • 3. Applications of Web Scraping
          • Classifying Projects
          • E-commerce
          • Academic Research
          • Product Building
          • Travel
          • Sales
          • SERP Scraping
      • 4. Writing Your First Web Scraper
          • Installing and Using Jupyter
          • Connecting
          • An Introduction to BeautifulSoup
      • 5. Advanced HTML Parsing
          • Another Serving of BeautifulSoup
          • Regular Expressions
          • Regular Expressions and BeautifulSoup
          • Accessing Attributes
          • Lambda Expressions
          • You Don’t Always Need a Hammer
      • 6. Writing Web Crawlers
          • Traversing a Single Domain
          • Crawling an Entire Site
          • Crawling Across the Internet
      • 7. Web Crawling Models
          • Planning and Defining Objects
          • Dealing with Different Website Layouts
          • Structuring Crawlers
          • Thinking About Web Crawler Models
      • 8. Scrapy
          • Installing Scrapy
          • Writing a Simple Scraper
          • Spidering with Rules
          • Creating Items
          • Outputting Items
          • The Item Pipeline
          • Logging with Scrapy
          • More Resources
      • 9. Storing Data
          • Media Files
          • Storing Data to CSV
          • MySQL
          • Email
  • II. Advanced Scraping
      • 10. Reading Documents
          • Document Encoding
          • Text
          • CSV
          • PDF
          • Microsoft Word and .docx
      • 11. Working with Dirty Data
          • Cleaning Text
          • Working with Normalized Text
          • Cleaning Data with Pandas
      • 12. Reading and Writing Natural Languages
          • Summarizing Data
          • Markov Models
          • Natural Language Toolkit
          • Additional Resources
      • 13. Crawling Through Forms and Logins
          • Python Requests Library
          • Submitting a Basic Form
          • Radio Buttons, Checkboxes, and Other Inputs
          • Submitting Files and Images
          • Handling Logins and Cookies
          • Other Form Problems
      • 14. Scraping JavaScript
          • A Brief Introduction to JavaScript
          • Ajax and Dynamic HTML
          • Executing JavaScript in Python with Selenium
          • Additional Selenium WebDrivers
          • Handling Redirects
          • A Final Note on JavaScript
      • 15. Crawling Through APIs
          • A Brief Introduction to APIs
          • Parsing JSON
          • Undocumented APIs
          • Combining APIs with Other Data Sources
          • More About APIs
      • 16. Image Processing and Text Recognition
          • Overview of Libraries
          • Processing Well-Formatted Text
          • Reading CAPTCHAs and Training Tesseract
          • Retrieving CAPTCHAs and Submitting Solutions
      • 17. Avoiding Scraping Traps
          • A Note on Ethics
          • Looking Like a Human
          • Common Form Security Features
          • The Human Checklist
      • 18. Testing Your Website with Scrapers
          • An Introduction to Testing
          • Python unittest
          • Testing with Selenium
      • 19. Web Scraping in Parallel
          • Processes Versus Threads
          • Multithreaded Crawling
          • Multiple Processes
          • Multiprocess Crawling—Another Approach
      • 20. Web Scraping Proxies
          • Why Use Remote Servers?
          • Tor
          • Remote Hosting
          • Web Scraping Proxies
          • Additional Resources
  • Index
  • About the Author