Web scraping with Scrapy

As a tabletop RPG game master, whenever I need to imagine the background universe of a scenario, I look for illustrations to picture the atmosphere and get some inspiration. I usually just surf the web from blog to blog, or spend some time on inspirational websites like DeviantArt, Tumblr, Unurth or DarkRoastedBlend.

But last week, I stumbled upon Sylvain Robert's great illustration gallery. It contains thousands of fantasy images, from D&D to comics.

But... I was only looking for goblin illustrations.

Hence, I decided to scrape all the file names from the Apache directory index pages.

My first attempts involved wget and httrack, but I found no way to use their spider modes to only list the file names AND ignore URLs containing the string "fichiers/".

So my fallback solution was Scrapy, which leverages all the flexibility and simplicity of Python for web crawling.

In a matter of minutes, I was able to write a very basic but functional scraper. Here is a refined version:

from collections import Counter, OrderedDict
import json
import os
import scrapy
import urlparse

class FileUrl(scrapy.Item):
    url = scrapy.Field()

class HtmlDirectoryCrawler(scrapy.Spider):
    name = os.path.basename(__file__)

    def __init__(self, url=''):
        # The start URL is passed on the command line with -a url=...
        self.start_urls = [url]
        self.ext_counter = Counter()

    def parse(self, response):
        # Skip the 5 first hrefs of the Apache index page:
        # Name, Last modified, Size, Description, Parent Folder
        for href in response.xpath('//a/@href')[5:]:
            child_url = urlparse.urljoin(response.url, href.extract())
            is_garbage = child_url.endswith('_fichiers/')
            if is_garbage:
                continue
            is_folder = (child_url[-1] == '/')
            if is_folder:
                # Recurse into sub-directories
                yield scrapy.Request(child_url, self.parse)
            else:
                # Count the file extension and emit the URL as an item
                ext = child_url.split('.')[-1]
                self.ext_counter[ext] += 1
                yield FileUrl(url=child_url)

    def closed(self, reason):
        # Called when the spider finishes: log the extension stats,
        # sorted by increasing frequency
        ordered_ext_counter = OrderedDict(sorted(
                self.ext_counter.iteritems(),
                key=lambda (k, v): (v, k)))
        self.log("Stats on extensions found:\n{}".format(
                json.dumps(ordered_ext_counter, indent=4)),
                level=scrapy.log.INFO)

At the end of the scraping, I wanted to know the count of files per extension. To do so, I incremented a collections.Counter for every file processed, and in the closed() callback I used a collections.OrderedDict and json.dumps to pretty-print the dictionary of extension frequencies.
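
The same pattern, outside of the spider, boils down to a few lines (a minimal standalone sketch; the file names below are made up for illustration):

from collections import Counter, OrderedDict
import json

# Hypothetical file names, just to illustrate the pattern
filenames = ['goblin.jpg', 'map.png', 'dragon.jpg', 'notes.pdf']

ext_counter = Counter(name.split('.')[-1] for name in filenames)

# Sort by increasing frequency, then by extension name,
# and keep that order in an OrderedDict so json.dumps preserves it
ordered = OrderedDict(sorted(ext_counter.items(),
                             key=lambda kv: (kv[1], kv[0])))
print(json.dumps(ordered, indent=4))
# {
#     "pdf": 1,
#     "png": 1,
#     "jpg": 2
# }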

The following command runs the scraper and dumps its output to a file named crpp0001.uqtr.ca.json:

# time scrapy runspider --pdb -L INFO html_dir_crawler.py -a url=http://crpp0001.uqtr.ca/w4/campagne/images -o crpp0001.uqtr.ca.json
...
2014-09-09 13:47:48+0200 [html_dir_crawler.pyc] INFO: Spider opened
2014-09-09 13:47:48+0200 [html_dir_crawler.pyc] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-09-09 13:48:48+0200 [html_dir_crawler.pyc] INFO: Crawled 284 pages (at 284 pages/min), scraped 11343 items (at 11343 items/min)
2014-09-09 13:49:48+0200 [html_dir_crawler.pyc] INFO: Crawled 957 pages (at 673 pages/min), scraped 29275 items (at 17932 items/min)
2014-09-09 13:50:43+0200 [html_dir_crawler.pyc] INFO: Closing spider (finished)
2014-09-09 13:50:43+0200 [html_dir_crawler.pyc] INFO: Stats on extensions found:
...
            "PNG": 60,
            "URL": 69,
            "jpe": 70,
            "pdf": 134,
            "html": 162,
            "FCW": 194,
            "zip": 216,
            "BMP": 235,
            "htm": 419,
            "psd": 608,
            "bmp": 657,
            "jpeg": 854,
            "fcw": 892,
            "db": 1051,
            "JPG": 1192,
            "gif": 1872,
            "png": 2989,
            "jpg": 27162
        }
2014-09-09 13:50:43+0200 [html_dir_crawler.pyc] INFO: Stored json feed (39340 items) in: crpp0001.uqtr.ca.json
...
real    0m45.489s
user    0m27.370s
sys     0m0.553s

And now I can search all those JPEG file names with a simple grep!

# grep -ic goblin crpp0001.uqtr.ca.json
35
# grep -i space crpp0001.uqtr.ca.json | grep -iv espace | wc -l
56
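
For more elaborate filtering, the JSON feed can also be post-processed in Python. Here is a minimal sketch, relying on the fact that the -o exporter writes the items as a JSON list of {"url": ...} objects:

import json

with open('crpp0001.uqtr.ca.json') as feed:
    items = json.load(feed)  # a list of {"url": ...} objects

goblin_urls = [item['url'] for item in items
               if 'goblin' in item['url'].lower()]
print(len(goblin_urls))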

EDIT[8/10/2014]: as an exercise, you could try to scrape all the tags from Justin Mason's Weblog and then use tagcrowd.com (or the fantastic JavaScript library d3-cloud) to generate the following tag cloud:

[Image: PNG tag cloud generated from Justin Mason's Weblog tags]
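
A possible starting point for that exercise could look like the sketch below. The start URL, the rel="tag" selector and the pagination selector are assumptions based on common WordPress markup, not something checked against that particular blog:

import scrapy
import urlparse

class TagItem(scrapy.Item):
    tag = scrapy.Field()

class TagSpider(scrapy.Spider):
    name = 'tag_spider'
    # Hypothetical start URL: replace it with the weblog's actual address
    start_urls = ['http://example-weblog.com/']

    def parse(self, response):
        # WordPress themes commonly expose post tags as rel="tag" links
        for tag in response.xpath('//a[@rel="tag"]/text()').extract():
            yield TagItem(tag=tag)
        # Follow "next page" pagination links, if the theme provides them
        for next_page in response.xpath('//a[contains(@class, "next")]/@href').extract():
            yield scrapy.Request(urlparse.urljoin(response.url, next_page), self.parse)

The resulting feed (e.g. scrapy runspider tag_spider.py -o tags.json) could then be pasted into tagcrowd.com or fed to d3-cloud.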