How to write a scraper

What are scrapers?

To be able to display and send alerts for planning applications, Planning Alerts needs to download applications from as many councils as possible. As the vast majority of councils don't supply the data in a reusable, machine-readable format, we need to write web scrapers for each local government authority.

These scrapers fetch the data from council web pages and present it in a structured format so we can load it into the Planning Alerts database.

How can I help?

If you have some computer programming experience, you should be able to work out how to write a scraper for Planning Alerts. All Planning Alerts scrapers are hosted on our morph.io scraping platform, which takes care of the boring bits of scraping for you (well, most of them!).

The next thing to do is decide which council to scrape. Once you've picked one, look it up on our crowd-sourced list of councils and have a look at their published planning applications. Quickly double-check that the council isn't already covered.

Some systems for displaying development applications on council websites are widely used. For most of those we have already developed scrapers that can handle many authorities using the same system. Check whether the council you want to scrape uses one of these systems: Masterview, Civica, Icon, ATDIS, Horizon, Technology One or Epathway. We don't yet have good documentation on how to recognise these different systems, but that's something we want to create. Or maybe you can help?

An introduction to scraping with morph.io

With morph.io, you can choose to write your scraper in Ruby, Python, PHP or Perl, so there's a good chance you're already familiar with one of the available languages. Since all of the code is hosted on GitHub, you're probably also already familiar with how to share and collaborate on your scraper code.

morph.io provides great conveniences like taking care of saving your data, running your scraper regularly, and emailing you when there's a problem.

You can find out more in the morph.io documentation.

Now it's time to scrape

Make sure you have a GitHub account, then you can use it to sign in to morph.io and create a new scraper that downloads and saves the following information:

Required fields

The following fields are required. All development applications should have these bits of information.

council_reference
Example: TA/00323/2012
The ID that the council has given the planning application. This must also be the unique key for this data set.

address
Example: 1 Sowerby St, Goulburn, NSW
The physical address that this application relates to. This will be geocoded, so it doesn't need to be in a specific format, but obviously the more explicit it is the more likely it is to be geocoded successfully. If the original address does not include the state (e.g. "QLD") at the end, add it.

description
Example: Ground floor alterations to rear and first floor addition
A text description of what the planning application seeks to carry out.

info_url
Example: https://foo.gov.au/app?key=527230
A URL that provides more information about the planning application. This should be a persistent URL, preferably one specific to this particular application. In many cases councils force users to click through a licence agreement to access planning applications; in that case, be careful about which URL you provide. Test the link in a browser that hasn't established a session with the council's site, to make sure users of Planning Alerts will be able to follow it without being presented with an error.

date_scraped
Example: 2012-08-01
The date that your scraper is collecting this data (i.e. now). Should be in ISO 8601 format. Use the following Ruby code: Date.today.to_s

Note that there used to be a required field called comment_url. It is no longer used, though you might still see it referenced in older scrapers.
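
To make this concrete, here is a minimal sketch of a Ruby scraper that collects the required fields. The URL, CSS selectors and state suffix are all hypothetical placeholders that you would adapt to the council's actual pages; it uses the mechanize and scraperwiki gems, which morph.io Ruby scrapers commonly rely on.

require "date"
require "mechanize"
require "scraperwiki"

# Hypothetical listing page; replace with the council's real URL.
BASE_URL = "https://www.example-council.nsw.gov.au/applications"

agent = Mechanize.new
page = agent.get(BASE_URL)

# The ".application" selector and those below are placeholders.
page.search(".application").each do |app|
  address = app.at(".address").text.strip
  # Append the state if the council leaves it off, to help geocoding.
  address += ", NSW" unless address =~ /NSW\s*$/

  record = {
    "council_reference" => app.at(".reference").text.strip,
    "address" => address,
    "description" => app.at(".description").text.strip,
    "info_url" => app.at("a.details")["href"],
    "date_scraped" => Date.today.to_s
  }
  # council_reference is the unique key, so re-running the scraper
  # updates existing rows rather than duplicating them.
  ScraperWiki.save_sqlite(["council_reference"], record)
end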

Optional fields

The following fields are optional because not every planning authority provides them. Please do include them where the data is available.

date_received
Example: 2012-06-23
The date this application was received by council. Should be in ISO 8601 format.

on_notice_from
Example: 2012-08-01
The date from which public submissions can be made about this application. Should be in ISO 8601 format.

on_notice_to
Example: 2012-08-14
The date until which public submissions can be made about this application. Should be in ISO 8601 format.

comment_email
Example: foo@bar.com
Only set this in extremely unusual situations. It allows each application from a single planning authority to go to a different email address. It should never be set for 99.9% of authorities, where a single email address is used for all comments. Currently it is only used for the SA Planning Portal, where comments are ideally sent back to the originating local council so that staff in state government don't have to do the redirection by hand.

comment_authority
Example: Acme Council
Only set this in extremely unusual situations. Gives the name associated with the comment_email address.
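
Continuing the sketch above (the ".closing-date" selector is again a hypothetical placeholder), optional fields can be added to the record only when the council actually provides them:

# Only include optional fields when they're present on the page.
if (closing = app.at(".closing-date"))
  record["on_notice_to"] = Date.parse(closing.text.strip).to_s
end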

Versioning application data

It's important that scrapers collect the latest, most up-to-date information. If the information about an application changes (because, for instance, a council updates the wording or corrects a mistake), your scraper should pick up the new version.

For that reason, it's good practice for your scraper to look back over a reasonable window (one month works well) and scrape every application that might have changed in that time. That way you're most likely to catch any changes. Often it's not possible to simply get a list of applications that changed recently. Instead you might have to scrape, say, a list of applications that were recently received as well as a list of applications that have recently been determined (whether approved or not).
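
For example, if the council's search accepts a date parameter, a scraper might walk back over the last month. The URL and parameter name here are hypothetical:

require "date"

# Re-scrape everything from the last month so changes are picked up.
((Date.today - 30)..Date.today).each do |day|
  url = "https://www.example-council.nsw.gov.au/applications?received=#{day}"
  # ... fetch this page and save each application found, as above ...
end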

When you save an updated version of an application, make sure you use the council_reference field as the unique key. That way you don't end up with multiple versions of the same record. If you're writing your scraper in Ruby, that will look something like:

ScraperWiki.save_sqlite(['council_reference'], record)
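
Because council_reference is the unique key, re-running the scraper with changed data updates the existing row in place rather than inserting a duplicate.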

When the main Planning Alerts system reads the latest application data from your scraper on morph.io, it automatically keeps track of the changes to individual applications, so nothing is ever truly lost: there is always a history of which fields changed and when. At the moment this information is recorded in the database but isn't yet exposed to users in the main application or through the data published via the API.

If you get stuck, have a look at the scrapers already written for Planning Alerts and post on the morph.io forum if you have any questions.

Scheduling the scraper

Set the scraper to run once per day. You can do this on the scraper's settings page on morph.io.

Finishing up

Once you've finished your scraper and it's successfully downloading planning applications, contact us and we'll fork it into the planningalerts-scrapers organization and import it into Planning Alerts.

The last thing to do is look up on Wikipedia how many people live in the council area you've just covered, so you can pat yourself on the back knowing that you've just helped tens of thousands of people get Planning Alerts.