HK Court Lists Archive
A web application that automatically scrapes and archives Hong Kong court lists daily. The front end would offer the archived court list data as a searchable database, free to the public and without restrictions on use.
What is the origin of this Project?
Hong Kong has very little open legal data. Currently, http://legalref.judiciary.gov.hk offers only a limited set of judgments and court documents, such as High Court judgments. Other private, pay-walled services such as D-Law exist, but their data is patchy and expensive (over $100 per document order).
All court cases are otherwise recorded in the court lists as soon as they enter the justice system, at the "Mention" stage. Currently, the court lists are available for only 7 days: 3 days before and after the current day. There is no publicly available archive.
As a matter of principle, justice cannot exist without transparency. Open legal data is crucial to a sound justice system.
What social problem are you trying to solve?
Journalists often learn of a case only after that short window, for example from the GIS system, which carries limited information on the case. Once the window has passed, it is impossible to search by the individual's or organization's name, case number, date, or nature of charge.
The web app would be useful not only to journalists hoping to pursue a case or research an individual's or organization's background, but also to due diligence professionals, legal professionals, and the public in general.
How do we begin from scratch?
What challenges do we face?
- There may be challenges from the government or organizations over violation of privacy (although the only private information would be the name)
- There may be government restrictions on the use of legal data
- Long-term archive maintenance
- Long-term server space, and possibly server maintenance
What to do:
- Seek legal advice on privacy issues
- Build a scraper, possibly with the help of existing open source tools, at fixed daily intervals
- Build a database to store the data scraped
- Build a front-end web application, with data entry points: search by parties' name, date, court, nature of charge, etc. (Ref: Pacer.gov) then offer a full list of data available.
- Long-term database maintenance
- Might need fundraising efforts to hire coders for longer-term development, and server space rental
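The scraper and retry steps above could be sketched as follows. This is a minimal sketch using only the standard library; the list URL, its parameters, and the `raw_pages` table are illustrative assumptions, not the real judiciary site's scheme, which would need to be confirmed first.

```python
import sqlite3
import time
import urllib.request

# Hypothetical list URL -- the real judiciary site's URL scheme differs
# and must be checked before building the actual scraper.
LIST_URL = "https://example.judiciary.hk/court-lists?court={court}&date={date}"

def fetch_with_retry(url, retries=3, backoff=5):
    """Fetch one page, retrying on transient network errors."""
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except OSError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))  # simple linear backoff

def archive_page(db, court, list_date, html):
    """Store the raw HTML so parsing can be redone if the markup changes."""
    db.execute(
        "INSERT INTO raw_pages (court, list_date, html) VALUES (?, ?, ?)",
        (court, list_date, html),
    )
    db.commit()

db = sqlite3.connect("court_lists.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS raw_pages (court TEXT, list_date TEXT, html TEXT)"
)
```

A script like this could be run once a day from a scheduler (e.g. cron), covering the fixed daily interval mentioned above; archiving the raw HTML first means the parsing step can be rerun later without losing data.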
What resources do you need?
- Python with Scrapy for the scraper
- Scraping frequency: manual runs vs. a fixed daily schedule
- Retry on errors
- Server space estimation and data compression
- SQL database for managing large datasets
- Crawl the different court levels, e.g.:
10,000 characters ≈ 10 KB per court per day
10,000 characters × 20 courts × 5 days × 50 weeks
= 50,000,000 characters
≈ 50 MB per year
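The back-of-envelope estimate above checks out in a couple of lines of Python:

```python
# Storage estimate from the figures above: ~10,000 characters per court list,
# 20 courts, 5 sitting days a week, 50 weeks a year, at roughly 1 byte per
# character of uncompressed text.
chars_per_year = 10_000 * 20 * 5 * 50
print(chars_per_year)              # -> 50000000
print(chars_per_year / 1_000_000)  # -> 50.0, i.e. roughly 50 MB per year
```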
computing speed as data accumulates in the long term?
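The "SQL database" item above could start as simply as a SQLite schema with indexes on the planned search entry points (party name, date, court, nature of charge). The field names here are assumptions, and SQLite is just one candidate for the early stage:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # a file path in production
db.executescript("""
CREATE TABLE hearings (
    case_number  TEXT,
    party_name   TEXT,
    court        TEXT,
    hearing_date TEXT,   -- ISO yyyy-mm-dd, so string comparison sorts by date
    charge       TEXT
);
-- Indexes on the planned search entry points.
CREATE INDEX idx_party ON hearings (party_name);
CREATE INDEX idx_date  ON hearings (hearing_date);
CREATE INDEX idx_case  ON hearings (case_number);
""")

def search_by_party(db, name):
    """Return all hearings whose party name contains the search term."""
    return db.execute(
        "SELECT case_number, court, hearing_date, charge "
        "FROM hearings WHERE party_name LIKE ?",
        (f"%{name}%",),
    ).fetchall()
```

At the ~50 MB/year scale estimated above, SQLite with plain indexes should stay fast for years of data; a heavier database would only be needed if traffic, not data volume, demands it.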
- demo using morph.io (by Omar K.)
- This is a very preliminary prototype. It currently handles only one of the 29 court hearing lists.
- morph.io supports scraping the site once per day, and provides CSV and SQLite downloads for further storage
- TODO: extend scrape.py to cover all court hearing lists and set up a permanent database hosting server
- wget (by Kennon Wong)
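The TODO above, extending the prototype from one hearing list to all 29, could be as simple as iterating over a table of court identifiers. The codes and URL pattern below are placeholders, not the real site's:

```python
# Placeholder court-list identifiers; the real site defines 29 hearing lists
# and its actual codes and URL pattern would need to be checked.
COURT_CODES = ["CFA", "HC", "DC", "FLCC", "KCMAG"]

def list_urls(date):
    """Yield one (court code, URL) scrape target per hearing list for a date."""
    base = "https://example.judiciary.hk/lists/{code}/{date}"  # hypothetical
    for code in COURT_CODES:
        yield code, base.format(code=code, date=date)
```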
Who are we?
Selina Cheng, reporter