better_regulation/common-data-format

How can we bring legislation into a format suitable for machine learning?

Once we have derived the key components of a piece of legislation and defined a list of attributes based on paragraph types, the next step is to create a common data format that stores those components in a structured manner and represents the entire text. This schema converts the existing text into a form that can later be used for machine learning. Our main source for legislative texts is legislation.gov.uk, which has an application programming interface (API) that allows us to navigate existing legislation and extract it in a variety of formats.
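As a rough illustration of how a document can be requested in a particular format, the sketch below builds a legislation.gov.uk URL from a legislation type, year and number. The function name, the format table and the identifier used in the example are our own illustrative choices; the "data.xml" suffix requests the XML form, and any other suffix should be checked against the site's developer documentation before use.

```python
BASE_URL = "https://www.legislation.gov.uk"

# Suffix appended to a document URL to request a given representation.
# "data.xml" is the XML form; treat any other entry as an assumption
# to be confirmed against the legislation.gov.uk documentation.
FORMAT_SUFFIXES = {"xml": "data.xml", "html": "data.htm"}


def build_legislation_url(doc_type: str, year: int, number: int, fmt: str = "xml") -> str:
    """Build the URL of one piece of legislation in the requested format."""
    if fmt not in FORMAT_SUFFIXES:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{BASE_URL}/{doc_type}/{year}/{number}/{FORMAT_SUFFIXES[fmt]}"


# "uksi" is the type code for UK statutory instruments (secondary legislation).
# The year and number here are illustrative.
url = build_legislation_url("uksi", 2001, 544)
print(url)
```

The URL can then be passed to any HTTP client to download the document; the download itself is left out so that the sketch runs without network access.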

Creating a common data format

By exploring the existing API and drawing on our manual analysis of a variety of legislation, we can derive a logical skeleton structure of the attributes common to the legislative texts on legislation.gov.uk. Creating a common data format for all existing legislation is difficult, as there may be differences between primary and secondary legislation as well as across jurisdictions. The schema must be specific enough to capture the important components of a piece of legislation, yet flexible enough to allow for differences between types of legislation and across jurisdictions. To simplify the process, we first narrow the scope of our analysis to UK secondary legislation, leaving open the possibility of extending the schema to both EU and primary legislation.
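To make the idea of a skeleton structure concrete, here is a minimal sketch of how such a hierarchy might be expressed and serialised to JSON. Every class and field name below is hypothetical and chosen for illustration only; it is not the schema we derived, and the example values are invented.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Paragraph:
    number: str                            # e.g. "1" or "2(a)"
    text: str
    paragraph_type: Optional[str] = None   # attribute derived from the paragraph type


@dataclass
class Part:
    heading: str
    paragraphs: List[Paragraph] = field(default_factory=list)


@dataclass
class Legislation:
    uri: str
    title: str
    legislation_type: str                  # "secondary" in the current scope
    jurisdiction: str                      # "UK" in the current scope
    year: int
    parts: List[Part] = field(default_factory=list)
    # Optional sections default to empty so other types of legislation still fit.
    schedules: List[Part] = field(default_factory=list)


example = Legislation(
    uri="uksi/2001/1",                     # placeholder identifier
    title="Example Order 2001",
    legislation_type="secondary",
    jurisdiction="UK",
    year=2001,
    parts=[Part("Citation", [Paragraph("1", "This Order may be cited as the Example Order 2001.")])],
)

as_json = json.dumps(asdict(example), indent=2)
print(as_json)
```

Keeping the type and jurisdiction as explicit fields, and letting optional sections default to empty lists, is one way to stay specific for secondary legislation while leaving room for primary or EU legislation later.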

A main obstacle is that the schema used by the existing legislation.gov.uk API lacks clear guidelines, so it is difficult to derive a structure that is entirely accurate for all legislation. There may be special cases or inconsistencies between legislative texts that we cannot take into account without a manual assessment of every text. A further point to note is that, for legislation from before 1963, a uniform resource identifier (URI) does not uniquely identify a particular piece of legislation. In some cases legislation has no unique ID, which makes it difficult to tell documents apart. However, since older legislation may not be as relevant as more recent legislation, it is simpler to consider only more recent legislation when validating our schema.

Validating the legislation schema

Once a structure has been defined for the legislative texts, legislation data needs to be extracted from legislation.gov.uk and parsed into the format that the data schema describes. In our case, a JSON schema is used to describe the hierarchical data structures concisely, and the texts are parsed from the XML format provided by the API. The JSON schema itself needs to be checked against a validation schema to ensure that the data types, descriptions and required fields of its inputs are correct. We then need to compare the content of each legislative text in XML format with its corresponding JSON to ensure that we have extracted the full data from legislation.gov.uk correctly. The legislation schema needs to be validated using a large sample of legislation. After a discussion with our legal partners, Simmons & Simmons, we decided to consider a list of secondary legislation made under the Financial Services and Markets Act 2000 (FSMA), since our current focus is on financial services and FSMA is an important piece of legislation in the field, with many pieces of secondary legislation stemming from it.
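The two checks described above, validating the parsed record and comparing its text with the XML source, can be sketched as follows. The XML below is a deliberately simplified stand-in with made-up element names, not the markup returned by the API, and the hand-written field check stands in for a proper JSON Schema validator so that the example needs only the standard library.

```python
import json
import xml.etree.ElementTree as ET

# Simplified, hypothetical XML; real documents use a richer schema.
SAMPLE_XML = """
<legislation>
  <title>Example Order 2001</title>
  <year>2001</year>
  <body>
    <paragraph number="1">This Order may be cited as the Example Order 2001.</paragraph>
    <paragraph number="2">In this Order, "the Act" means the example Act.</paragraph>
  </body>
</legislation>
""".strip()

# Required field name -> expected type (a stand-in for a validation schema).
REQUIRED_FIELDS = {"title": str, "year": int, "paragraphs": list}


def xml_to_record(xml_text):
    """Parse the simplified XML into a JSON-serialisable dictionary."""
    root = ET.fromstring(xml_text)
    return {
        "title": root.findtext("title", default="").strip(),
        "year": int(root.findtext("year")),
        "paragraphs": [
            {"number": p.get("number"), "text": "".join(p.itertext()).strip()}
            for p in root.iter("paragraph")
        ],
    }


def validate_record(record):
    """Return a list of problems; an empty list means the record passed."""
    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing required field: {name}")
        elif not isinstance(record[name], expected_type):
            errors.append(f"{name} should be {expected_type.__name__}")
    return errors


def normalise(text):
    """Collapse whitespace so layout differences do not count as mismatches."""
    return " ".join(text.split())


def text_matches(xml_text, record):
    """Check that all body text in the XML also appears in the JSON record."""
    body = ET.fromstring(xml_text).find("body")
    xml_body = normalise("".join(body.itertext()))
    json_body = normalise(" ".join(p["text"] for p in record["paragraphs"]))
    return xml_body == json_body


# Round-trip through JSON to mimic storing and reloading the parsed record.
record = json.loads(json.dumps(xml_to_record(SAMPLE_XML)))
print(validate_record(record))
print(text_matches(SAMPLE_XML, record))
```

Comparing against all text under the body element, rather than only the paragraphs the parser already knows about, means that text sitting in an element the parser ignores shows up as a mismatch.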

A limitation of using legislation.gov.uk as our main source is that the data served by the API has not been through the full set of quality checks before publication, so its completeness and accuracy are not guaranteed. This makes it difficult to verify whether the format of the data in the API is entirely correct. To parse the legislation data from XML into JSON, we define a set of rules and formatting based on the API and our own analysis of the legislation. However, if the content on the API is wrongly formatted, the texts may be parsed incorrectly and the legislation may not be converted into the correct format. This makes validation much more tedious, as the functions defined to parse the legislation data may not pass all test cases and manual verification is then required to ensure that we have extracted every part of a legislative text.
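One way to cope with wrongly formatted source data is to have the parser record what went wrong instead of failing outright, so that affected documents can be set aside for manual verification. The sketch below illustrates that idea only; the element name "paragraph" and the ParseReport structure are hypothetical and are not our actual parsing rules.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParseReport:
    paragraphs: List[str] = field(default_factory=list)
    issues: List[str] = field(default_factory=list)

    @property
    def needs_manual_review(self) -> bool:
        return bool(self.issues)


def parse_paragraphs(xml_text: str) -> ParseReport:
    """Extract paragraph text, recording problems rather than raising."""
    report = ParseReport()
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Malformed source: nothing can be extracted, so flag the document.
        report.issues.append(f"malformed XML: {exc}")
        return report
    for index, element in enumerate(root.iter("paragraph"), start=1):
        text = " ".join("".join(element.itertext()).split())
        if not text:
            report.issues.append(f"paragraph {index} has no text")
            continue
        report.paragraphs.append(text)
    if not report.paragraphs:
        report.issues.append("no paragraphs found")
    return report


partial = parse_paragraphs("<doc><paragraph>One</paragraph><paragraph> </paragraph></doc>")
broken = parse_paragraphs("<doc><paragraph>One</doc>")
print(partial.paragraphs, partial.issues)
print(broken.needs_manual_review)
```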

The metadata in the API also contains statistics about each piece of legislation, namely the number of paragraphs in each part of the text, which can be used to verify our test cases. However, these statistics are not always correct either, so our function may be valid and yet fail a test case because the figure it is checked against is wrong. Given the inaccuracies of the data in the API, it is difficult to ensure that we have correctly parsed every piece of legislation from legislation.gov.uk into our database schema. We recognise the limitations of using legislation.gov.uk as our main source. However, we consider it to be the most complete and up-to-date archive of legislation that we have access to.
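A count comparison of this kind might look like the sketch below. Because the reported statistics can themselves be wrong, a mismatch is labelled for review rather than treated as a parser failure. The section names and counts are invented for illustration and do not reflect the metadata's real field names.

```python
from typing import Dict


def check_paragraph_counts(parsed: Dict[str, int], reported: Dict[str, int]) -> Dict[str, str]:
    """Compare parsed paragraph counts with the counts reported in the metadata."""
    results = {}
    for section, reported_count in reported.items():
        parsed_count = parsed.get(section, 0)
        if parsed_count == reported_count:
            results[section] = "match"
        else:
            # Either the parser or the metadata may be at fault, so flag for review.
            results[section] = (
                f"review: parsed {parsed_count}, metadata reports {reported_count}"
            )
    return results


# Hypothetical counts for one document.
parsed_counts = {"body": 12, "schedules": 3}
reported_counts = {"body": 12, "schedules": 4}
outcome = check_paragraph_counts(parsed_counts, reported_counts)
print(outcome)
```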