To analyse existing legislative texts, we need to consider how legislation is composed and define the key components that should be extracted from it. Essentially, legislation is a compilation of rules consisting of rights and obligations. However, to develop a solution that is meaningful for our end users, we need to break legislative texts down into key components that support meaningful classifications of the text. With this approach, businesses can better identify the requirements that affect them and the implications of those requirements, and policy makers can navigate existing regulations more efficiently.
Legislative texts can be split into useful ‘chunks of text’ that we can derive key components from.
Specifically, each ‘chunk of text’ can be analysed with the questions in mind:
The size of ‘chunk’ to consider is subject to debate. The bigger the ‘chunk’ we take, the more generalised our classification becomes, and the more difficult it is to conduct high-level tagging of the legislation, as the classification will not consider the content in detail. However, taking a very small ‘chunk’ is also problematic: subsections of a piece of legislation are not mutually exclusive, and classifying one paragraph on its own will often be less informative. This is especially problematic when related paragraphs contain no direct cross-references to each other, e.g. “as mentioned in paragraph (2)”. From a technical point of view, it is difficult to detect that one paragraph is an extension of, or is directly related to, another without a cross-reference in either of them. We believe that our tool would be easier to develop if paragraphs that supply supplementary information to a right or obligation always included a cross-reference to the related requirement. We decided that considering ‘chunks of text’ at the paragraph level is the most informative, since each paragraph indicates a specific provision that has a legislative effect.
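To make the cross-reference problem concrete, the following is a minimal sketch of paragraph-level chunking with explicit cross-reference detection. It is not our tool's implementation: the two regular expressions, the function names and the three-paragraph sample text are all assumptions made up for illustration, and real legislation uses far more varied numbering and referencing styles.

```python
import re

# Hypothetical pattern: explicit references such as "paragraph (2)".
CROSS_REF = re.compile(r"\bparagraphs?\s*\((\d+)\)", re.IGNORECASE)
# Hypothetical pattern: a paragraph begins with a bracketed number at line start.
PARA_START = re.compile(r"^\((\d+)\)\s*", re.MULTILINE)


def split_paragraphs(section_text):
    """Return {paragraph number: paragraph text} for one section."""
    matches = list(PARA_START.finditer(section_text))
    paragraphs = {}
    for i, match in enumerate(matches):
        # A paragraph runs until the next paragraph marker or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(section_text)
        paragraphs[int(match.group(1))] = section_text[match.end():end].strip()
    return paragraphs


def cross_references(paragraphs):
    """Map each paragraph number to the paragraph numbers it explicitly cites."""
    return {
        number: sorted({int(ref) for ref in CROSS_REF.findall(text)} - {number})
        for number, text in paragraphs.items()
    }


# Invented sample text, for illustration only.
sample = """(1) A business must keep records of each sale.
(2) The authority may inspect the records mentioned in paragraph (1).
(3) Records must be kept for six years."""

paragraphs = split_paragraphs(sample)
links = cross_references(paragraphs)
print(links)
```

In this invented sample, paragraph (2) is linked to paragraph (1) through its explicit reference, whereas paragraph (3) also elaborates on the record-keeping requirement but produces no link. That second case is the situation described above, where the relationship cannot be recovered from cross-references alone.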
Based on our own analysis of legislative texts, discussions with legislative experts and various government references, we are deriving a list of key attributes on a conceptual basis to create meaningful tags for paragraphs in legislation. From these, we can create a semantic regulatory taxonomy that can be applied to legislation in general. This taxonomy of definitions will need to be tested on sample legislation and refined iteratively. Our sample needs to include both primary and secondary legislation. Given the difference in structure and content between the two, we expect to see different types of paragraphs in each. However, since many pieces of primary legislation also impose requirements directly on businesses, it is important to consider them alongside secondary legislation.
In analysing regulations, we find a number of complexities that make it difficult to naturally classify the paragraphs into distinct attributes. We are currently working with our legal partners, Simmons & Simmons, to define these key attributes, paragraphs and sections on a conceptual basis as a semantic taxonomy that can be used to assess legislative text.
Firstly, determining the types and number of attributes to consider is a challenging task. Assessing the structure of a legislative text helps us determine which sections will include rights and obligations and which will include supplementary information such as conditions, interpretations, sanctions and amendments. However, a paragraph can often fall into multiple attribute types. For example, some paragraphs may impose both a right and an obligation, as a right for the authority is often an obligation for the business. This raises the question of whose point of view we need to consider when classifying a paragraph. In the initial stage, we find it useful to take an entity-neutral point of view, in that we give a paragraph a classification based on the direct requirement it implies. There are also cases that require a degree of domain knowledge, as well as subjectivity, to classify a paragraph, and this makes it difficult to make clear classifications. It is important to define a set of rules that determine how we differentiate paragraphs, for example distinguishing a right from an obligation or a definition from an interpretation. We also find that the wording of paragraphs sometimes makes them difficult to classify, as words may be used interchangeably or there may be double negatives, which can make the language confusing.
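One way to picture a paragraph that falls into several attribute types is as a record carrying more than one tag. The sketch below is only an illustration under our own assumptions: the class names, the `holder` field and the example sentence are hypothetical, and the attribute list is limited to the types named above rather than a finalised taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class Attribute(Enum):
    RIGHT = "right"
    OBLIGATION = "obligation"
    CONDITION = "condition"
    INTERPRETATION = "interpretation"
    SANCTION = "sanction"
    AMENDMENT = "amendment"


@dataclass(frozen=True)
class Tag:
    attribute: Attribute
    holder: str  # hypothetical field: the entity the paragraph directly addresses


@dataclass
class TaggedParagraph:
    text: str
    tags: tuple  # a paragraph may carry several tags

    def attributes(self):
        """Return the distinct attribute types assigned to this paragraph."""
        return {tag.attribute for tag in self.tags}

    def is_multi_attribute(self):
        """True when the paragraph falls into more than one attribute type."""
        return len(self.attributes()) > 1


# Invented example sentence, tagged from the direct wording only.
paragraph = TaggedParagraph(
    text="The authority may inspect the records, and the business must provide access.",
    tags=(
        Tag(Attribute.RIGHT, holder="authority"),
        Tag(Attribute.OBLIGATION, holder="business"),
    ),
)
print(paragraph.is_multi_attribute())
```

Recording the holder alongside each tag keeps the classification tied to what the paragraph states directly, and leaves any entity-specific reading (for example, treating the authority's right as a burden on the business) to a later, separate step.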
Further refining the type of rights and obligations is also important to determine the different types of burdens that are placed on a business. It is difficult to categorise requirements into individual types, as choosing the categories in itself may be subjective and at times the lines may be blurred between one requirement category and another.
Our tool uses machine learning techniques to classify ‘chunks of text’ into their attribute types. The number of attributes that the tool picks up depends on the level of granularity we want it to consider. Increasing the number of key attributes makes the tool pick up more requirements, but may lower the algorithm's accuracy and make the tool less generalisable to a broader range of legislative texts. Conversely, having too few key attributes may yield higher accuracy but less meaningful output and fewer requirements generated. This trade-off is subject to evaluation on both a conceptual and a technical basis.
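The arithmetic behind this trade-off can be sketched by scoring one set of predictions at two levels of granularity. Everything in the block below is an assumption for illustration: the fine-grained category names, the mapping to coarse attributes and the toy gold and predicted labels are invented and are not results from our tool.

```python
# Hypothetical fine-grained categories mapped to coarse attribute types.
FINE_TO_COARSE = {
    "reporting_obligation": "obligation",
    "record_keeping_obligation": "obligation",
    "inspection_right": "right",
    "appeal_right": "right",
}


def accuracy(gold, predicted):
    """Fraction of paragraphs whose predicted label equals the gold label."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def coarsen(labels):
    """Collapse fine-grained labels into their coarse attribute type."""
    return [FINE_TO_COARSE[label] for label in labels]


# Toy labels, invented for illustration only.
gold = [
    "reporting_obligation",
    "record_keeping_obligation",
    "inspection_right",
    "appeal_right",
]
predicted = [
    "record_keeping_obligation",
    "record_keeping_obligation",
    "inspection_right",
    "inspection_right",
]

fine_accuracy = accuracy(gold, predicted)
coarse_accuracy = accuracy(coarsen(gold), coarsen(predicted))
print(fine_accuracy, coarse_accuracy)  # 0.5 1.0
```

Because every correct fine-grained prediction is also correct after coarsening, the coarse score for the same predictions can never be lower than the fine score. This says nothing about how a model trained directly on each label set would behave, which is the part that needs empirical evaluation.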
Given the detailed content of legislation, it is easy to make an attribute list that is too specific. We find that a smaller list of more generalised key attributes is a good starting point for building the attribute list. Testing this taxonomy of definitions iteratively on a sample of several pieces of legislation then allows us to update and refine the list successively, working towards an optimised set of generalised attributes.