Throughout Phase 1 of the BEIS GovTech Challenge, we have explored the potential for building a tool to analyse the stock of existing regulations and identify requirements for businesses. We were able to use Natural Language Processing and Machine Learning to assess the technical feasibility of this tool. This process also took into account a wide view of user requirements that allowed us to refine the key features in an iterative manner.
We previously discussed the process of conducting user research and the opportunities and challenges it brought. From that research, we discovered that policy makers find it particularly difficult to understand and visualise the overall burdens on particular business sectors. Furthermore, because different departments introduce policies concurrently, it is a challenge to track the increasing burdens on businesses. Government lawyers stated that current legislation is not consolidated with its amendments, and that no drafting tool facilitates this consolidation with proposed amendments. Both government lawyers and policy makers find it difficult to identify overlapping regulations. All users pointed out the lack of consistency in the drafting process, in formatting as well as structure.
Among the proposed solutions to these challenges, we identified the opportunity for a drafting tool in which proposed amendments can be consolidated with the most recent version of a piece of legislation. However, our user research showed that an open-source drafting tool is already in use in the EU. After showing users wireframes of our proposed solutions, we found that they are particularly interested in being able to visualise and see analytics of regulations and bodies of legislation. Policy makers said that they want to understand the regulatory landscape for broad regulatory areas such as Health and Safety, Tax and Employment. However, there is no predefined list of these regulatory areas, and implementing this feature directly in our tool would require building another regulatory taxonomy based on a representative sample of legislation. While we aim to cover this in Phase 2, we managed to incorporate another feature into our Phase 1 prototype that partially solves this problem. Since policy makers are often unsure of the exact concept they intend to search for, we adopted a similarity search that uses word embeddings to suggest concepts similar to the words searched. This search is then linked to sector classifications, represented by their 2-digit Standard Industrial Classification (SIC) codes. In this way, users who are unfamiliar with SIC codes can also navigate them more efficiently to get a view of the regulatory landscape for different business sectors.
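To make the similarity search more concrete, here is a minimal sketch of the idea rather than our actual implementation. It assumes cosine similarity as the measure of closeness between word vectors. The three-dimensional vectors and the concept-to-SIC mapping are made up for illustration, and the names similar_concepts and suggest_sic_codes are hypothetical. A real system would load embeddings trained on a large body of text.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. Real word
# embeddings have hundreds of dimensions and are learned from text.
EMBEDDINGS = {
    "restaurant": [0.9, 0.1, 0.0],
    "catering": [0.8, 0.2, 0.1],
    "food": [0.7, 0.3, 0.0],
    "bank": [0.0, 0.9, 0.2],
    "lending": [0.1, 0.8, 0.3],
    "scaffolding": [0.1, 0.1, 0.9],
}

# Hypothetical mapping from a concept to a 2-digit SIC division.
# 56 = food and beverage service activities, 64 = financial service
# activities, 41 = construction of buildings.
CONCEPT_TO_SIC = {
    "restaurant": "56",
    "catering": "56",
    "food": "56",
    "bank": "64",
    "lending": "64",
    "scaffolding": "41",
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_concepts(query, top_n=3):
    """Return the top_n known words closest to the query word."""
    query_vec = EMBEDDINGS[query]
    scored = [
        (word, cosine(query_vec, vec))
        for word, vec in EMBEDDINGS.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


def suggest_sic_codes(query, top_n=3):
    """Collect the SIC divisions of the query and its nearest concepts."""
    codes = []
    words = [query] + [word for word, _ in similar_concepts(query, top_n)]
    for word in words:
        code = CONCEPT_TO_SIC.get(word)
        if code and code not in codes:
            codes.append(code)
    return codes
```

Calling suggest_sic_codes("restaurant", top_n=2) looks up the two nearest concepts and returns the SIC divisions attached to them, so a user can reach a sector without knowing its code in advance.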
Following our Mid-Project review in Phase 1, we focused primarily on individuals involved in the policy making process as users. However, we also consider businesses to be potential users, as we believe our tool could aid those who have to read and comply with legislation. We recognise that legislation often places a disproportionate burden on SMEs, which have more limited resources than larger businesses. As a RegTech firm, many of our clients are small to medium-sized financial institutions, and from this experience we understand the difficulty of navigating and analysing regulations and understanding regulatory requirements. Hence, we are endeavouring to build a tool that alleviates this burden for SMEs, and we are avoiding features that would compromise its accessibility or usability.
Implementing Machine Learning
Users expressed an interest in seeing different types of clauses highlighted within a piece of legislation, and in identifying the number of obligations that legislation imposes on a particular business sector. In Phase 1, we began building a tool that classifies chunks of legislative text into clause types. We then began generating a training data set tagged and labelled with these clause types.
We generated this data set by applying our regulatory taxonomy to legislative data through semi-automated techniques. We used a human-in-the-loop approach, which combines human and machine intelligence to improve the model and generate highly accurate training data. From pre-labelled examples, a machine learning algorithm can learn the association between a given input, in our case a paragraph of legislation, and the attribute expected for it. Once it has been trained on enough samples, the model can begin to make accurate predictions. To keep improving the model, we must train it iteratively over a large sample of paragraphs. So far, we have trained our model on a sample of approximately 1,700 paragraphs of legislation. We found that the model performs better on some clause types than on others. For example, it is currently very good at predicting obligations (burdens) but finds it particularly difficult to identify consequences or sanctions for non-compliance, because such clauses are few and far between.
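The human-in-the-loop cycle described above can be sketched as follows. This is an illustration under stated assumptions and not the model we trained. It uses a small bag-of-words Naive Bayes classifier as a stand-in, the seed paragraphs and labels are invented, the 0.8 confidence threshold is an arbitrary choice, and the names NaiveBayes and route are hypothetical. Predictions the model is confident about are accepted automatically, while the rest go to a human reviewer whose labels are fed back into training.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Bag-of-words Naive Bayes with add-one smoothing (illustrative stand-in)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter()
        self.vocab = set()

    def fit(self, examples):
        # Counts accumulate, so calling fit again adds new labelled examples.
        for text, label in examples:
            self.label_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocab.add(token)

    def predict_proba(self, text):
        total = sum(self.label_counts.values())
        log_scores = {}
        for label, count in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(count / total)
            for token in tokenize(text):
                score += math.log((self.word_counts[label][token] + 1) / denom)
            log_scores[label] = score
        # Normalise the log scores into probabilities that sum to 1.
        highest = max(log_scores.values())
        exp = {label: math.exp(s - highest) for label, s in log_scores.items()}
        norm = sum(exp.values())
        return {label: value / norm for label, value in exp.items()}


def route(model, paragraphs, threshold=0.8):
    """Split paragraphs into auto-accepted labels and a human review queue."""
    auto, review = [], []
    for paragraph in paragraphs:
        probs = model.predict_proba(paragraph)
        label = max(probs, key=probs.get)
        target = auto if probs[label] >= threshold else review
        target.append((paragraph, label))
    return auto, review


# Invented seed examples; real training data would come from legislation.
seed = [
    ("the employer must provide training", "obligation"),
    ("the operator shall keep records", "obligation"),
    ("in these regulations employer means a person who employs staff", "definition"),
    ("in this act operator means the holder of a licence", "definition"),
]

model = NaiveBayes()
model.fit(seed)
auto, review = route(model, ["the operator must keep records", "penalty for failure"])

# A reviewer corrects the uncertain paragraph (here the human label is
# hard-coded as "sanction"), and the correction is added to the training data.
model.fit([(text, "sanction") for text, _ in review])
```

The review queue is where rare clause types tend to end up, so each pass through the loop adds examples of exactly the classes the model knows least about.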
The performance metrics for a model classifying rights, obligations, definitions and scope clauses show an accuracy of 88%. In Phase 2, the training data will be extended significantly to improve these metrics. The clause types could also be tailored more closely to what users would like to see. Furthermore, we aim to gain deeper insights into burdens by creating a taxonomy of burden types and applying supervised learning in a similar way.
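A single accuracy figure can hide weak performance on rare clause types, which is why per-class metrics are worth tracking alongside it. The sketch below computes accuracy together with precision and recall per class. The labels in the example are invented to illustrate the effect and are not our evaluation results, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred):
    """Precision, recall and support for each label."""
    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            true_pos[truth] += 1
        else:
            false_pos[pred] += 1
            false_neg[truth] += 1

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = true_pos[label] + false_pos[label]
        actual = true_pos[label] + false_neg[label]
        report[label] = {
            "precision": true_pos[label] / predicted if predicted else 0.0,
            "recall": true_pos[label] / actual if actual else 0.0,
            "support": actual,
        }
    return report


# Invented example: eight obligations and two sanctions, with one
# sanction misclassified as an obligation.
y_true = ["obligation"] * 8 + ["sanction"] * 2
y_pred = ["obligation"] * 8 + ["obligation", "sanction"]

overall = accuracy(y_true, y_pred)
report = per_class_metrics(y_true, y_pred)
```

In this toy case the overall accuracy looks high, yet the recall for the rare class is much lower, because half of its examples were missed. Reporting both views makes that kind of gap visible as the training data grows.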