The Cambridge Law Corpus: A Corpus for Legal AI Research

AI-generated keywords: Legal AI, Cambridge Law Corpus, CLC, GPT-3, OCR

AI-generated Key Points

  • The Cambridge Law Corpus (CLC) is a valuable resource for legal AI research
  • Consists of over 250,000 court cases from the UK
  • Most cases are from the 21st century, but the corpus includes cases as old as the 16th century
  • First release of the corpus includes raw text and meta-data
  • Annotations on case outcomes, made by legal experts, are provided for 638 cases
  • GPT-3, GPT-4, and RoBERTa models were trained and evaluated using annotated data to establish benchmarks for case outcome extraction
  • Extensive legal and ethical discussion included in the paper due to sensitive nature of legal materials
  • Corpus will only be released for research purposes under certain restrictions to ensure responsible use
  • Creation and curation process involved cleaning and transforming Microsoft Word and PDF files into XML format
  • OCR used to convert PDF files into textual form using Tesseract engine
  • Query-driven approach employed for annotation and curation due to size of corpus
  • CLC project page provides example data and terms of use for interested researchers
  • Specialized large language models based on BERT have been developed specifically for legal applications

Authors: Andreas Östling, Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek

License: CC BY 4.0

Abstract: We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.

Submitted to arXiv on 21 Sep. 2023

Results of the summarizing process for the arXiv paper: 2309.12269v1

The Cambridge Law Corpus (CLC) is a resource for legal AI research consisting of over 250,000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. In this paper, the authors present the first release of the corpus, which includes raw text and meta-data. They also provide annotations on case outcomes for 638 cases, made by legal experts. To establish benchmarks for case outcome extraction, the authors trained and evaluated GPT-3, GPT-4 and RoBERTa models on the annotated data, so that researchers can compare their own models against these benchmarks.

Because legal materials can be sensitive, the paper includes an extensive legal and ethical discussion. As a consequence, the corpus will only be released for research purposes under certain restrictions.

Creating and curating the CLC involved cleaning Microsoft Word and PDF files and transforming them into an XML format. PDF files were converted into text by optical character recognition (OCR) with the Tesseract engine. The resulting text files were then converted to XML, with the original documents stored separately for quality control. Because of the size of the corpus, manual annotation or curation of the entire dataset was not feasible, so a query-driven approach inspired by Voormann and Gut (2008) was employed instead. This iterative process improves the corpus in small incremental steps, such as adding new annotations, meta-data or cases and correcting errors. The CLC project page provides example data and terms of use for interested researchers at https://www.cst.cam.ac.uk/research/srg/projects/law.

Legal language poses unique challenges due to its specialized terminology and strong semantics. Specialized large language models based on BERT have been developed specifically for legal applications (Chalkidis et al., 2020; Masala et al., 2021; Zheng et al., 2021).
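
The conversion to XML and the query-driven curation described in the summary can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the element names (`case`, `meta`, `body`, `p`), the helper functions and the example records are all hypothetical, and the text is assumed to have already been extracted (for PDFs, by an OCR step that is not shown here). Metadata keys are assumed to be valid XML element names.

```python
import re
import xml.etree.ElementTree as ET


def case_to_xml(case_id, metadata, text):
    """Wrap extracted case text and metadata in a small XML document.

    The element names used here are hypothetical, not the CLC schema.
    """
    root = ET.Element("case", attrib={"id": case_id})
    meta = ET.SubElement(root, "meta")
    for key, value in sorted(metadata.items()):
        ET.SubElement(meta, key).text = value
    body = ET.SubElement(root, "body")
    # Treat blank lines in the extracted text as paragraph boundaries.
    for chunk in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(chunk.split())  # collapse line wraps and extra spaces
        if paragraph:
            ET.SubElement(body, "p").text = paragraph
    return root


def missing_field(field):
    """Build a query that matches cases whose metadata lacks `field`."""
    def predicate(case):
        value = case.findtext(f"meta/{field}")
        return value is None or not value.strip()
    return predicate


def find_cases(corpus, predicate):
    """Return the ids of cases matching a query (one step of a curation pass)."""
    return [case.get("id") for case in corpus if predicate(case)]


# Invented example records, not taken from the corpus.
corpus = [
    case_to_xml("ex-1", {"court": "Example Court", "year": "2001"},
                "First paragraph.\n\nSecond   paragraph\nwraps here."),
    case_to_xml("ex-2", {"court": "Example Court"},
                "Only one paragraph."),
]

# Query for cases that still need a year, then fix them in a later small step.
needs_year = find_cases(corpus, missing_field("year"))
print(needs_year)
print(ET.tostring(corpus[0], encoding="unicode"))
```

In a query-driven workflow of this kind, each query targets one narrow problem (here, a missing metadata field), and only the matching cases are reviewed and corrected, which avoids inspecting the whole corpus by hand.
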
Created on 06 Nov. 2023

