Binary Classification of Patent Text Using Natural Language Processing

alt text

Description

There were 193,460 European patent applications filed at the European Patent Office in 2022.

The EPO, and several other agencies are really interested in trends associated with the filings of patents to specific areas such as ‘Green Plastics’ (e.g., plastics that can be recycled, or that are made from biodegradable materials).

Typically, to identify whether a patent is related to a certain topic or not, a person would have to manually read through a patent application and assign classification labels to it based on their opinions. Patent applications can be hunderds of pages long, and with the sheer amount of applications that the EPO receive annually, it's easy to see why patent classification is a tedious task!

Hence, there is a need for quick and robust methods of accurately classifying the plethora of patents being submitted to the EPO to highlight any trends in ‘Green Plastics’ filings, or filings in any other areas of interest (e.g., renewable energies, artificial intelligence, augmented reality, drug discovery)

By employing machine learning, in the form of Natural Language Processing algorithms, the cost, and likelihood of misclassification of patents, in any technical area, can be significantly reduced, while speeding up the process.

To address the challenge of classifying patents, the EPO held its first ever Codefest, where it challenged entrants to develop creative and reliable artificial intelligence (AI) models for automating the identification of patents related to green plastics.

To enable contestants to develop their models, the EPO provided access to its extensive dataset of patents and patent classifications. From this, we created a smaller, binary classification dataset, with half of the entries being related to 'Green Plastics' patents, and the other half being related to other patent areas.

Learning Outcomes

What you'll learn from each Notebook:

Introduction (Start with this)

Why we need to classify patents.
What is Tensorflow?
Loading datasets into your workspace.
What Tokenisation, Vectorisation and Word Embeddings are in the context of NLP.
Methods to analyse and understand a dataset.
Training a model using the Term-Frequency - Inverse Document Frequency (TF-IDF) vectorisation technique with a Multinomial Bayes algorithm.

Multi-Layer Perceptron (Complete this after the Introduction Notebook)

What a MultiLayer Perceptron is, and how they can be used for text classification.
How to train a Multilayer Perceptron using Tensorflow's Keras.
Making predictions using a Multilayer Perceptron.
What are hyperparameters and how do they affect the training and performance of machine learning models.
Optimise a model's training pipeline using Callbacks
Visualise a model and plotting training loss curves.
Visualising the structure of a compiled model.
Evaluating the performance of a MultiLayer Perceptron.

Long Short Term Memory Networks (LSTMs) (Complete after Intro and Multi-Layer Perceptron Notebooks)

What a LSTM is, and how they can be used for text classification.
How to train a LSTM using Tensorflow's Keras.
Evaluating the performance of the LSTM.

One-Dimensional Convolutional Neural Networks (1D-CNN) (Complete after Intro and Multi-Layer Perceptron Notebooks)

What a 1D-CNN is, and how they can be used for text classification.
How to train a 1D-CNN using Tensorflow's Keras.
What a pooling layer is?
Evaluating the performance of the 1D-CNN.

Transformers (Complete after Intro, Multi-Layer Perceptron and LSTM Notebooks)

What a Transformer is, and how they can be used for text classification.
How to train a Transformer using Tensorflow's Keras and Object Oriented Programming.
What is attention?
What is a softmax layer?
Evaluating the performance of the Transformer.

Task	Time
Reading	25 hours
Practising	20 hours

Requirements

It would help a lot if you went through the following Graduate School courses before going through this exemplar:

Data Exploration and Visualisation

Data Processing with Python Pandas

Plotting in Python with Matplotlib

Introduction to Machine Learning

Mathematics for Machine Learning Specialisation (Coursera)

Academic

Access to Google Colaboratory
Basic Math (matrices, averages)
Programming skills (python, pandas, numpy, tensorflow)
Machine learning theory (at level of intro to machine learning course)

Getting Started

Just open up the 'Introduction_and_Data_Handling' notebook and click on the blue 'Open in Colab' button to get started.

Project Structure

├── Datasets
|   ├── GreenPatents_Dataset.csv
|   ├── NotGreenPatents_Dataset.csv
├── docs
|   ├── 1_Introduction_and_Data_Handling.ipynb
|   ├── 2_Multilayer_Perceptron_Classification.ipynb
|   ├── 3_LSTM_Classification.ipynb
|   ├── 4_Transfomer_Classification.ipynb
|   ├── 5_Convolutional_1D_Network_Classification.ipynb

License

This project is licensed under the BSD-3-Clause license