ReCoDE - Analysis of environmental literature with BERTopic and RoBERTa

Explosive Growth of Literature in Environmental and Sustainability Studies

The field of environmental and sustainability studies has witnessed an explosive growth in literature over the past few decades, driven by the increasing global awareness and urgency surrounding environmental issues, climate change, and the need for sustainable practices.

This rapidly expanding body of literature is characterized by its interdisciplinary nature, encompassing a wide range of disciplines such as ecology, climate science, energy, economics, policy, sociology, and more. With a global focus and contributions from countries around the world, the literature base reflects diverse cultural, socio-economic, and geographical contexts, often in multiple languages. Novel research areas and emerging topics, such as circular economy, sustainable urban planning, environmental justice, biodiversity conservation, renewable energy technologies, and ecosystem services, continue to arise as environmental challenges evolve and our understanding deepens. The development of environmental policies, regulations, and international agreements, as well as increased public interest and awareness, have further fueled research and the demand for literature aimed at informing and engaging various stakeholders. Technological advancements in areas like remote sensing, environmental monitoring, and computational modelling have enabled new avenues of research and data-driven studies, contributing to the proliferation of literature. The rise of open access publishing and digital platforms has facilitated the dissemination and accessibility of this constantly evolving and interdisciplinary body of knowledge.

In summary, the explosive growth of the literature across disciplines, geographic regions, languages, and emerging topics makes it difficult to organize, synthesize, and extract insights from this vast and rapidly expanding body of knowledge. This is where Natural Language Processing (NLP) techniques such as topic modelling with BERTopic and language models such as RoBERTa can play a crucial role. Their ability to process large volumes of text, identify semantic topics and patterns, cluster related documents, and (given a suitable multilingual model) handle multiple languages can help researchers, policymakers, and stakeholders navigate this literature more effectively.

Furthermore, for a STEMM PhD student at Imperial stepping into a new field such as sustainability, NLP tools can make literature exploration and review considerably more efficient. This skill eases the transition into interdisciplinary research, helping you navigate diverse datasets and extract insights with greater ease and precision.

The Potential of Topic Modelling

Topic modelling is a technique in NLP and machine learning used to discover abstract "topics" that occur in a collection of documents. The key idea is that documents are made up of mixtures of topics, and that each topic is a probability distribution over words.

More specifically, topic modelling algorithms like Latent Dirichlet Allocation (LDA) work by:

Taking a set of text documents as input.
Learning the topics contained in those documents in an unsupervised way. Each topic is represented as a distribution over the words that describe that topic.
Assigning each document a mixture of topics with different weights/proportions.

For example, if you ran topic modelling on a set of news articles, it may discover topics like "politics", "sports", "technology", etc. The "politics" topic would be made up of words like "government", "election", "policy" with high probabilities. Each document would then be characterized as a mixture of different proportions of these topics.
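The mixture idea above can be made concrete with a small sketch. The topics, words, and probabilities below are invented purely for illustration and are not the output of any model. The sketch only shows how a document's word probabilities follow from its topic weights: p(word | document) is the sum over topics of p(topic | document) times p(word | topic).

```python
# Toy illustration: topics are word distributions, documents are topic mixtures.
# All words and probabilities are made up for illustration only.
topics = {
    "politics": {"government": 0.5, "election": 0.3, "policy": 0.2},
    "sports": {"match": 0.5, "team": 0.4, "policy": 0.1},
}

# A hypothetical document that is 75% "politics" and 25% "sports".
doc_mixture = {"politics": 0.75, "sports": 0.25}


def word_probability(word, mixture, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        weight * topics[name].get(word, 0.0) for name, weight in mixture.items()
    )


def dominant_topic(mixture):
    """Return the topic with the largest weight in the document's mixture."""
    return max(mixture, key=mixture.get)


for word in ("government", "policy", "team"):
    print(word, round(word_probability(word, doc_mixture, topics), 4))
print("dominant topic:", dominant_topic(doc_mixture))
```

A topic modelling algorithm works in the opposite direction: it starts from the observed words and infers both the topic distributions and the per-document mixtures.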

The key benefits of topic modelling include:

Automatically discovering topics without need for labeled data
Understanding the themes/concepts contained in large document collections
Organizing, searching, and navigating over a document corpus by topics
Providing low-dimensional representations of documents based on their topics

Topic modelling has found applications in areas like information retrieval, exploratory data analysis, document clustering and classification, recommendation systems, and more. Popular implementations include Latent Dirichlet Allocation (LDA), Biterm Topic Model (BTM), and techniques leveraging neural embeddings like BERTopic.

Learning Outcomes

By the end of this tutorial, students will have achieved the following learning outcomes:

Proficiency in Text Data Preprocessing: Participants will gain hands-on experience in preprocessing environmental literature datasets, including cleaning, tokenisation, and normalisation techniques, essential for preparing data for NLP analysis.
Understanding the principle of embedding-matrix-based NLP techniques: Through the application of BERTopic for topic modelling and RoBERTa for sentiment analysis, students will develop a deep understanding of advanced NLP methods and their practical implementation in dissecting environmental and sustainability texts and beyond.
Critical Analysis Skills: Participants will learn to critically analyse and interpret the results of NLP analyses, including identifying dominant themes, sentiment shifts, and trends in environmental literature, fostering a nuanced understanding of environmental discourse.
Interpretation and Application: Relying on a real-world example, this project demonstrates how to generate visualisations and reports to present the results of the topic modelling and sentiment analysis, facilitating interpretation and discussion.
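As a rough idea of what the preprocessing outcome covers, here is a minimal sketch of cleaning, tokenisation, and normalisation using only the Python standard library. The stop-word list, the regular expressions, and the sample abstract are illustrative assumptions, not the steps used in the tutorial notebook.

```python
import re

# A tiny, illustrative stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "for", "is", "are"}


def clean_text(text):
    """Strip HTML-like tags and URLs, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)       # remove tags such as <i>
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    return re.sub(r"\s+", " ", text).strip()


def tokenise(text):
    """Lower-case the text and split it into alphabetic (optionally hyphenated) tokens."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def normalise(tokens):
    """Drop stop words and tokens of one or two characters."""
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]


# Hypothetical abstract fragment used only to exercise the functions.
abstract = "The <i>circular economy</i> is key to low-carbon cities: see https://example.org"
tokens = normalise(tokenise(clean_text(abstract)))
print(tokens)
```

Embedding-based methods such as BERTopic generally work on lightly cleaned, full sentences, so the tokenised form is mainly useful for inspecting the data or for bag-of-words models.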

Requirements

It would help a lot if you went through the following Graduate School courses before going through this exemplar:

Introduction to Python
Data Exploration and Visualisation
Data Processing with Python Pandas
Plotting in Python with Matplotlib
Binary Classification of Patent Text Using Natural Language Processing (another ReCoDE project)

Academic

Access to Google Colaboratory
Basic Math (matrices, averages)
Programming skills (Python, pandas, NumPy, TensorFlow)
Machine learning theory (at level of intro to machine learning course)

System

Windows, macOS, or Ubuntu
Python 3.11 or higher
Ideally a GPU, so that the code runs faster

NB: If you have access to High Performance Computing (HPC), we have prepared a specially adapted file for Imperial HPC environments, located under the "notebook" directory. This file is optimized to leverage the computational power and resources available through HPC, enabling more efficient processing and faster execution of your tasks.
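A quick way to check the system requirements above is sketched below. It assumes PyTorch is the backend used to detect a GPU, which this README does not state; if PyTorch is not installed, the GPU status is reported as unknown. The function name `check_environment` is hypothetical.

```python
import sys


def check_environment(min_version=(3, 11)):
    """Report whether Python meets the minimum version and whether a GPU is visible."""
    report = {
        "python_version": sys.version_info[:3],
        "python_ok": sys.version_info[:2] >= min_version,
        "gpu_available": None,  # None means "could not be determined"
    }
    try:
        import torch  # assumption: PyTorch backend; it may not be installed

        report["gpu_available"] = torch.cuda.is_available()
    except ImportError:
        pass
    return report


print(check_environment())
```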

Getting Started

Colab

Please visit this Colab page to access the detailed content of this tutorial: https://colab.research.google.com/drive/1vJzmFTFurlK-NGDw_fhJgxSmcKSZooLn?usp=sharing

A Step-by-Step Case Study using BERTopic to Analyze a Web of Science Dataset

In this step-by-step case study, we focus on applying BERTopic to analyze a sample dataset sourced from Web of Science. The tutorial guides you through the following steps:

Installation and setup of BERTopic
Collecting the raw data and preprocessing the dataset
Implementing BERTopic for topic modeling
Visualizing the inferred topics and interpreting the results
Fine-tuning topic representations
Additional readings about the wider application of BERTopic

By following along, you will gain practical insights into leveraging BERTopic for insightful analysis of scholarly literature from Web of Science.
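The steps listed above might look roughly like the sketch below, after installing the library with `pip install bertopic`. This is not the notebook's code. The Excel export format, the `Abstract` column name, the `min_topic_size` value, and the choice of `KeyBERTInspired` for fine-tuning the representations are all assumptions made for illustration. The heavy libraries are imported inside the functions so that the small cleaning helper can be used on its own.

```python
def prepare_documents(raw_abstracts):
    """Drop missing or blank entries and collapse whitespace in the rest."""
    docs = []
    for item in raw_abstracts:
        if not isinstance(item, str):
            continue  # skips None and NaN values from empty cells
        text = " ".join(item.split())
        if text:
            docs.append(text)
    return docs


def load_abstracts(path, column="Abstract"):
    """Read a spreadsheet export; the column name is an assumption about the file."""
    import pandas as pd

    frame = pd.read_excel(path)
    return prepare_documents(frame[column].tolist())


def fit_topics(docs, min_topic_size=10):
    """Fit BERTopic with its default embedding model."""
    from bertopic import BERTopic

    topic_model = BERTopic(language="english", min_topic_size=min_topic_size)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics


def refine_and_plot(topic_model, docs):
    """Update the topic representations and build two interactive figures."""
    from bertopic.representation import KeyBERTInspired

    topic_model.update_topics(docs, representation_model=KeyBERTInspired())
    return (
        topic_model.visualize_topics(),
        topic_model.visualize_barchart(top_n_topics=10),
    )
```

A typical call sequence would be `docs = load_abstracts("wos_export.xlsx")` (a hypothetical file name), then `model, topics = fit_topics(docs)`, then `model.get_topic_info()` to inspect the topics before calling `refine_and_plot(model, docs)`.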

Sample visualisation results: (figure: Example Image)

A Step-by-Step Case Study using RoBERTa

As in the BERTopic case study above, applying a RoBERTa model involves the following steps:

RoBERTa Initialization: Initializes RoBERTa tokenizer and model
Data Preparation: Loads and preprocesses the dataset
Batch Tokenization: Tokenizes abstracts in batches
Embedding Generation: Generates embeddings using RoBERTa and saves them
Topic Modeling: Applies BERTopic with RoBERTa embeddings
Improve and fine-tune
Visualization

This section focuses on integrating RoBERTa into the topic modeling pipeline, enhancing its analytical capabilities.
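The RoBERTa steps above could be sketched as follows. This is one possible arrangement, not the notebook's implementation. The `roberta-base` checkpoint, the batch size, mean pooling over non-padding tokens, and the output file name are assumptions chosen for illustration.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_abstracts(docs, model_name="roberta-base", batch_size=16):
    """Return one mean-pooled RoBERTa vector per document as a NumPy array."""
    import numpy as np
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device).eval()

    chunks = []
    with torch.no_grad():
        for batch in batched(docs, batch_size):
            encoded = tokenizer(
                batch, padding=True, truncation=True,
                max_length=512, return_tensors="pt",
            ).to(device)
            hidden = model(**encoded).last_hidden_state  # (batch, tokens, dim)
            mask = encoded["attention_mask"].unsqueeze(-1).float()
            # Average token vectors while ignoring padding positions.
            pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
            chunks.append(pooled.cpu().numpy())
    return np.vstack(chunks)


def topics_from_embeddings(docs, embeddings, path="roberta_embeddings.npy"):
    """Save the embeddings, then fit BERTopic on the precomputed vectors."""
    import numpy as np
    from bertopic import BERTopic

    np.save(path, embeddings)
    topic_model = BERTopic()
    topics, _ = topic_model.fit_transform(docs, embeddings=embeddings)
    return topic_model, topics
```

Saving the embeddings means the slow RoBERTa pass only needs to run once. Later runs can reload the array with `numpy.load` and experiment with BERTopic settings directly.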