# Data Management and Processing

## Data Management in Research

### Best Practices for Experimental Data Management
Good data management is essential for reproducible research. When working with experimental data, consider these practices:
- Consistent file naming: Use descriptive, consistent naming schemes (e.g., `RUN_H_25C-100bar_7.xlsx` clearly indicates temperature and pressure conditions).
- Data organisation: Organise data in a logical folder structure with clear separation between raw data and processed outputs.
- Metadata recording: Document experimental conditions, sample details, and measurement parameters.
- Version control: Track changes to your data processing scripts using version control systems such as Git.
- Data backup: Regularly back up your research data to prevent loss. Utilise cloud storage services (like OneDrive) when possible.
### Data Structure for Time-Lag Analysis

For permeation experiments specifically, the data should include the following (a minimal illustration follows the list):
- Time measurements (seconds)
- Pressure readings (bar or barg)
- Temperature readings (°C)
- Gas concentration measurements (ppm)
- Flow rates (ml/min)
- Sample dimensions (thickness, diameter)
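As a sketch (the values below are invented for illustration), a DataFrame with the core measured columns might look like:

```python
import pandas as pd

# Illustrative values only -- column headers match the application's expectations
df = pd.DataFrame({
    't / s':         [0.0, 10.0, 20.0],   # time [s]
    'P_cell / barg': [99.0, 99.1, 99.0],  # cell pressure [barg]
    'T / °C':        [25.0, 25.1, 25.0],  # temperature [°C]
    'y_CO2 / ppm':   [0.0, 1.2, 3.5],     # CO2 concentration [ppm]
})
```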
## Data Processing Workflow

The application implements a data processing pipeline in `data_processing.py` consisting of several key steps.
### 1. Loading Data

Raw experimental data is loaded from Excel files using the `load_data` function.
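As a rough sketch (the filetype dispatch below is an assumption; see `data_processing.py` for the actual implementation), `load_data` might look like:

```python
import os
import pandas as pd

def load_data(filepath: str) -> pd.DataFrame:
    """Sketch: load raw experimental data from an Excel or CSV file."""
    ext = os.path.splitext(filepath)[1].lower()
    if ext in ('.xlsx', '.xls'):
        return pd.read_excel(filepath)
    if ext == '.csv':
        return pd.read_csv(filepath)
    raise ValueError(f"Unsupported file format: {ext}")
```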
### 2. Data Preprocessing

The `preprocess_data` function performs several preprocessing steps:
- Baseline correction: Remove background signals from gas concentration measurements.

  ```python
  # `baseline` holds the background signal level subtracted from the raw CO2 readings
  df['y_CO2_bl / ppm'] = df['y_CO2 / ppm'] - baseline
  ```
- Pressure conversion: Convert gauge pressure readings to absolute pressure in standard units (bar).

  ```python
  # Add standard atmospheric pressure (1.01325 bar) to convert barg to bar
  df['P_cell / bar'] = df['P_cell / barg'] + 1.01325
  ```
- Flux calculation: Calculate the gas flux through the membrane from the polymer disc diameter (`d_cm`), used to compute the disc area, and the N₂ sweep gas flow rate (`qN2_mlmin`).

  ```python
  # Calculate area of disc
  A_cm2 = (math.pi * d_cm**2) / 4  # [cm^2]

  # Specify mass flow rate of N2 in [ml/min]
  if qN2_mlmin is not None:
      df['qN2 / ml min^-1'] = qN2_mlmin
  elif 'qN2 / ml min^-1' not in df.columns:
      raise ValueError("Column 'qN2 / ml min^-1' does not exist in the DataFrame.")

  # Calculate flux
  if unit == 'cm^3 cm^-2 s^-1' or unit == 'None':
      df['flux / cm^3(STP) cm^-2 s^-1'] = (df['qN2 / ml min^-1'] / 60) * (df['y_CO2_bl / ppm'] * 1e-6) / A_cm2
  ```
- Cumulative flux calculation: Integrate the experimental flux over time to obtain the cumulative flux.

  ```python
  # Rectangle-rule integration: flux multiplied by each time step, then accumulated
  df['cumulative flux / cm^3(STP) cm^-2'] = (df['flux / cm^3(STP) cm^-2 s^-1'] * df['t / s'].diff().fillna(0)).cumsum()
  ```
### 3. Stabilisation Time Detection

An important aspect of time-lag analysis is determining when steady-state diffusion has been reached. This is handled by the `identify_stabilisation_time` function, which performs the following steps (a consolidated sketch follows the list):
- Calculates the gradient of the specified data column.

  ```python
  df['gradient'] = df[column].diff() / df['t / s'].diff()
  ```
- Examines changes in this gradient over a rolling window.

  ```python
  gradient_pct_change = (df[column].diff() / df['t / s'].diff()).pct_change().abs()
  df['pct_change_mean'] = gradient_pct_change.rolling(window=window).mean()
  df['pct_change_min'] = gradient_pct_change.rolling(window=window).min()
  df['pct_change_max'] = gradient_pct_change.rolling(window=window).max()
  df['pct_change_median'] = gradient_pct_change.rolling(window=window).median()
  ```
- Identifies when changes fall below a specified threshold.

  ```python
  stabilisation_index = df[df['pct_change_mean'] <= threshold].index[0]
  stabilisation_time = df.loc[stabilisation_index, 't / s']
  ```
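Putting these steps together, a simplified sketch of the detection logic could read as follows (the default `window` and `threshold` values are assumptions, and the real function computes additional rolling statistics):

```python
import pandas as pd

def identify_stabilisation_time(df: pd.DataFrame, column: str,
                                window: int = 10,
                                threshold: float = 0.01) -> float:
    """Sketch: return the time at which the gradient of `column` stops
    changing appreciably, i.e. steady-state diffusion has been reached."""
    gradient = df[column].diff() / df['t / s'].diff()
    pct_change_mean = gradient.pct_change().abs().rolling(window=window).mean()
    stable = df[pct_change_mean <= threshold]
    if stable.empty:
        raise ValueError("No stabilisation detected below the given threshold.")
    return stable['t / s'].iloc[0]
```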
## Input, Output, and Configuration

### Data Input

The application expects data files in the `data/` directory, with experimental data organised in Excel files. Sample data files are available in this location for reference.
```
data/
    RUN_H_25C-100bar_7.xlsx
    RUN_H_25C-100bar_8.xlsx
    RUN_H_25C-100bar_9.xlsx
    RUN_H_25C-200bar_2.xlsx
    RUN_H_25C-50bar.xlsx
    ...
```
When using the application, ensure your data files adhere to the following specifications:
- Required columns: Each file must contain, at minimum, the following data columns with the specified headers (see the validation sketch after this list):
  - `t / s`: Time, measured in seconds.
  - `P_cell / barg`: Cell pressure, measured in bar gauge.
  - `T / °C`: Temperature, measured in degrees Celsius.
  - `y_CO2 / ppm`: Carbon dioxide concentration, measured in parts per million.
- Unit consistency: Ensure that units are consistent across all measurements, both within each file and between files intended for comparative analysis.
- File format: Data must be provided in either Excel (`.xlsx`, `.xls`) or CSV (`.csv`) format.
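A minimal check along these lines (not part of the application itself) can catch malformed files early:

```python
REQUIRED_COLUMNS = ['t / s', 'P_cell / barg', 'T / °C', 'y_CO2 / ppm']

def validate_columns(df):
    """Raise a descriptive error if any required column is missing."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
```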
### Output Data

Following the pre-processing steps in `data_processing.py`, the main analysis is performed in `calculation.py` (explained in depth in 04-TimelagAnalysis-Implementation). These steps are brought together in the complete workflow function `time_lag_analysis_workflow` in `time_lag_analysis.py`, which is explained in depth in 08-Application-Workflow. The workflow can generate several output files (a sketch of a typical invocation follows the list):
- Preprocessed data: Contains the cleaned and transformed experimental data.
- Time lag analysis results: Contains the calculated parameters (diffusion coefficient, permeability, etc.).
- Concentration profiles: Shows how gas concentration changes with position and time.
- Flux profiles: Shows the calculated gas flux over time.
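For orientation, a typical invocation might look like the following; the actual signature and options of `time_lag_analysis_workflow` are documented in 08-Application-Workflow, so treat this call as hypothetical:

```python
from time_lag_analysis import time_lag_analysis_workflow

# Hypothetical call -- consult 08-Application-Workflow for the real signature
time_lag_analysis_workflow('data/RUN_H_25C-100bar_7.xlsx')
```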
### Experimental Metadata Configuration

The `util.py` file contains configuration dictionaries for experimental parameters:
```python
thickness_dict = {
    'RUN_H_25C-50bar': 0.1,
    'RUN_H_25C-100bar_7': 0.1,
    # ... other thickness values
}  # [cm]

qN2_dict = {
    'RUN_H_25C-50bar': 8.0,
    'RUN_H_25C-100bar_7': 8.0,
    # ... other flow rate values
}  # [ml min^-1]
```
These dictionaries provide essential metadata for each experiment:

- `thickness_dict`: Membrane thickness in cm
- `qN2_dict`: Nitrogen flow rate in ml/min
This separation centralises the metadata in `util.py`, enhancing maintainability: it ensures consistency and simplifies updates across the analysis. For example, when calculating permeability for `'RUN_H_25C-100bar_7'`, the code retrieves the thickness `0.1` directly from `thickness_dict`. If this value needed correction, it would only need to be changed once, in `util.py`.
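In code, that lookup is a plain dictionary access (the variable names here are illustrative):

```python
from util import thickness_dict, qN2_dict

run_id = 'RUN_H_25C-100bar_7'
thickness_cm = thickness_dict[run_id]  # 0.1 [cm]
qN2_mlmin = qN2_dict[run_id]           # 8.0 [ml min^-1]
```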
### Data Flow Diagram

```mermaid
flowchart TD
    A["Raw Data Files (.xlsx, .csv)"] --> B["Data Loading: <br>load_data()"]
    B --> C["Data Preprocessing: <br>preprocess_data()"]
    D["Membrane Data:<br>thickness_dict <br>qN2_dict"] --> E["Data Transformation:<br>- Baseline correction <br>- Unit conversion <br>- Flux calculation"]
    C --> E
    E --> F["Stabilisation Detection:<br>identify_stabilisation_time()"]
    F --> G["Time-Lag Analysis:<br>time_lag_analysis_workflow()"]
    G --> H["Preprocessed Data (.csv)"]
    G --> I["Analysis Results (.csv)"]
    G --> J["Profile Data (.csv)"]
```
## Extending the Data Processing Pipeline

To implement your own data processing steps (a hypothetical example follows the list):

- Add new functions to `data_processing.py`.
- Integrate them into the `preprocess_data` function.
- Update the `time_lag_analysis_workflow` function to use your new processing steps.
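As a hypothetical example (the function below is not part of the codebase), a rolling-mean smoothing step could be added to `data_processing.py` and then called from `preprocess_data`, after baseline correction:

```python
# Hypothetical addition to data_processing.py
def smooth_concentration(df, window=5):
    """Apply a rolling-mean filter to the baseline-corrected CO2 signal."""
    df['y_CO2_bl / ppm'] = df['y_CO2_bl / ppm'].rolling(
        window=window, min_periods=1).mean()
    return df
```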