8 L0 to L1 Data Processing Pipeline Documentation
8.1 Overview
The L0 to L1 data processing pipeline is designed to transform raw data as delivered (Level 0) into a cleaned, standardized format (Level 1) through a series of systematic steps. This process includes file structure validation, cleaning, standardization, and quality control measures.
8.2 System Architecture
8.2.1 Core Components
- BackupManager (see the sketch after this list)
  - Handles data versioning and backup
  - Provides verification and restoration capabilities
  - Maintains a backup manifest with checksums
  - Implements backup rotation (default: 5 versions)
- L0L1Processor
  - Main processing engine
  - Implements step-wise data transformation
  - Maintains processing metadata and history
  - Generates detailed processing logs and reports
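The backup logic can be pictured with a minimal sketch like the following. The method names, SHA-256 hashing, and manifest layout are illustrative assumptions, not the actual implementation:

import hashlib
import json
import shutil
from datetime import datetime
from pathlib import Path

class BackupManager:
    """Sketch: versioned backups with checksums and rotation."""

    def __init__(self, backup_root="./backups", max_backups=5):
        self.backup_root = Path(backup_root)
        self.backup_root.mkdir(parents=True, exist_ok=True)
        self.max_backups = max_backups
        self.manifest_path = self.backup_root / "backup_manifest.json"

    @staticmethod
    def _checksum(path):
        # SHA-256 of file contents; used later for integrity checks
        return hashlib.sha256(path.read_bytes()).hexdigest()

    def create_backup(self, source_dir):
        """Copy source_dir to a timestamped folder and record checksums."""
        stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        target = self.backup_root / f"backup_{stamp}"
        shutil.copytree(source_dir, target)
        manifest = (json.loads(self.manifest_path.read_text())
                    if self.manifest_path.exists() else {})
        manifest[stamp] = {str(p.relative_to(target)): self._checksum(p)
                           for p in target.rglob("*") if p.is_file()}
        self.manifest_path.write_text(json.dumps(manifest, indent=2))
        self._rotate()
        return target

    def _rotate(self):
        # Keep only the newest max_backups backup folders
        for old in sorted(self.backup_root.glob("backup_*"))[:-self.max_backups]:
            shutil.rmtree(old)

    def verify(self, backup_dir):
        """Re-hash a backup and compare against the recorded manifest."""
        backup_dir = Path(backup_dir)
        stamp = backup_dir.name.removeprefix("backup_")
        recorded = json.loads(self.manifest_path.read_text()).get(stamp, {})
        current = {str(p.relative_to(backup_dir)): self._checksum(p)
                   for p in backup_dir.rglob("*") if p.is_file()}
        return current == recorded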
8.3 Processing Steps
8.3.1 Step 1: Initialization
- Purpose: Set up processing environment and create initial backup
- Implementation:
  - Scans input directory for CSV files
  - Creates backup of raw data
  - Initializes logging system
  - Loads configuration settings
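As a rough sketch, the initialization step could be implemented along these lines. The standalone function is illustrative; the returned keys match those used in the usage example in section 8.8:

import logging
from datetime import datetime
from pathlib import Path

def initialize_processing(input_dir, backup_manager):
    """Illustrative sketch: scan for CSVs, back them up, set up logging."""
    Path("logs").mkdir(exist_ok=True)
    logging.basicConfig(
        filename=f"logs/processing_{datetime.now():%Y%m%d_%H%M%S}.log",
        level=logging.INFO,
    )
    files = sorted(Path(input_dir).glob("*.csv"))
    backup_manager.create_backup(input_dir)  # snapshot raw data first
    logging.info("Found %d CSV files in %s", len(files), input_dir)
    return {"files_found": len(files), "file_list": [f.name for f in files]}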
8.3.2 Step 2: Empty Element Analysis
- Purpose: Identify data quality issues
- Detects:
  - Headerless columns (e.g., 'Unnamed: X')
  - Completely null columns
  - Empty columns
  - Null rows
  - Overall emptiness statistics
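A minimal pandas sketch of these checks for a single DataFrame (the returned key names are illustrative):

import pandas as pd

def analyze_empty_elements(df):
    """Sketch of the checks listed above, for one DataFrame."""
    headerless = [c for c in df.columns if str(c).startswith("Unnamed:")]
    null_columns = [c for c in df.columns if df[c].isna().all()]
    # "Empty" here means whitespace-only strings rather than true nulls
    empty_columns = [c for c in df.columns
                     if df[c].astype(str).str.strip().eq("").all()]
    null_rows = int(df.isna().all(axis=1).sum())
    emptiness = float(df.isna().mean().mean())  # overall fraction of nulls
    return {
        "headerless_columns": headerless,
        "null_columns": null_columns,
        "empty_columns": empty_columns,
        "null_rows": null_rows,
        "overall_emptiness": emptiness,
    }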
8.3.3 Step 3: Empty Element Cleaning
- Purpose: Remove problematic data elements
- Configurable Actions:
  - Remove headerless columns
  - Remove null columns
  - Remove empty columns
  - Remove null rows
- Decision Matrix:
{
    'remove_headerless': True/False,
    'remove_null_columns': True/False,
    'remove_empty_columns': True/False,
    'remove_null_rows': True/False
}
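Applied to a pandas DataFrame, a decision matrix like this could be consumed by a helper along these lines (a sketch, not the processor's actual internals):

import pandas as pd

def clean_empty_elements(df, decisions):
    """Apply one file's cleaning decisions, using the keys above."""
    if decisions.get("remove_headerless"):
        df = df.loc[:, [c for c in df.columns
                        if not str(c).startswith("Unnamed:")]]
    if decisions.get("remove_null_columns"):
        df = df.dropna(axis=1, how="all")
    if decisions.get("remove_empty_columns"):
        # Columns containing only whitespace-only strings
        blank = df.columns[df.astype(str)
                             .apply(lambda s: s.str.strip().eq("").all())]
        df = df.drop(columns=blank)
    if decisions.get("remove_null_rows"):
        df = df.dropna(axis=0, how="all")
    return df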
8.3.4 Step 4: Name Standardization
- Purpose: Ensure consistent column naming
- Standardization Rules:
  - Convert spaces to underscores
  - Remove special characters
  - Normalize case (upper/lower)
  - Remove duplicate underscores
- Configuration:
{
    "patterns": [
        ["\\s+", "_"],
        ["[^\\w_]", ""],
        ["_{2,}", "_"]
    ],
    "case": "lower"
}
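These rules are plain regex substitutions applied in order. A small sketch, assuming the configuration above (standardize_name is an illustrative helper):

import re

PATTERNS = [(r"\s+", "_"), (r"[^\w_]", ""), (r"_{2,}", "_")]

def standardize_name(name, patterns=PATTERNS, case="lower"):
    """Apply each (pattern, replacement) in order, then normalize case."""
    for pattern, replacement in patterns:
        name = re.sub(pattern, replacement, name)
    return name.lower() if case == "lower" else name.upper()

# standardize_name("Site ID")         -> "site_id"
# standardize_name("Sample % (w/w)")  -> "sample_ww"

Note that order matters: spaces become underscores before special characters are stripped, so "Sample % (w/w)" collapses to "sample_ww" rather than "sampleww".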
8.3.5 Step 5: Missing Value Harmonization
- Purpose: Standardize missing value representations
- Default Missing Values:
  - NA, N/A, None, NULL, ""
  - "No Data", "Unknown"
  - "Not Available"
  - -999, -9999, -999.0, -9999.0
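With pandas, harmonization reduces to a single replace that maps every listed representation to NaN (a sketch; the real processor may operate per file):

import numpy as np
import pandas as pd

MISSING_VALUES = ['NA', 'N/A', 'None', 'NULL', '',
                  'No Data', 'Unknown', 'Not Available',
                  -999, -9999, -999.0, -9999.0]

def harmonize_missing_values(df):
    """Map every listed representation of missingness to a single NaN."""
    return df.replace(MISSING_VALUES, np.nan)

# Example: both sentinels below become NaN
df = pd.DataFrame({"site": ["A", "No Data"], "depth": [-999, 12.5]})
print(harmonize_missing_values(df))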
8.3.6 Step 6: Export
- Purpose: Save processed data and documentation
- Outputs:
  - Processed CSV files
  - Processing summary report
  - Detailed processing history
  - HTML report with visualizations
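A minimal sketch of the export step, assuming the cleaned DataFrames are held in a dict keyed by filename (the summary format is an assumption):

import json
from pathlib import Path

def export_processed_files(frames, output_dir, summary):
    """Illustrative: write each cleaned DataFrame plus a summary report."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, df in frames.items():
        df.to_csv(out / filename, index=False)
    # summary is assumed to be a JSON-serializable dict of statistics
    (out / "processing_summary.json").write_text(json.dumps(summary, indent=2))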
8.4 Quality Control Measures
- Backup Verification
  - Checksum validation
  - File integrity checks
  - Restoration capability
- Processing Metadata
  - Detailed tracking of all changes
  - Before/after statistics
  - Decision tracking
  - Timestamp logging
- Error Handling
  - Custom exceptions
  - Detailed error logging
  - Rollback capabilities
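Error handling with rollback might be wired together like this sketch. ProcessingError and the restore call are assumptions based on the capabilities listed above, not confirmed APIs:

import logging

class ProcessingError(Exception):
    """Hypothetical custom exception for pipeline failures."""

def run_step(step_fn, backup_manager, backup_dir):
    """Run one processing step; restore the backup if it fails."""
    try:
        return step_fn()
    except Exception as exc:
        logging.error("Step failed, rolling back: %s", exc)
        backup_manager.restore(backup_dir)  # restore() is an assumed API
        raise ProcessingError(str(exc)) from exc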
8.5 File Structure
project/
├── input/
│ └── abr-nps-L0/ # Raw data (Level 0)
├── output/
│ └── abr-nps-L1/ # Processed data (Level 1)
├── backups/
│ ├── backup_manifest.json
│ └── backup_{timestamp}/
├── logs/
│ └── processing_{timestamp}.log
└── reports/
└── L0L1_processing_history.html
8.6 Reporting
The system generates comprehensive HTML reports containing:
- Processing statistics
- Data quality metrics
- Column transformations
- Missing value analysis
- Processing decisions
- Error logs
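The generate_html_report call used in the usage example below could, in sketch form, look like this. The structure of the history records (name, timestamp, and details keys) is an assumption:

from pathlib import Path

def generate_html_report(history, output_path):
    """Render the processing history as a minimal HTML table."""
    rows = "".join(
        f"<tr><td>{step['name']}</td><td>{step['timestamp']}</td>"
        f"<td>{step['details']}</td></tr>"
        for step in history
    )
    html = ("<html><body><h1>L0 to L1 Processing History</h1>"
            "<table border='1'><tr><th>Step</th><th>Time</th><th>Details</th></tr>"
            f"{rows}</table></body></html>")
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(html)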
8.7 Best Practices
- Data Integrity
- Always create backups before processing
- Verify data integrity after each step
- Maintain detailed processing logs
- Configuration Management
- Use consistent configuration across datasets
- Document any configuration changes
- Version control configuration files
- Quality Control
- Review processing reports
- Validate outputs against expectations
- Monitor error logs
8.8 Usage Example
# Initialize processor with custom configuration
config = {
    'missing_values': [
        'NA', 'N/A', 'None', 'NULL', '',
        'No Data', 'Unknown', 'Not Available',
        -999, -9999, -999.0, -9999.0
    ],
    'column_name_standardization': {
        'patterns': [
            (r'\s+', '_'),    # Convert spaces to underscores
            (r'[^\w_]', ''),  # Remove special characters
            (r'_{2,}', '_')   # Remove duplicate underscores
        ],
        'case': 'lower'       # Convert to lowercase
    }
}

# Initialize processor
processor = L0L1Processor(
    input_dir="./abr-nps-L0",
    output_dir="./abr-nps-L1",
    config_file=None,  # Using default config above
    max_backups=5
)

# Step 1: Initialize and scan directory
scan_results = processor.initialize_processing()
print(f"Found {scan_results['files_found']} files")

# Step 2: Analyze empty elements
empty_analysis = processor.analyze_empty_elements()

# Step 3: Set up cleaning decisions matrix
cleaning_decisions = {
    'abr-nps-annotations.csv': {
        'remove_headerless': True,     # Remove columns like 'Unnamed: X'
        'remove_null_columns': True,   # Remove completely null columns
        'remove_empty_columns': False, # Keep partially empty columns
        'remove_null_rows': True       # Remove completely empty rows
    },
    'abr_nps_els.csv': {
        'remove_headerless': True,
        'remove_null_columns': True,
        'remove_empty_columns': False,
        'remove_null_rows': True
    }
    # ... repeat for other files
}

# Alternative: Apply same decisions to all files
file_list = scan_results['file_list']
default_decisions = {
    'remove_headerless': True,
    'remove_null_columns': True,
    'remove_empty_columns': False,
    'remove_null_rows': True
}
cleaning_decisions = {
    filename: default_decisions.copy() for filename in file_list
}

# Step 3: Clean empty elements
cleaning_results = processor.clean_empty_elements(cleaning_decisions)

# Step 4: Standardize column names (using config defined above)
standardization_results = processor.standardize_names()

# Example of standardization results:
# Original: "Site ID" → standardized: "site_id"
# Original: "Sample % (w/w)" → standardized: "sample_ww"
# Original: "GPS__Latitude" → standardized: "gps_latitude"

# Step 5: Harmonize missing values
harmonization_results = processor.harmonize_missing_values()

# Step 6: Export processed files
export_results = processor.export_processed_files()

# Generate detailed HTML report
history = processor.get_processing_history()
generate_html_report(history, "./processing-logs/L0L1_processing_history.html")

# Example of accessing results
print("\nProcessing Summary:")
print("------------------")
for filename, results in cleaning_results.items():
    print(f"\n{filename}:")
    print(f"Original shape: {results['original_shape']}")
    print(f"Cleaned shape: {results['cleaned_shape']}")
    print(f"Columns removed: {results['changes']['columns_removed']}")
    print(f"Rows removed: {results['changes']['rows_removed']}")

# Example of accessing standardization results
print("\nColumn Name Standardization:")
print("---------------------------")
for filename, results in standardization_results.items():
    print(f"\n{filename}:")
    for orig, new in results['name_changes'].items():
        if orig != new:  # Only show changed names
            print(f"{orig} → {new}")