โ† Back to Index

🧪 Testing Workflow Guide

Best Practices for CSV Generator Pro

📋 Overview

This guide provides comprehensive testing workflows for CSV Generator Pro, ensuring your generated data meets quality standards and works correctly with downstream systems like AWS S3, Athena, and Redshift.

Why Test?

  • Data Quality: Verify realistic data generation
  • Format Validation: Ensure correct CSV, NDJSON, or Parquet format
  • S3 Integration: Confirm uploads and partitioning work correctly
  • Performance: Validate generation speed and file sizes
  • Deterministic IDs: Verify consistent IDs across datasets

🎯 Basic Testing Workflow

Standard Test Flow

Step 1: Small Sample Test

Generate 10-100 rows to verify field selections and data quality.

Step 2: Preview Validation

Review the data in the preview table, checking for realistic values and proper formatting.

Step 3: Format Export Test

Download in target format (CSV/NDJSON/Parquet) and verify file opens correctly.

Step 4: Scale Up Test

Generate target row count and measure generation time.

Step 5: Integration Test

Upload to S3 or import into target system and verify data loads correctly.

✅ Data Quality Testing

Pre-Generation Checklist

✓ Field Selection: Verify all required fields are selected
✓ Row Count: Set appropriate number (start small for testing)
✓ Output Format: Choose CSV, NDJSON, or Parquet based on use case
✓ Filename: Set descriptive filename for easy identification
✓ Date Settings: Enable random dates if needed for time-series data

Post-Generation Validation

1. Visual Inspection

Check the preview table for:

  • Realistic names (no gibberish)
  • Valid email formats (name@domain.com)
  • Proper phone formatting
  • Cities matching states/countries
  • Reasonable price/revenue values
  • Dates within expected ranges

2. Data Relationships

Verify related fields are consistent:

  • Email: Should be based on firstName + lastName
  • City/State/Country: Should match geographically
  • Product/Category: Should be logically related
  • Revenue: Should equal price × quantity (if all present)
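
These relationship checks can also be scripted for larger samples. A minimal sketch, assuming a CSV export named sample.csv that includes email, price, quantity, and revenue fields (the filename and column names are placeholders; adjust to your field selection):

import csv

# Spot-check field relationships in a generated CSV.
# Filename and column names are placeholders; adjust to your field selection.
with open("sample.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f), start=2):
        first, last = row["firstName"].lower(), row["lastName"].lower()
        if first not in row["email"].lower() and last not in row["email"].lower():
            print(f"Row {i}: email {row['email']} does not reference the name")
        expected = float(row["price"]) * int(row["quantity"])
        if abs(float(row["revenue"]) - expected) > 0.01:
            print(f"Row {i}: revenue {row['revenue']} != price x quantity ({expected:.2f})")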

3. Data Distribution

Use preview features to check:

  • Sort by field: Check for proper data spread (not all same values)
  • Filter/search: Verify specific values exist
  • Pagination: Spot-check records throughout dataset

📊 Format Testing

CSV Format Validation

Test Steps:

  1. Generate 100 rows with multiple field types
  2. Download CSV file
  3. Open in Excel or text editor
  4. Verify headers are present in first row
  5. Check for proper comma separation
  6. Verify text fields with commas are quoted
  7. Import into database or analytics tool
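
Steps 4-6 can be scripted as a quick structural check. A minimal sketch using Python's standard csv module, assuming the downloaded file is named test.csv (a placeholder):

import csv

# Structural checks on a downloaded CSV (test.csv is a placeholder filename).
with open("test.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

print("Headers:", header)
print("Row count:", len(rows))
# Quoted fields containing commas are parsed correctly by the csv module,
# so a mismatched column count usually means broken quoting or separators.
bad = [i for i, row in enumerate(rows, start=2) if len(row) != len(header)]
print("Rows with unexpected column count:", bad or "none")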

NDJSON Format Validation

Test Steps:

  1. Generate data and download NDJSON
  2. Open in text editor
  3. Verify each line is valid JSON object
  4. Check no commas between lines
  5. Test parsing with: jq . file.ndjson
  6. Verify field names are consistent across lines
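
Steps 3 and 6 can also be checked with a few lines of Python alongside the jq test. A sketch assuming the file is named file.ndjson, as in step 5:

import json

# Validate an NDJSON file: every line must be a JSON object with consistent keys.
keys = None
with open("file.ndjson", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        obj = json.loads(line)  # raises ValueError if a line is not valid JSON
        if keys is None:
            keys = set(obj)
        elif set(obj) != keys:
            print(f"Line {i}: field names differ from line 1")
print("All lines parsed; fields:", sorted(keys or []))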

Parquet Format Validation

Test Steps:

  1. Generate data and download Parquet file
  2. Check file size (should be 50-90% smaller than CSV)
  3. Upload to S3 for Athena testing
  4. Query with Athena to verify schema
  5. Check data types are preserved
  6. Verify compression is working
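
Schema, row count, and compression (steps 5-6) can also be inspected locally before the Athena checks below. A sketch using the pyarrow library (an assumption, not something the tool requires; install with pip install pyarrow), with test.parquet as a placeholder filename:

import pyarrow.parquet as pq

# Inspect a downloaded Parquet file locally (test.parquet is a placeholder).
pf = pq.ParquetFile("test.parquet")
print(pf.schema_arrow)  # column names and data types
print("Rows:", pf.metadata.num_rows)
print("Row groups:", pf.metadata.num_row_groups)
# Compression codec of the first column chunk in the first row group.
print("Compression:", pf.metadata.row_group(0).column(0).compression)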

Athena Test Query:

-- Create external table
CREATE EXTERNAL TABLE test_data (
  id INT,
  firstName STRING,
  lastName STRING,
  email STRING,
  country STRING
)
STORED AS PARQUET
LOCATION 's3://your-bucket/path/';

-- Test query
SELECT * FROM test_data LIMIT 10;

-- Verify schema
DESCRIBE test_data;

โ˜๏ธ S3 Upload Testing

Pre-Upload Checklist

✓ AWS Credentials: Access Key ID and Secret Key configured
✓ Bucket Exists: Target S3 bucket is created
✓ IAM Permissions: User has PutObject permission
✓ CORS Configured: Bucket allows browser uploads
✓ Region Match: Selected region matches bucket region
✓ Path Template: S3 directory path uses correct placeholders
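
Several of these items (bucket exists, access, CORS) can be verified up front from a script. A minimal boto3 sketch, assuming boto3 is installed, local AWS credentials are configured, and your-bucket / us-east-1 are placeholders:

import boto3
from botocore.exceptions import ClientError

# Sanity-check bucket access and CORS before testing browser uploads.
# "your-bucket" and the region are placeholders; credentials come from
# your local AWS configuration (environment variables or ~/.aws).
s3 = boto3.client("s3", region_name="us-east-1")
s3.head_bucket(Bucket="your-bucket")  # raises ClientError if missing or inaccessible
print("Bucket is reachable")

try:
    cors = s3.get_bucket_cors(Bucket="your-bucket")
    print("CORS rules:", cors["CORSRules"])
except ClientError:
    print("No CORS configuration found; browser uploads will be blocked")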

S3 Upload Test Workflow

Phase 1: Single File Test

  1. Generate: Create 10 rows with date field
  2. Configure: Set S3 credentials and bucket
  3. Test Path: Use simple path like "test/data/"
  4. Upload: Click "Upload to S3"
  5. Verify: Check file appears in S3 console
  6. Download: Download from S3 and verify content
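
Steps 5 and 6 can also be done without opening the S3 console. A boto3 sketch, where the bucket name and key are placeholders for whatever you configured in the app:

import boto3

# Confirm the uploaded object exists and pull it back down for comparison.
# Bucket and key are placeholders.
s3 = boto3.client("s3")
head = s3.head_object(Bucket="your-bucket", Key="test/data/test.csv")
print("Object size:", head["ContentLength"], "bytes")

s3.download_file("your-bucket", "test/data/test.csv", "downloaded.csv")
print("Saved local copy as downloaded.csv")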

Phase 2: Partitioning Test

  1. Enable: Random dates over 3 months
  2. Set Path: Use "data/year=yyyy/month=mm/"
  3. Enable Splits: Check "Split by Date" (Monthly)
  4. Upload: Generate and upload
  5. Verify: Check multiple month directories created
  6. Count: Verify files split correctly by month
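
For steps 5 and 6, listing the uploaded keys and grouping them by partition prefix gives a quick count. A boto3 sketch assuming the data/year=.../month=... layout above and a placeholder bucket name:

import boto3

# Count uploaded files per month partition (bucket name is a placeholder).
s3 = boto3.client("s3")
counts = {}
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="your-bucket", Prefix="data/"):
    for obj in page.get("Contents", []):
        # Keys look like data/year=2024/month=11/<file>; group by the partition levels.
        partition = "/".join(obj["Key"].split("/")[:3])
        counts[partition] = counts.get(partition, 0) + 1

for partition, n in sorted(counts.items()):
    print(partition, "->", n, "file(s)")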

Phase 3: Field-Based Partitioning Test

  1. Add Fields: Include "category" or "country"
  2. Set Path: Use "data/category={{category}}/"
  3. Enable Splits: Check "Split by Fields", select category
  4. Upload: Generate and upload
  5. Verify: Check directories per category value
  6. Validate: Files contain only their category's data
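
For step 6, each partition's files can be spot-checked for stray categories. A sketch assuming CSV output under the data/category=<value>/ layout above; the bucket name and category column are placeholders:

import csv
import io

import boto3

# Verify each category partition only contains rows for that category.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="data/category=")
for obj in resp.get("Contents", []):
    expected = obj["Key"].split("category=")[1].split("/")[0]
    body = s3.get_object(Bucket="your-bucket", Key=obj["Key"])["Body"].read()
    values = {row["category"] for row in csv.DictReader(io.StringIO(body.decode("utf-8")))}
    status = "OK" if values == {expected} else f"MISMATCH: {values}"
    print(obj["Key"], "->", status)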

Troubleshooting S3 Uploads

| Error | Likely Cause | Solution |
| --- | --- | --- |
| Access Denied | IAM permissions issue | Verify IAM policy has PutObject, check bucket name in ARN |
| CORS Error | CORS not configured | Add CORS policy to bucket (see help guide) |
| Invalid Credentials | Wrong access key or secret | Double-check credentials, try regenerating |
| Bucket Not Found | Typo or wrong region | Verify exact bucket name and region |
| Slow Upload | Network or large file | Use file splitting, check connection speed |

🔢 Deterministic ID Testing

Cross-Dataset ID Validation

Test Scenario:

Verify that the same person receives the same ID in two different datasets.

Steps:

  1. Dataset 1 - Employees:
    • Select: id, firstName, lastName, email, department
    • Enable deterministic IDs (Standard method)
    • Generate 100 rows
    • Download as employees.csv
  2. Dataset 2 - Customers:
    • Select: id, firstName, lastName, email, country
    • Enable deterministic IDs (Standard method)
    • Generate 100 rows
    • Download as customers.csv
  3. Verification:
    • Load both files into SQL database or spreadsheet
    • Find common names (e.g., "John Smith")
    • Verify they have identical IDs
    • Test JOIN query works correctly
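
If you prefer to verify in Python before loading a database, a minimal pandas sketch (pandas is an assumption; both downloaded files are expected in the working directory) complements the SQL query below:

import pandas as pd

# Join the two exports on the name/email fields the IDs are derived from,
# then flag anyone whose ID differs between the files.
employees = pd.read_csv("employees.csv")
customers = pd.read_csv("customers.csv")
merged = employees.merge(
    customers, on=["firstName", "lastName", "email"], suffixes=("_emp", "_cust")
)
mismatches = merged[merged["id_emp"] != merged["id_cust"]]
print("People appearing in both files:", len(merged))
print("ID mismatches:", len(mismatches))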

SQL Test Query:

-- Find matching people across datasets
SELECT
  e.id,
  e.firstName,
  e.lastName,
  e.email,
  e.department,
  c.country
FROM employees e
INNER JOIN customers c ON e.id = c.id
WHERE e.firstName = 'John'
  AND e.lastName = 'Smith';

ID Calculator Verification

  1. Generate dataset with deterministic IDs enabled
  2. Pick a specific person from the data (e.g., ID 42857)
  3. Open ID Calculator tool
  4. Enter their firstName, lastName, email
  5. Use same method (Standard)
  6. Click "Calculate ID"
  7. Verify calculated ID matches dataset ID

ID Method Comparison Test

| Method | Test Dataset Size | Expected Uniqueness |
| --- | --- | --- |
| Basic | 1,000 rows | Should see some ID collisions |
| Standard | 50,000 rows | Very few or no collisions |
| Enhanced | 500,000 rows | Virtually no collisions |

⚡ Performance Testing

Generation Speed Benchmarks

| Row Count | Expected Time | Test Result |
| --- | --- | --- |
| 100 | < 1 second | ✓ Pass / ✗ Fail |
| 1,000 | 1-2 seconds | ✓ Pass / ✗ Fail |
| 10,000 | 5-10 seconds | ✓ Pass / ✗ Fail |
| 100,000 | 10-30 seconds | ✓ Pass / ✗ Fail |
| 500,000 | 30-60 seconds | ✓ Pass / ✗ Fail |

💡 Performance Tips
  • Close unnecessary browser tabs before large generations
  • Use Chrome or Edge for best performance
  • Fewer fields = slightly faster generation
  • Large datasets may show "not responding" - this is normal

File Size Comparison Test

Generate the same 10,000-row dataset in all three formats and compare sizes:

| Format | Expected Size | Your Result | Compression Ratio |
| --- | --- | --- | --- |
| CSV | ~2-3 MB | ___ MB | Baseline (1.0x) |
| NDJSON | ~3-4 MB | ___ MB | ~1.2-1.5x CSV |
| Parquet | ~300-500 KB | ___ KB | ~0.1-0.2x CSV |
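
The table can be filled in with a short script. A sketch where the three filenames are placeholders for your actual downloads:

import os

# Compare on-disk sizes of one 10,000-row dataset in each format
# (the filenames are placeholders for your actual downloads).
files = {"CSV": "data.csv", "NDJSON": "data.ndjson", "Parquet": "data.parquet"}
sizes = {fmt: os.path.getsize(path) for fmt, path in files.items()}
baseline = sizes["CSV"]
for fmt, size in sizes.items():
    print(f"{fmt}: {size / 1024:.0f} KB ({size / baseline:.2f}x CSV)")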

🔄 Batch Processing Testing

Batch Test Workflow

Phase 1: Small Batch Test (3 configs)

  1. Create 3 simple configurations:
    • Config 1: 100 rows, CSV, no splits
    • Config 2: 100 rows, NDJSON, split by date
    • Config 3: 100 rows, Parquet, split by fields
  2. Enable "Pause before each upload" for verification
  3. Click "Batch Process All Configs"
  4. Review each dataset before clicking "Next"
  5. Verify all 3 configs complete successfully
  6. Check S3 for all files if uploading

Phase 2: Large Batch Test (12 configs)

  1. Click "Load Built-in Presets" to get all 12 optimized presets
  2. Review each preset configuration
  3. Disable pause mode for automatic processing
  4. Click "Batch Process All Configs" (Generate Only)
  5. Monitor progress through all 12 datasets
  6. Review final summary for success/failed counts
  7. Verify expected number of files generated

Batch Processing Validation

✓ All configs processed: Success count matches total configs
✓ No failures: Failed count is 0
✓ Skipped explained: Any skipped configs have valid reasons
✓ Split settings applied: Files split correctly per config
✓ S3 uploads complete: All files present in S3 (if uploading)
✓ Reasonable time: Total time is acceptable for dataset sizes

🔗 Integration Testing

Athena Integration Test

  1. Generate & Upload: Create dataset with date field, split by date, upload to S3
  2. Create Table: Define external table in Athena
  3. Discover Partitions: Run MSCK REPAIR TABLE
  4. Query Test: Run SELECT with partition filter
  5. Verify Performance: Check "Data scanned" in query stats

Athena Test Script:

-- Create external table with partitions
CREATE EXTERNAL TABLE sales_data (
  id INT,
  firstName STRING,
  lastName STRING,
  email STRING,
  product STRING,
  category STRING,
  price DOUBLE,
  quantity INT,
  revenue DOUBLE
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://your-bucket/sales/';

-- Discover partitions
MSCK REPAIR TABLE sales_data;

-- Verify partitions created
SHOW PARTITIONS sales_data;

-- Test partition pruning
SELECT category, SUM(revenue) as total_revenue
FROM sales_data
WHERE year = '2024' AND month = '11'
GROUP BY category;

Redshift Integration Test

  1. Generate: Create Parquet files with S3 partitioning
  2. Upload: Use S3 upload with proper directory structure
  3. Create Table: Define Redshift table schema
  4. COPY Command: Load data from S3 using COPY
  5. Verify: Query data and check row counts

Redshift Test Script:

-- Create table
CREATE TABLE sales_data (
  id INTEGER,
  firstName VARCHAR(50),
  lastName VARCHAR(50),
  email VARCHAR(100),
  product VARCHAR(100),
  category VARCHAR(50),
  price DECIMAL(10,2),
  quantity INTEGER,
  revenue DECIMAL(12,2)
);

-- COPY from S3
COPY sales_data
FROM 's3://your-bucket/sales/'
IAM_ROLE 'arn:aws:iam::account:role/RedshiftRole'
FORMAT AS PARQUET;

-- Verify load
SELECT COUNT(*) FROM sales_data;
SELECT category, COUNT(*) FROM sales_data GROUP BY category;

💡 Testing Best Practices

General Testing Principles

  • Start Small: Always test with 10-100 rows first
  • Incremental Testing: Test each feature separately before combining
  • Document Results: Keep notes on what works and what doesn't
  • Version Control: Export configurations as backups
  • Automation: Use batch processing for repetitive tests

Common Testing Mistakes to Avoid

โš ๏ธ Avoid These Mistakes
  • Generating 100K+ rows without testing small sample first
  • Not verifying S3 credentials before large uploads
  • Forgetting to enable "Use random dates" for time-series data
  • Using wrong IAM permissions (missing PutObject)
  • Not checking CORS configuration before S3 uploads
  • Mixing deterministic ID methods across related datasets
  • Not validating Parquet files actually load in Athena

Test Data Cleanup

โ„น๏ธ Remember to Clean Up
  • Delete test files from S3 to avoid storage costs
  • Remove test tables from Athena/Redshift
  • Clear browser downloads folder of test files
  • Remove unused configurations to keep presets organized

📚 Quick Reference Checklist

Before Any Test

✓ Fields selected
✓ Row count set
✓ Output format chosen
✓ Filename configured

Before S3 Upload Test

✓ AWS credentials configured
✓ Bucket exists and accessible
✓ IAM policy includes PutObject
✓ CORS configured on bucket
✓ Region matches bucket region
✓ S3 path uses valid placeholders

After Generation

✓ Preview data looks realistic
✓ Row count matches expected
✓ No obvious errors in data
✓ Download/upload buttons enabled