โ† Back to Index

🧪 Testing Workflow Guide

Best Practices for CSV Generator Pro

📋 Overview

This guide provides comprehensive testing workflows for CSV Generator Pro, ensuring your generated data meets quality standards and works correctly with downstream systems like AWS S3, Athena, and Redshift.

Why Test?

  • Data Quality: Verify realistic data generation
  • Format Validation: Ensure correct CSV, NDJSON, or Parquet format
  • S3 Integration: Confirm uploads and partitioning work correctly
  • Performance: Validate generation speed and file sizes
  • Deterministic IDs: Verify consistent IDs across datasets

🎯 Basic Testing Workflow

Standard Test Flow

Step 1: Small Sample Test

Generate 10-100 rows to verify field selections and data quality.

Step 2: Preview Validation

Review the data in the preview table, checking for realistic values and proper formatting.

Step 3: Format Export Test

Download in target format (CSV/NDJSON/Parquet) and verify file opens correctly.

Step 4: Scale Up Test

Generate target row count and measure generation time.

Step 5: Integration Test

Upload to S3 or import into target system and verify data loads correctly.

✅ Data Quality Testing

Pre-Generation Checklist

✓ Field Selection: Verify all required fields are selected
✓ Row Count: Set appropriate number (start small for testing)
✓ Output Format: Choose CSV, NDJSON, or Parquet based on use case
✓ Filename: Set descriptive filename for easy identification
✓ Date Settings: Enable random dates if needed for time-series data

Post-Generation Validation

1. Visual Inspection

Check the preview table for:

  • Realistic names (no gibberish)
  • Valid email formats (name@domain.com)
  • Proper phone formatting
  • Cities matching states/countries
  • Reasonable price/revenue values
  • Dates within expected ranges

2. Data Relationships

Verify related fields are consistent:

  • Email: Should be based on firstName + lastName
  • City/State/Country: Should match geographically
  • Product/Category: Should be logically related
  • Revenue: Should equal price × quantity (if all present)
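
These relationship checks can also be scripted for larger samples. A minimal sketch, assuming a CSV export named sample.csv that includes email, price, quantity, and revenue fields (the filename and column names are placeholders; adjust to your field selection):

import csv

# Spot-check field relationships in a generated CSV.
# Filename and column names are placeholders; adjust to your field selection.
with open("sample.csv", newline="", encoding="utf-8") as f:
    for i, row in enumerate(csv.DictReader(f), start=2):
        first, last = row["firstName"].lower(), row["lastName"].lower()
        if first not in row["email"].lower() and last not in row["email"].lower():
            print(f"Row {i}: email {row['email']} does not reference the name")
        expected = float(row["price"]) * int(row["quantity"])
        if abs(float(row["revenue"]) - expected) > 0.01:
            print(f"Row {i}: revenue {row['revenue']} != price x quantity ({expected:.2f})")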

3. Data Distribution

Use preview features to check:

  • Sort by field: Check for proper data spread (not all same values)
  • Filter/search: Verify specific values exist
  • Pagination: Spot-check records throughout dataset

📊 Format Testing

CSV Format Validation

Test Steps:

  1. Generate 100 rows with multiple field types
  2. Download CSV file
  3. Open in Excel or text editor
  4. Verify headers are present in first row
  5. Check for proper comma separation
  6. Verify text fields with commas are quoted
  7. Import into database or analytics tool
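
Steps 4-6 can be scripted as a quick structural check. A minimal sketch using Python's standard csv module, assuming the downloaded file is named test.csv (a placeholder):

import csv

# Structural checks on a downloaded CSV (test.csv is a placeholder filename).
with open("test.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    rows = list(reader)

print("Headers:", header)
print("Row count:", len(rows))
# Quoted fields containing commas are parsed correctly by the csv module,
# so a mismatched column count usually means broken quoting or separators.
bad = [i for i, row in enumerate(rows, start=2) if len(row) != len(header)]
print("Rows with unexpected column count:", bad or "none")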

NDJSON Format Validation

Test Steps:

  1. Generate data and download NDJSON
  2. Open in text editor
  3. Verify each line is valid JSON object
  4. Check no commas between lines
  5. Test parsing with: jq . file.ndjson
  6. Verify field names are consistent across lines
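
Steps 3 and 6 can also be checked with a few lines of Python alongside the jq test. A sketch assuming the file is named file.ndjson, as in step 5:

import json

# Validate an NDJSON file: every line must be a JSON object with consistent keys.
keys = None
with open("file.ndjson", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        obj = json.loads(line)  # raises ValueError if a line is not valid JSON
        if keys is None:
            keys = set(obj)
        elif set(obj) != keys:
            print(f"Line {i}: field names differ from line 1")
print("All lines parsed; fields:", sorted(keys or []))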

Parquet Format Validation

Test Steps:

  1. Generate data and download Parquet file
  2. Check file size (should be 50-90% smaller than CSV)
  3. Upload to S3 for Athena testing
  4. Query with Athena to verify schema
  5. Check data types are preserved
  6. Verify compression is working
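
Schema, row count, and compression (steps 5-6) can also be inspected locally before the Athena checks below. A sketch using the pyarrow library (an assumption, not something the tool requires; install with pip install pyarrow), with test.parquet as a placeholder filename:

import pyarrow.parquet as pq

# Inspect a downloaded Parquet file locally (test.parquet is a placeholder).
pf = pq.ParquetFile("test.parquet")
print(pf.schema_arrow)  # column names and data types
print("Rows:", pf.metadata.num_rows)
print("Row groups:", pf.metadata.num_row_groups)
# Compression codec of the first column chunk in the first row group.
print("Compression:", pf.metadata.row_group(0).column(0).compression)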

Athena Test Query:

-- Create external table
CREATE EXTERNAL TABLE test_data (
  id INT,
  firstName STRING,
  lastName STRING,
  email STRING,
  country STRING
)
STORED AS PARQUET
LOCATION 's3://your-bucket/path/';

-- Test query
SELECT * FROM test_data LIMIT 10;

-- Verify schema
DESCRIBE test_data;

โ˜๏ธ S3 Upload Testing

Pre-Upload Checklist

✓ AWS Credentials: Access Key ID and Secret Key configured
✓ Bucket Exists: Target S3 bucket is created
✓ IAM Permissions: User has PutObject permission
✓ CORS Configured: Bucket allows browser uploads
✓ Region Match: Selected region matches bucket region
✓ Path Template: S3 directory path uses correct placeholders
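
Several of these items (bucket exists, access, CORS) can be verified up front from a script. A minimal boto3 sketch, assuming boto3 is installed, local AWS credentials are configured, and your-bucket / us-east-1 are placeholders:

import boto3
from botocore.exceptions import ClientError

# Sanity-check bucket access and CORS before testing browser uploads.
# "your-bucket" and the region are placeholders; credentials come from
# your local AWS configuration (environment variables or ~/.aws).
s3 = boto3.client("s3", region_name="us-east-1")
s3.head_bucket(Bucket="your-bucket")  # raises ClientError if missing or inaccessible
print("Bucket is reachable")

try:
    cors = s3.get_bucket_cors(Bucket="your-bucket")
    print("CORS rules:", cors["CORSRules"])
except ClientError:
    print("No CORS configuration found; browser uploads will be blocked")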

S3 Upload Test Workflow

Phase 1: Single File Test

  1. Generate: Create 10 rows with date field
  2. Configure: Set S3 credentials and bucket
  3. Test Path: Use simple path like "test/data/"
  4. Upload: Click "Upload to S3"
  5. Verify: Check file appears in S3 console
  6. Download: Download from S3 and verify content
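
Steps 5 and 6 can also be done without opening the S3 console. A boto3 sketch, where the bucket name and key are placeholders for whatever you configured in the app:

import boto3

# Confirm the uploaded object exists and pull it back down for comparison.
# Bucket and key are placeholders.
s3 = boto3.client("s3")
head = s3.head_object(Bucket="your-bucket", Key="test/data/test.csv")
print("Object size:", head["ContentLength"], "bytes")

s3.download_file("your-bucket", "test/data/test.csv", "downloaded.csv")
print("Saved local copy as downloaded.csv")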

Phase 2: Partitioning Test

  1. Enable: Random dates over 3 months
  2. Set Path: Use "data/year=yyyy/month=mm/"
  3. Enable Splits: Check "Split by Date" (Monthly)
  4. Upload: Generate and upload
  5. Verify: Check multiple month directories created
  6. Count: Verify files split correctly by month
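
For steps 5 and 6, listing the uploaded keys and grouping them by partition prefix gives a quick count. A boto3 sketch assuming the data/year=.../month=... layout above and a placeholder bucket name:

import boto3

# Count uploaded files per month partition (bucket name is a placeholder).
s3 = boto3.client("s3")
counts = {}
for page in s3.get_paginator("list_objects_v2").paginate(Bucket="your-bucket", Prefix="data/"):
    for obj in page.get("Contents", []):
        # Keys look like data/year=2024/month=11/<file>; group by the partition levels.
        partition = "/".join(obj["Key"].split("/")[:3])
        counts[partition] = counts.get(partition, 0) + 1

for partition, n in sorted(counts.items()):
    print(partition, "->", n, "file(s)")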

Phase 3: Field-Based Partitioning Test

  1. Add Fields: Include "category" or "country"
  2. Set Path: Use "data/category={{category}}/"
  3. Enable Splits: Check "Split by Fields", select category
  4. Upload: Generate and upload
  5. Verify: Check directories per category value
  6. Validate: Files contain only their category's data
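
For step 6, each partition's files can be spot-checked for stray categories. A sketch assuming CSV output under the data/category=<value>/ layout above; the bucket name and category column are placeholders:

import csv
import io

import boto3

# Verify each category partition only contains rows for that category.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="data/category=")
for obj in resp.get("Contents", []):
    expected = obj["Key"].split("category=")[1].split("/")[0]
    body = s3.get_object(Bucket="your-bucket", Key=obj["Key"])["Body"].read()
    values = {row["category"] for row in csv.DictReader(io.StringIO(body.decode("utf-8")))}
    status = "OK" if values == {expected} else f"MISMATCH: {values}"
    print(obj["Key"], "->", status)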

Troubleshooting S3 Uploads

| Error | Likely Cause | Solution |
| --- | --- | --- |
| Access Denied | IAM permissions issue | Verify IAM policy has PutObject, check bucket name in ARN |
| CORS Error | CORS not configured | Add CORS policy to bucket (see help guide) |
| Invalid Credentials | Wrong access key or secret | Double-check credentials, try regenerating |
| Bucket Not Found | Typo or wrong region | Verify exact bucket name and region |
| Slow Upload | Network or large file | Use file splitting, check connection speed |

🔢 Deterministic ID Testing

Cross-Dataset ID Validation

Test Scenario:

Verify that the same person receives the same ID in two different datasets.

Steps:

  1. Dataset 1 - Employees:
    • Select: id, firstName, lastName, email, department
    • Enable deterministic IDs (Standard method)
    • Generate 100 rows
    • Download as employees.csv
  2. Dataset 2 - Customers:
    • Select: id, firstName, lastName, email, country
    • Enable deterministic IDs (Standard method)
    • Generate 100 rows
    • Download as customers.csv
  3. Verification:
    • Load both files into SQL database or spreadsheet
    • Find common names (e.g., "John Smith")
    • Verify they have identical IDs
    • Test JOIN query works correctly
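
If you prefer to verify in Python before loading a database, a minimal pandas sketch (pandas is an assumption; both downloaded files are expected in the working directory) complements the SQL query below:

import pandas as pd

# Join the two exports on the name/email fields the IDs are derived from,
# then flag anyone whose ID differs between the files.
employees = pd.read_csv("employees.csv")
customers = pd.read_csv("customers.csv")
merged = employees.merge(
    customers, on=["firstName", "lastName", "email"], suffixes=("_emp", "_cust")
)
mismatches = merged[merged["id_emp"] != merged["id_cust"]]
print("People appearing in both files:", len(merged))
print("ID mismatches:", len(mismatches))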

SQL Test Query:

-- Find matching people across datasets
SELECT
  e.id,
  e.firstName,
  e.lastName,
  e.email,
  e.department,
  c.country
FROM employees e
INNER JOIN customers c ON e.id = c.id
WHERE e.firstName = 'John'
  AND e.lastName = 'Smith';

ID Calculator Verification

  1. Generate dataset with deterministic IDs enabled
  2. Pick a specific person from the data (e.g., ID 42857)
  3. Open ID Calculator tool
  4. Enter their firstName, lastName, email
  5. Use same method (Standard)
  6. Click "Calculate ID"
  7. Verify calculated ID matches dataset ID

ID Method Comparison Test

| Method | Test Dataset Size | Expected Uniqueness |
| --- | --- | --- |
| Basic | 1,000 rows | Should see some ID collisions |
| Standard | 50,000 rows | Very few or no collisions |
| Enhanced | 500,000 rows | Virtually no collisions |

⚡ Performance Testing

Generation Speed Benchmarks

| Row Count | Expected Time | Test Result |
| --- | --- | --- |
| 100 | < 1 second | ✓ Pass / ✗ Fail |
| 1,000 | 1-2 seconds | ✓ Pass / ✗ Fail |
| 10,000 | 5-10 seconds | ✓ Pass / ✗ Fail |
| 100,000 | 10-30 seconds | ✓ Pass / ✗ Fail |
| 500,000 | 30-60 seconds | ✓ Pass / ✗ Fail |

💡 Performance Tips
  • Close unnecessary browser tabs before large generations
  • Use Chrome or Edge for best performance
  • Fewer fields = slightly faster generation
  • Large datasets may show "not responding" - this is normal

File Size Comparison Test

Generate the same 10,000-row dataset in all three formats and compare sizes:

| Format | Expected Size | Your Result | Compression Ratio |
| --- | --- | --- | --- |
| CSV | ~2-3 MB | ___ MB | Baseline (1.0x) |
| NDJSON | ~3-4 MB | ___ MB | ~1.2-1.5x CSV |
| Parquet | ~300-500 KB | ___ KB | ~0.1-0.2x CSV |
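
The table can be filled in with a short script. A sketch where the three filenames are placeholders for your actual downloads:

import os

# Compare on-disk sizes of one 10,000-row dataset in each format
# (the filenames are placeholders for your actual downloads).
files = {"CSV": "data.csv", "NDJSON": "data.ndjson", "Parquet": "data.parquet"}
sizes = {fmt: os.path.getsize(path) for fmt, path in files.items()}
baseline = sizes["CSV"]
for fmt, size in sizes.items():
    print(f"{fmt}: {size / 1024:.0f} KB ({size / baseline:.2f}x CSV)")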

🔄 Batch Processing Testing

Batch Test Workflow

Phase 1: Small Batch Test (3 configs)

  1. Create 3 simple configurations:
    • Config 1: 100 rows, CSV, no splits
    • Config 2: 100 rows, NDJSON, split by date
    • Config 3: 100 rows, Parquet, split by fields
  2. Enable "Pause before each upload" for verification
  3. Click "Batch Process All Configs"
  4. Review each dataset before clicking "Next"
  5. Verify all 3 configs complete successfully
  6. Check S3 for all files if uploading

Phase 2: Large Batch Test (12 configs)

  1. Click "Load Built-in Presets" to get all 12 optimized presets
  2. Review each preset configuration
  3. Disable pause mode for automatic processing
  4. Click "Batch Process All Configs" (Generate Only)
  5. Monitor progress through all 12 datasets
  6. Review final summary for success/failed counts
  7. Verify expected number of files generated

Batch Processing Validation

✓ All configs processed: Success count matches total configs
✓ No failures: Failed count is 0
✓ Skipped explained: Any skipped configs have valid reasons
✓ Split settings applied: Files split correctly per config
✓ S3 uploads complete: All files present in S3 (if uploading)
✓ Reasonable time: Total time is acceptable for dataset sizes

🔗 Integration Testing

Athena Integration Test

  1. Generate & Upload: Create dataset with date field, split by date, upload to S3
  2. Create Table: Define external table in Athena
  3. Discover Partitions: Run MSCK REPAIR TABLE
  4. Query Test: Run SELECT with partition filter
  5. Verify Performance: Check "Data scanned" in query stats

Athena Test Script:

-- Create external table with partitions
CREATE EXTERNAL TABLE sales_data (
  id INT,
  firstName STRING,
  lastName STRING,
  email STRING,
  product STRING,
  category STRING,
  price DOUBLE,
  quantity INT,
  revenue DOUBLE
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://your-bucket/sales/';

-- Discover partitions
MSCK REPAIR TABLE sales_data;

-- Verify partitions created
SHOW PARTITIONS sales_data;

-- Test partition pruning
SELECT category, SUM(revenue) as total_revenue
FROM sales_data
WHERE year = '2024' AND month = '11'
GROUP BY category;

Redshift Integration Test

  1. Generate: Create Parquet files with S3 partitioning
  2. Upload: Use S3 upload with proper directory structure
  3. Create Table: Define Redshift table schema
  4. COPY Command: Load data from S3 using COPY
  5. Verify: Query data and check row counts

Redshift Test Script:

-- Create table
CREATE TABLE sales_data (
  id INTEGER,
  firstName VARCHAR(50),
  lastName VARCHAR(50),
  email VARCHAR(100),
  product VARCHAR(100),
  category VARCHAR(50),
  price DECIMAL(10,2),
  quantity INTEGER,
  revenue DECIMAL(12,2)
);

-- COPY from S3
COPY sales_data
FROM 's3://your-bucket/sales/'
IAM_ROLE 'arn:aws:iam::account:role/RedshiftRole'
FORMAT AS PARQUET;

-- Verify load
SELECT COUNT(*) FROM sales_data;
SELECT category, COUNT(*) FROM sales_data GROUP BY category;

💡 Testing Best Practices

General Testing Principles

  • Start Small: Always test with 10-100 rows first
  • Incremental Testing: Test each feature separately before combining
  • Document Results: Keep notes on what works and what doesn't
  • Version Control: Export configurations as backups
  • Automation: Use batch processing for repetitive tests

Common Testing Mistakes to Avoid

โš ๏ธ Avoid These Mistakes
  • Generating 100K+ rows without testing small sample first
  • Not verifying S3 credentials before large uploads
  • Forgetting to enable "Use random dates" for time-series data
  • Using wrong IAM permissions (missing PutObject)
  • Not checking CORS configuration before S3 uploads
  • Mixing deterministic ID methods across related datasets
  • Not validating Parquet files actually load in Athena

Test Data Cleanup

โ„น๏ธ Remember to Clean Up
  • Delete test files from S3 to avoid storage costs
  • Remove test tables from Athena/Redshift
  • Clear browser downloads folder of test files
  • Remove unused configurations to keep presets organized

📚 Quick Reference Checklist

Before Any Test

✓ Fields selected
✓ Row count set
✓ Output format chosen
✓ Filename configured

Before S3 Upload Test

✓ AWS credentials configured
✓ Bucket exists and accessible
✓ IAM policy includes PutObject
✓ CORS configured on bucket
✓ Region matches bucket region
✓ S3 path uses valid placeholders

After Generation

✓ Preview data looks realistic
✓ Row count matches expected
✓ No obvious errors in data
✓ Download/upload buttons enabled