📋 Overview
This guide provides comprehensive testing workflows for CSV Generator Pro, ensuring your generated data meets quality standards and works correctly with downstream systems like AWS S3, Athena, and Redshift.
Why Test?
- Data Quality: Verify realistic data generation
- Format Validation: Ensure correct CSV, NDJSON, or Parquet format
- S3 Integration: Confirm uploads and partitioning work correctly
- Performance: Validate generation speed and file sizes
- Deterministic IDs: Verify consistent IDs across datasets
🎯 Basic Testing Workflow
Standard Test Flow
- Preview: Generate 10-100 rows to verify field selections and data quality.
- Inspect: Review the data in the preview table, checking for realistic values and proper formatting.
- Download: Download in the target format (CSV/NDJSON/Parquet) and verify the file opens correctly.
- Scale Up: Generate the target row count and measure generation time.
- Integrate: Upload to S3 or import into the target system and verify the data loads correctly.
✅ Data Quality Testing
Pre-Generation Checklist
Post-Generation Validation
1. Visual Inspection
Check the preview table for:
- Realistic names (no gibberish)
- Valid email formats (name@domain.com)
- Proper phone formatting
- Cities matching states/countries
- Reasonable price/revenue values
- Dates within expected ranges
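Several of these checks can be scripted rather than eyeballed. A minimal sketch using only the Python standard library; the `email` and `signupDate` column names and the sample rows are assumptions, so adapt them to the fields you actually selected:

```python
import csv, io, re
from datetime import date

# Hypothetical sample rows, standing in for a downloaded preview export.
SAMPLE_CSV = """firstName,lastName,email,signupDate
John,Smith,john.smith@example.com,2024-03-15
Jane,Doe,jane.doe@example.com,2024-07-02
"""

EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

def check_rows(text, start=date(2024, 1, 1), end=date(2024, 12, 31)):
    """Return a list of (line_number, problem) tuples for failed checks."""
    problems = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        if not EMAIL_RE.match(row["email"]):
            problems.append((i, "bad email format"))
        d = date.fromisoformat(row["signupDate"])
        if not (start <= d <= end):
            problems.append((i, "date out of range"))
    return problems

print(check_rows(SAMPLE_CSV))  # → [] when every row passes
```

An empty result means every sampled row passed the email-format and date-range checks.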
2. Data Relationships
Verify related fields are consistent:
- Email: Should be based on firstName + lastName
- City/State/Country: Should match geographically
- Product/Category: Should be logically related
- Revenue: Should equal price × quantity (if all present)
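These relationship checks are easy to script. A sketch in Python; the field names and the `firstName.lastName` email convention are assumptions based on the checklist above, not the tool's documented behavior:

```python
# Minimal per-row consistency checks for related fields.
def check_relationships(row):
    issues = []
    # Revenue should equal price × quantity (small float tolerance).
    if abs(row["price"] * row["quantity"] - row["revenue"]) > 0.01:
        issues.append("revenue != price * quantity")
    # Assumed convention: email begins with firstname.lastname.
    expected = f'{row["firstName"]}.{row["lastName"]}'.lower()
    if not row["email"].lower().startswith(expected):
        issues.append("email not derived from name")
    return issues

row = {"firstName": "John", "lastName": "Smith",
       "email": "john.smith@example.com",
       "price": 19.99, "quantity": 3, "revenue": 59.97}
print(check_relationships(row))  # → []
```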
3. Data Distribution
Use preview features to check:
- Sort by field: Check for proper data spread (not all same values)
- Filter/search: Verify specific values exist
- Pagination: Spot-check records throughout dataset
📄 Format Testing
CSV Format Validation
Test Steps:
- Generate 100 rows with multiple field types
- Download CSV file
- Open in Excel or text editor
- Verify headers are present in first row
- Check for proper comma separation
- Verify text fields with commas are quoted
- Import into database or analytics tool
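The header, separator, and quoting checks can be automated with Python's built-in `csv` module. A minimal sketch; the sample content and header names are illustrative:

```python
import csv, io

def validate_csv(text, expected_headers):
    """Check header row, then confirm every row has the right column count."""
    reader = csv.reader(io.StringIO(text))
    headers = next(reader)
    assert headers == expected_headers, f"unexpected headers: {headers}"
    for i, row in enumerate(reader, start=2):
        assert len(row) == len(headers), f"line {i}: wrong column count"
    return True

# A text field containing a comma must be quoted and round-trip intact.
sample = 'id,company\n1,"Acme, Inc."\n'
assert validate_csv(sample, ["id", "company"])
rows = list(csv.reader(io.StringIO(sample)))
print(rows[1])  # → ['1', 'Acme, Inc.']
```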
NDJSON Format Validation
Test Steps:
- Generate data and download NDJSON
- Open in text editor
- Verify each line is valid JSON object
- Check no commas between lines
- Test parsing with `jq . file.ndjson`
- Verify field names are consistent across lines
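If `jq` isn't available, the per-line checks can be scripted with Python's standard library; the sample lines below are illustrative:

```python
import json

def validate_ndjson(text):
    """Each line must parse as a JSON object with a consistent key set."""
    keys = None
    for i, line in enumerate(text.splitlines(), start=1):
        obj = json.loads(line)  # raises JSONDecodeError on bad JSON
        assert isinstance(obj, dict), f"line {i}: not a JSON object"
        if keys is None:
            keys = set(obj)
        assert set(obj) == keys, f"line {i}: inconsistent field names"
    return keys

sample = '{"id": 1, "name": "Ana"}\n{"id": 2, "name": "Ben"}'
print(sorted(validate_ndjson(sample)))  # → ['id', 'name']
```

Note that a trailing comma after a line (valid in a JSON array, invalid in NDJSON) will correctly fail this check.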
Parquet Format Validation
Test Steps:
- Generate data and download Parquet file
- Check file size (should be 50-90% smaller than CSV)
- Upload to S3 for Athena testing
- Query with Athena to verify schema
- Check data types are preserved
- Verify compression is working
Athena Test Query:
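For example, a query along these lines confirms the inferred schema and that numeric/date types survived the Parquet round-trip; `test_data` and its columns are placeholder names for your own table:

```sql
-- Inspect the inferred schema first.
DESCRIBE test_data;

-- Then spot-check types and row count: aggregates on price and
-- created_at only work if the columns kept numeric/date types.
SELECT COUNT(*)        AS row_count,
       MIN(price)      AS min_price,
       MAX(created_at) AS max_date
FROM test_data;
```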
☁️ S3 Upload Testing
Pre-Upload Checklist
S3 Upload Test Workflow
Phase 1: Single File Test
- Generate: Create 10 rows with date field
- Configure: Set S3 credentials and bucket
- Test Path: Use simple path like "test/data/"
- Upload: Click "Upload to S3"
- Verify: Check file appears in S3 console
- Download: Download from S3 and verify content
Phase 2: Partitioning Test
- Enable Dates: Use random dates spanning 3 months
- Set Path: Use "data/year=yyyy/month=mm/"
- Enable Splits: Check "Split by Date" (Monthly)
- Upload: Generate and upload
- Verify: Check multiple month directories created
- Count: Verify files split correctly by month
Phase 3: Field-Based Partitioning Test
- Add Fields: Include "category" or "country"
- Set Path: Use "data/category={{category}}/"
- Enable Splits: Check "Split by Fields", select category
- Upload: Generate and upload
- Verify: Check directories per category value
- Validate: Files contain only their category's data
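The final validation step can be sketched programmatically: every file under a `category=<value>/` prefix should contain only rows with that category. The paths and rows below are illustrative stand-ins for files downloaded back from S3:

```python
# Map of uploaded object key -> parsed rows (illustrative sample data).
uploaded = {
    "data/category=books/part-0.csv": [{"category": "books"}],
    "data/category=games/part-0.csv": [{"category": "games"},
                                       {"category": "games"}],
}

def partition_errors(files):
    """Return the keys of any files containing a mismatched category row."""
    errors = []
    for path, rows in files.items():
        expected = path.split("category=")[1].split("/")[0]
        errors += [path for row in rows if row["category"] != expected]
    return errors

print(partition_errors(uploaded))  # → []
```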
Troubleshooting S3 Uploads
| Error | Likely Cause | Solution |
|---|---|---|
| Access Denied | IAM permissions issue | Verify IAM policy has PutObject, check bucket name in ARN |
| CORS Error | CORS not configured | Add CORS policy to bucket (see help guide) |
| Invalid Credentials | Wrong access key or secret | Double-check credentials, try regenerating |
| Bucket Not Found | Typo or wrong region | Verify exact bucket name and region |
| Slow Upload | Network or large file | Use file splitting, check connection speed |
🔢 Deterministic ID Testing
Cross-Dataset ID Validation
Test Scenario:
Verify the same person gets the same ID in two different datasets
Steps:
- Dataset 1 - Employees:
- Select: id, firstName, lastName, email, department
- Enable deterministic IDs (Standard method)
- Generate 100 rows
- Download as employees.csv
- Dataset 2 - Customers:
- Select: id, firstName, lastName, email, country
- Enable deterministic IDs (Standard method)
- Generate 100 rows
- Download as customers.csv
- Verification:
- Load both files into SQL database or spreadsheet
- Find common names (e.g., "John Smith")
- Verify they have identical IDs
- Test JOIN query works correctly
SQL Test Query:
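For example, assuming both files were loaded into tables named `employees` and `customers`:

```sql
-- With deterministic IDs, a pure id JOIN should never pair up
-- two different people. An empty result means the IDs line up.
SELECT e.id, e.firstName, e.lastName, c.country
FROM employees e
JOIN customers c ON e.id = c.id
WHERE e.firstName <> c.firstName
   OR e.lastName  <> c.lastName;
```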
ID Calculator Verification
- Generate dataset with deterministic IDs enabled
- Pick a specific person from the data (e.g., ID 42857)
- Open ID Calculator tool
- Enter their firstName, lastName, email
- Use same method (Standard)
- Click "Calculate ID"
- Verify calculated ID matches dataset ID
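The tool's actual hashing algorithm isn't documented here, but the principle being verified can be illustrated: a deterministic ID is a pure function of the identity fields, so identical inputs always yield the identical ID. A hypothetical sketch (this is NOT the app's real method):

```python
import hashlib

def deterministic_id(first, last, email, digits=5):
    """Illustrative only: derive a stable numeric ID from identity fields.
    Hypothetical stand-in for the app's undocumented algorithm."""
    key = f"{first}|{last}|{email}".lower()
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % 10**digits

a = deterministic_id("John", "Smith", "john.smith@example.com")
b = deterministic_id("John", "Smith", "john.smith@example.com")
print(a == b)  # → True: same inputs, same ID, across any dataset
```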
ID Method Comparison Test
| Method | Test Dataset Size | Expected Uniqueness |
|---|---|---|
| Basic | 1,000 rows | Should see some ID collisions |
| Standard | 50,000 rows | Very few or no collisions |
| Enhanced | 500,000 rows | Virtually no collisions |
⚡ Performance Testing
Generation Speed Benchmarks
| Row Count | Expected Time | Test Result |
|---|---|---|
| 100 | < 1 second | ✅ Pass / ❌ Fail |
| 1,000 | 1-2 seconds | ✅ Pass / ❌ Fail |
| 10,000 | 5-10 seconds | ✅ Pass / ❌ Fail |
| 100,000 | 10-30 seconds | ✅ Pass / ❌ Fail |
| 500,000 | 30-60 seconds | ✅ Pass / ❌ Fail |
Performance Tips:
- Close unnecessary browser tabs before large generations
- Use Chrome or Edge for best performance
- Fewer fields = slightly faster generation
- Large datasets may show "not responding" - this is normal
File Size Comparison Test
Generate the same 10,000-row dataset in all three formats and compare sizes:
| Format | Expected Size | Your Result | Size vs CSV |
|---|---|---|---|
| CSV | ~2-3 MB | ___ MB | Baseline (1.0x) |
| NDJSON | ~3-4 MB | ___ MB | ~1.2-1.5x CSV |
| Parquet | ~300-500 KB | ___ KB | ~0.1-0.2x CSV |
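The ratio column is just each measured size divided by the CSV size. A quick sketch with made-up example sizes (replace with your own measurements in bytes):

```python
# Example measured sizes in bytes; substitute your real values.
sizes = {"csv": 2_500_000, "ndjson": 3_400_000, "parquet": 400_000}

# Ratio of each format's size to the CSV baseline.
ratios = {fmt: round(size / sizes["csv"], 2) for fmt, size in sizes.items()}
print(ratios)  # → {'csv': 1.0, 'ndjson': 1.36, 'parquet': 0.16}
```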
🔄 Batch Processing Testing
Batch Test Workflow
Phase 1: Small Batch Test (3 configs)
- Create 3 simple configurations:
- Config 1: 100 rows, CSV, no splits
- Config 2: 100 rows, NDJSON, split by date
- Config 3: 100 rows, Parquet, split by fields
- Enable "Pause before each upload" for verification
- Click "Batch Process All Configs"
- Review each dataset before clicking "Next"
- Verify all 3 configs complete successfully
- Check S3 for all files if uploading
Phase 2: Large Batch Test (12 configs)
- Click "Load Built-in Presets" to get all 12 optimized presets
- Review each preset configuration
- Disable pause mode for automatic processing
- Click "Batch Process All Configs" (Generate Only)
- Monitor progress through all 12 datasets
- Review final summary for success/failed counts
- Verify expected number of files generated
Batch Processing Validation
🔗 Integration Testing
Athena Integration Test
- Generate & Upload: Create dataset with date field, split by date, upload to S3
- Create Table: Define external table in Athena
- Discover Partitions: Run MSCK REPAIR TABLE
- Query Test: Run SELECT with partition filter
- Verify Performance: Check "Data scanned" in query stats
Athena Test Script:
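A sketch of the full sequence; the table name, column list, and bucket path are all placeholders to adapt to your dataset:

```sql
-- 1. Create the external table over the partitioned S3 prefix.
CREATE EXTERNAL TABLE test_events (
    id    BIGINT,
    name  STRING,
    price DOUBLE
)
PARTITIONED BY (year STRING, month STRING)
STORED AS PARQUET
LOCATION 's3://your-bucket/data/';

-- 2. Discover the year=/month= partition directories.
MSCK REPAIR TABLE test_events;

-- 3. Partition-filtered query: "Data scanned" in the query stats
--    should cover only the one month, not the whole dataset.
SELECT COUNT(*)
FROM test_events
WHERE year = '2024' AND month = '03';
```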
Redshift Integration Test
- Generate: Create Parquet files with S3 partitioning
- Upload: Use S3 upload with proper directory structure
- Create Table: Define Redshift table schema
- COPY Command: Load data from S3 using COPY
- Verify: Query data and check row counts
Redshift Test Script:
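A sketch, with placeholder table name, bucket path, IAM role ARN, and columns; match the schema to your generated fields:

```sql
-- 1. Define a table matching the generated schema.
CREATE TABLE test_events (
    id    BIGINT,
    name  VARCHAR(100),
    price DOUBLE PRECISION
);

-- 2. Load the Parquet files directly from S3.
COPY test_events
FROM 's3://your-bucket/data/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftCopyRole'
FORMAT AS PARQUET;

-- 3. Compare against the number of rows you generated.
SELECT COUNT(*) FROM test_events;
```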
💡 Testing Best Practices
General Testing Principles
- Start Small: Always test with 10-100 rows first
- Incremental Testing: Test each feature separately before combining
- Document Results: Keep notes on what works and what doesn't
- Version Control: Export configurations as backups
- Automation: Use batch processing for repetitive tests
Common Testing Mistakes to Avoid
- Generating 100K+ rows without testing small sample first
- Not verifying S3 credentials before large uploads
- Forgetting to enable "Use random dates" for time-series data
- Using wrong IAM permissions (missing PutObject)
- Not checking CORS configuration before S3 uploads
- Mixing deterministic ID methods across related datasets
- Not validating Parquet files actually load in Athena
Test Data Cleanup
- Delete test files from S3 to avoid storage costs
- Remove test tables from Athena/Redshift
- Clear browser downloads folder of test files
- Remove unused configurations to keep presets organized