📑 Table of Contents
- 🎯 Overview & Key Features
- 🆕 What's New in v2.9.2
- 🚀 Getting Started
- 📂 Import Data
- 🎯 Field Selection
- ⚙️ Generation Controls
- 📅 Date Options
- 🔢 Deterministic IDs
- 🧮 ID Calculator
- 📄 Output Format Options
- ⬇️ Download Files
- ☁️ AWS S3 Upload
- ✂️ Split Files Feature
- 🔄 Batch Processing
- 💾 Configuration Management
- 📋 Console Logging
- 🔧 Troubleshooting
🎯 Overview & Key Features
CSV Generator Pro is a powerful, browser-based tool for generating realistic test data with advanced features for data engineering, analytics testing, and development workflows.
Core Capabilities
- 41+ Field Types - Generate diverse data including names, addresses, products, transactions, and more
- Multi-Format Support - Export as CSV, NDJSON, or Parquet files
- Import & Convert - Load existing data files and convert between formats with auto-field detection
- Deterministic IDs - Generate consistent IDs for SQL joins across datasets
- AWS S3 Integration - Direct upload with dynamic path templating and Hive-style partitioning
- File Splitting - Automatic split by date fields or custom field values
- Batch Processing - Process multiple configurations automatically
- Configuration Management - Save and load presets with 12 built-in templates
🆕 What's New in v2.9.2
Auto-Add Unknown Fields - A new checkbox in the Import Data section automatically adds custom fields from imported files. Enable it to add any fields not in the standard list, or disable it to reject files with incompatible fields.
🚀 Getting Started
Quick Start (30 seconds)
- Open csv-generator-pro.html in your browser
- Select the fields you want (or click "Select Common")
- Set number of rows (default: 1000)
- Click "Generate Data"
- Click "Download CSV" to save the file
📂 Import Data UPDATED v2.9.2
Import existing CSV, NDJSON, JSON, or Parquet files to convert formats, modify data, or extend datasets.
Supported Formats
- CSV - Comma-separated values with header row
- NDJSON/JSON - Newline-delimited JSON (one object per line)
- Parquet - Apache Parquet columnar format
Auto-Add Unknown Fields Feature NEW
🔧 How It Works
The "Auto-add unknown fields" checkbox (enabled by default) controls how the import handles fields not in the standard field list:
✅ When Enabled (Checked)
- Automatically adds any custom fields from your file to the available fields list
- Added fields appear in the field selection grid
- Custom fields are treated as text/string data
- Fields persist until you refresh the page
- Perfect for importing data with company-specific or custom field names
❌ When Disabled (Unchecked)
- Rejects files containing fields not in the standard list
- Shows detailed error message with incompatible field names
- Provides list of available fields for reference
- Suggests enabling auto-add or modifying the file
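Conceptually, the check behaves like this sketch (an assumed TypeScript reconstruction; validateFields and its shape are hypothetical, not the tool's source):

```typescript
// Hypothetical sketch of the import-time field check (not the tool's source).
function validateFields(
  fileFields: string[],
  knownFields: Set<string>,
  autoAdd: boolean
): { ok: boolean; unknown: string[] } {
  const unknown = fileFields.filter((f) => !knownFields.has(f));
  if (unknown.length === 0) return { ok: true, unknown };
  if (autoAdd) {
    // Unknown fields join the list as plain text fields until the page is refreshed.
    unknown.forEach((f) => knownFields.add(f));
    return { ok: true, unknown };
  }
  // With auto-add off, the whole file is rejected and the error lists these names.
  return { ok: false, unknown };
}
```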
Import Process
- Click "Choose File" in the Import Data section
- Select your CSV, NDJSON, JSON, or Parquet file
- If using custom fields, ensure "Auto-add unknown fields" is checked
- The file will be validated and loaded automatically
- Fields from the file will be auto-selected in the grid
- Review the data in the preview table
- Choose output format and download or upload to S3
Example: Suppose you import a CSV containing the custom fields customer_id, order_total, and shipping_method. With auto-add enabled, these fields are automatically added to the selection grid, and you can export the data to CSV for Excel analysis.
Field Validation
The import system validates all field names and provides detailed feedback:
- Compatible fields - Standard fields that exist in the default list
- Custom fields - Fields automatically added when auto-add is enabled
- Incompatible fields - Fields rejected when auto-add is disabled
Notes:
- Custom fields added during import will not have data generation capabilities
- Custom fields are treated as text values from the imported data
- Refresh the page to reset to default fields only
- Large Parquet files (100k+ rows) may take a few seconds to parse
🎯 Field Selection
Choose from 41+ field types to customize your dataset:
Available Fields by Category
| Category | Fields |
|---|---|
| Identifiers | id, uuid, username |
| Personal Info | firstName, lastName, name, email, phone, age, gender |
| Location | address, city, state, country, zipCode, latitude, longitude |
| Business | company, jobTitle, department, salary, hireDate, startDate |
| Products | product, sku, category, subcategory, price, quantity, revenue |
| Dates & Times | date, timestamp, hireDate, startDate |
| Status & Ratings | status, priority, rating, boolean, isActive, score |
| Text Fields | description, notes, url, website, ipAddress |
Selection Controls
- Select All - Selects all 41+ fields
- Deselect All - Clears all selections
- Select Common - Selects frequently used fields (id, name, email, phone, city, country, date, status)
⚙️ Generation Controls
Number of Rows
Set how many records to generate (1 - 1,000,000). Recommended limits:
- 1-10,000 - Quick testing and development
- 10,000-100,000 - Medium-scale testing
- 100,000-500,000 - Performance testing and analytics
- 500,000+ - Large-scale data warehouse simulation
Filename
Specify the base filename for downloads and S3 uploads. Extensions are added automatically based on output format.
Example: customer_data will create customer_data.csv, customer_data.ndjson, or customer_data.parquet depending on your format selection.
📅 Date Options
Control how date fields are generated in your dataset.
Fixed Date
- All date fields use the same date
- Useful for daily snapshots
- Default: Today's date
Random Dates
- Each record gets a random date within your specified range
- Perfect for historical data and time-series analysis
- Set start and end dates to define the range
- Ideal for creating partitioned data sets
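Picking a random date in a range reduces to interpolating between the two timestamps; a minimal sketch of the assumed approach:

```typescript
// Uniformly random date between start and end.
function randomDate(start: Date, end: Date): Date {
  return new Date(start.getTime() + Math.random() * (end.getTime() - start.getTime()));
}

// e.g. a date somewhere in 2023-2025 for a partitioned time-series dataset:
console.log(randomDate(new Date("2023-01-01"), new Date("2025-12-31")));
```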
🔢 Deterministic IDs
Generate consistent, predictable IDs that enable SQL joins across multiple datasets.
How It Works
Deterministic IDs are created by hashing specific field combinations:
- Basic Method - Uses: firstName + lastName
- Standard Method - Uses: firstName + lastName + email
- Enhanced Method - Uses: firstName + lastName + email + date
- Auto Method - Automatically selects the best available fields
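A minimal sketch of the idea, assuming an FNV-1a string hash (the tool's actual hash function isn't documented, so fnv1a and deterministicId are illustrative only):

```typescript
// FNV-1a: a simple, stable 32-bit string hash.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Standard method: firstName + lastName + email, mapped into the ID range.
function deterministicId(
  firstName: string,
  lastName: string,
  email: string,
  maxId = 10_000_000
): number {
  const key = `${firstName}|${lastName}|${email}`.toLowerCase();
  return (fnv1a(key) % maxId) + 1;
}
```

Because the ID depends only on the input fields, the same person hashes to the same ID in every generated dataset, which is what makes cross-table joins possible.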
Use Cases
- Generate "customers.csv" with 1000 records using Standard method
- Generate "orders.csv" with 5000 records using same Standard method
- The same customer (e.g., "John Smith" with "john.smith@example.com") gets the same ID in both tables
- You can now JOIN the tables on the ID field in SQL
Configuration
- Enable Deterministic IDs - Toggle this checkbox to activate
- Method Selection - Choose which fields to use for ID generation
- ID Range - Set the maximum ID value (default: 10,000,000)
🧮 ID Calculator
Calculate deterministic IDs without needing the original dataset - perfect for reverse lookups and verification.
Using the Calculator
- Click "Open ID Calculator" button
- Select the method that was used to generate your data
- Fill in the required fields (form adapts to method selected)
- Click "Calculate ID" to get the deterministic ID
Example Scenario
Problem: You generated customer data last week but need to know what ID was assigned to "Sarah Johnson" with email "sarah.j@example.com"
Solution: Open the ID Calculator, select "Standard" method, enter "Sarah", "Johnson", and "sarah.j@example.com", and click Calculate to instantly get her ID.
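In terms of the hypothetical sketch from the Deterministic IDs section, the calculator simply re-runs the same computation:

```typescript
// Same inputs + same method = the same ID as last week's dataset.
const sarahsId = deterministicId("Sarah", "Johnson", "sarah.j@example.com");
```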
📄 Output Format Options
CSV (Comma-Separated Values)
- Universal compatibility with Excel, databases, and analytics tools
- Human-readable text format
- Includes header row with field names
- Best for: General use, spreadsheet analysis, simple data exchange
NDJSON (Newline-Delimited JSON)
- One JSON object per line
- Streaming-friendly format
- Perfect for log analysis and event data
- Best for: APIs, log processing, streaming applications
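For example, two records in NDJSON form:

```
{"id": 1, "name": "John Smith", "email": "john.smith@example.com"}
{"id": 2, "name": "Sarah Johnson", "email": "sarah.j@example.com"}
```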
Parquet (Apache Parquet)
- Columnar storage format optimized for analytics
- Excellent compression ratios (typically 10-100x smaller than CSV)
- Fast query performance in Athena, Redshift, Spark
- Preserves data types and schema information
- Best for: Data warehouses, big data analytics, AWS Athena
⬇️ Download Files
Save generated data to your local computer.
Download Options
- Single File - Download all data in one file
- Multiple Files - When file splitting is enabled, downloads a ZIP containing all split files
Process
- Generate your data
- Select output format (CSV, NDJSON, or Parquet)
- Click "Download" button (specific to format selected)
- File(s) will be saved to your browser's download folder
☁️ AWS S3 Upload
Upload generated data directly to Amazon S3 with advanced path templating and partitioning.
Required Credentials
- S3 Bucket Name - Your bucket name (e.g., "my-data-bucket")
- AWS Region - Select from dropdown (e.g., us-east-1)
- Access Key ID - Your AWS access key
- Secret Access Key - Your AWS secret key
S3 Directory Path (Dynamic Templating)
Use placeholders to create dynamic, organized directory structures:
Date Placeholders
sales/year=yyyy/month=mm/day=dd/
└── Hive-style partitioning: year=2024/month=11/day=02/
Field Placeholders
customers/country={{country}}/status={{status}}/
└── Dynamic grouping: country=USA/status=active/
Combined Approach
transactions/category={{category}}/year=yyyy/month=mm/
└── Both field and date partitioning for optimal query performance
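As an illustration, resolving such a template per record could look like the following sketch (resolvePath is hypothetical; the tool's internals aren't shown):

```typescript
type Row = Record<string, string>;

// Substitute {{field}} placeholders from the row and yyyy/mm/dd from a date.
function resolvePath(template: string, row: Row, date: Date): string {
  return template
    .replace(/\{\{(\w+)\}\}/g, (_m: string, field: string) => row[field] ?? "unknown")
    .replace(/yyyy/g, String(date.getFullYear()))
    .replace(/mm/g, String(date.getMonth() + 1).padStart(2, "0"))
    .replace(/dd/g, String(date.getDate()).padStart(2, "0"));
}

// Prints "transactions/category=Electronics/year=2024/month=11/":
console.log(resolvePath(
  "transactions/category={{category}}/year=yyyy/month=mm/",
  { category: "Electronics" },
  new Date(2024, 10, 2)
));
```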
Upload Modes
- Quick Upload - Upload all files at once
- Step-by-Step - Upload one file at a time with manual control
Testing & Debugging
- Test CORS - Verify your bucket's CORS configuration before uploading
- Enable Console - See detailed upload logs and error messages
✂️ Split Files Feature
Automatically split your dataset into multiple files based on date or field values - perfect for partitioned data lakes.
Split by Date
- Creates separate files for each unique date in your dataset
- Requires date placeholders in S3 directory path (yyyy/mm/dd)
- Must use "Random Dates" option to generate multiple date values
- Perfect for daily/monthly/yearly partitions
Split by Fields
- Creates separate files for each unique combination of field values
- Requires field placeholders in S3 path ({{fieldname}})
- Example: Split by country and status creates files for each country/status combo
- Ideal for Hive-style partitioning in data warehouses
Configuration:
- S3 Path: sales/category={{category}}/year=yyyy/month=mm/
- Split by Fields: ✓ Enabled
- Split by Date: ✓ Enabled
- Random Dates: 2023-01-01 to 2025-12-31
Result: files are organized like:

```
sales/
├── category=Electronics/year=2024/month=11/sales.csv
├── category=Electronics/year=2024/month=10/sales.csv
├── category=Clothing/year=2024/month=11/sales.csv
└── category=Clothing/year=2024/month=10/sales.csv
```
Validation
The tool validates your configuration:
- Warns if "Split by Date" is enabled but no date placeholders found
- Warns if "Split by Fields" is enabled but no field placeholders found
- Shows preview of files to be created before upload
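Splitting then amounts to grouping rows by their resolved path; a sketch that builds on the hypothetical resolvePath above:

```typescript
// Group rows so each distinct resolved path becomes one output file.
function groupByPath(rows: Row[], template: string, dateField = "date"): Map<string, Row[]> {
  const groups = new Map<string, Row[]>();
  for (const row of rows) {
    const key = resolvePath(template, row, new Date(row[dateField]));
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key)!.push(row);
  }
  return groups;
}
```

Each entry in the resulting map becomes one file, e.g. sales/category=Electronics/year=2024/month=11/sales.csv.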
🔄 Batch Processing
Automatically process multiple configurations in sequence - perfect for populating entire data warehouses.
How It Works
- Create and save multiple configurations (e.g., "Customers", "Orders", "Products")
- Fill in your S3 credentials once
- Click "Start Batch Upload"
- The tool automatically:
- Loads each configuration
- Generates the data
- Uploads to S3 using each config's settings
- Moves to the next configuration
Pause for Confirmation Mode
- Enable this option to review data before each upload
- Batch process pauses after generating each dataset
- Click "Continue" to proceed with upload or "Stop Batch" to cancel
- Perfect for validating data quality before pushing to S3
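In pseudocode terms the batch run behaves roughly like this (every name below is a placeholder, not the tool's actual API):

```typescript
type Config = { name: string; [setting: string]: unknown };
declare function generateData(config: Config): Row[];           // placeholder
declare function confirmUpload(name: string): Promise<boolean>; // "Continue" / "Stop Batch"
declare function uploadToS3(rows: Row[], config: Config): Promise<void>; // placeholder

async function runBatch(configs: Config[], pauseForConfirmation: boolean) {
  for (const config of configs) {
    const rows = generateData(config); // uses this config's fields, rows, dates, IDs
    if (pauseForConfirmation && !(await confirmUpload(config.name))) return;
    await uploadToS3(rows, config);    // each config's own S3 path and filename
  }
}
```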
Progress Tracking
- Real-time status updates for each configuration
- Success/Failed/Skipped counters
- Detailed console logs for troubleshooting
- Final summary report when complete
💾 Configuration Management
Save your field selections, settings, and S3 configurations for reuse.
Built-in Presets
12 professionally designed presets ready to use:
- Customer Contact List (500 rows)
- Employee Directory (1,000 rows)
- Sales Transaction Log (100,000 rows)
- Product Inventory (2,500 rows)
- Marketing Campaign Leads (5,000 rows)
- User Registration Data (50,000 rows)
- IT Asset Management (1,000 rows)
- E-commerce Orders (100,000 rows)
- Customer Support Tickets (25,000 rows)
- Financial Transactions (100,000 rows)
- Event Registration List (10,000 rows)
- Website Analytics Sample (500,000 rows)
Creating Custom Configurations
- Select your desired fields
- Configure generation settings (row count, dates, IDs, etc.)
- Set S3 path and filename (optional)
- Click "Save Config"
- Enter a name for your configuration
- Configuration is saved to browser localStorage
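Under the hood this is plain browser localStorage; a sketch with a hypothetical key name:

```typescript
// Save: presets persist across sessions in localStorage (key name is hypothetical).
const configs = [{ name: "Customers", rows: 1000 }];
localStorage.setItem("csvgen-configs", JSON.stringify(configs));

// Load: read them back on startup or when you click "Load".
const restored = JSON.parse(localStorage.getItem("csvgen-configs") ?? "[]");
```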
Managing Configurations
- Load - Select from dropdown and click "Load"
- Save - Update the currently loaded configuration
- Save As - Save current settings with a new name
- Delete - Remove the currently selected configuration
- Clear All - Delete all saved configurations (with confirmation)
Import/Export Configurations
- Export All - Download all configurations as a JSON file
- Import - Load configurations from a JSON file
- Share configurations with team members
- Backup and restore your presets
📋 Console Logging
Real-time visibility into all operations with a professional terminal-style console.
What Gets Logged
- Data generation progress and statistics
- File import and parsing details
- Field validation and auto-add operations
- File grouping and splitting logic
- S3 upload preparation and execution
- Success/failure status with detailed error messages
- CORS configuration issues and solutions
- Batch processing progress and results
Console Features
- Color-coded messages - Green for success, red for errors, yellow for warnings, blue for info
- Timestamps - Track when operations occurred
- Clear button - Start fresh when needed
- Auto-scroll - Follows new messages automatically
🔧 Troubleshooting
Import Issues
Problem: Import is rejected with an "incompatible fields" error.
Solution: Enable the "Auto-add unknown fields" checkbox before importing the file. Custom fields will be automatically added to the selection grid.
Problem: A Parquet file fails to load or appears stuck.
Solution: Ensure the file is a valid Apache Parquet file. Large files may take 5-10 seconds to parse. Check the console log for specific errors.
S3 Upload Issues
Problem: Uploads fail with CORS errors in the browser console.
Solution:
- Click "Test CORS" button to verify bucket configuration
- Add the following CORS configuration to your S3 bucket:
```json
[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["PUT", "POST", "GET"],
    "AllowedOrigins": ["*"],
    "ExposeHeaders": ["ETag"]
  }
]
```
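You can apply this in the S3 console (bucket Permissions tab, "Cross-origin resource sharing" section) or with the AWS CLI: aws s3api put-bucket-cors --bucket your-bucket --cors-configuration file://cors.json (your-bucket and cors.json are placeholders).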
Problem: Uploads fail with authentication or access-denied errors.
Solution: Verify your Access Key ID and Secret Access Key are correct. Ensure there are no extra spaces when pasting credentials.
File Splitting Issues
Problem: "Split by Date" produces only a single file.
Solution:
- Ensure "Random Dates" is enabled (not Fixed Date)
- Verify your S3 path contains date placeholders (yyyy, mm, dd)
- Check "Split by Date" checkbox is enabled
Problem: "Split by Fields" isn't creating separate files per value.
Solution:
- Verify your S3 path contains field placeholders like {{country}}
- Ensure the field name in {{}} matches exactly (case-sensitive)
- Check "Split by Fields" checkbox is enabled
Performance Issues
Problem: Generating very large datasets (500,000+ rows) is slow or the browser appears to freeze.
- Generation may take 10-30 seconds depending on your computer
- Browser may show "page unresponsive" warning - click "Wait"
- Consider generating in smaller batches and combining later
- Parquet format generates faster than CSV for large datasets
Configuration Issues
Problem: Saved configurations have disappeared.
Solution: Configurations are stored in browser localStorage and should persist. If they're missing:
- Check if you accidentally clicked "Clear All"
- Browser privacy/incognito mode doesn't save localStorage
- Export your configurations regularly as backup
General Tips
- Enable Console Logging - Provides detailed information about what's happening
- Test with small datasets first - Generate 100 rows before trying 100,000
- Use built-in presets - Start with working configurations and modify as needed
- Check browser console - Press F12 to see JavaScript errors if something fails