📊 CSV Generator Pro

Complete Help Documentation

Version 2.9.2

🎯 Overview & Key Features

CSV Generator Pro is a powerful, browser-based tool for generating realistic test data with advanced features for data engineering, analytics testing, and development workflows.

Core Capabilities

  • 41+ Field Types - Generate diverse data including names, addresses, products, transactions, and more
  • Multi-Format Support - Export as CSV, NDJSON, or Parquet files
  • Import & Convert - Load existing data files and convert between formats with auto-field detection
  • Deterministic IDs - Generate consistent IDs for SQL joins across datasets
  • AWS S3 Integration - Direct upload with dynamic path templating and Hive-style partitioning
  • File Splitting - Automatic split by date fields or custom field values
  • Batch Processing - Process multiple configurations automatically
  • Configuration Management - Save and load presets with 12 built-in templates

🆕 What's New in v2.9.2

Auto-Add Unknown Fields - A new checkbox in the Import Data section automatically adds custom fields from imported files. Enable it to add any fields not in the standard list, or disable it to reject files that contain incompatible fields.

🚀 Getting Started

Quick Start (30 seconds)

  1. Open csv-generator-pro.html in your browser
  2. Select fields you want (or click "Select Common")
  3. Set number of rows (default: 1000)
  4. Click "Generate Data"
  5. Click "Download CSV" to save the file
💡 Pro Tip: Start with one of the 12 built-in presets by clicking the configuration dropdown and selecting a template like "Customer Contact List" or "Sales Transaction Log".

📂 Import Data UPDATED v2.9.2

Import existing CSV, NDJSON, JSON, or Parquet files to convert formats, modify data, or extend datasets.

Supported Formats

  • CSV - Comma-separated values with header row
  • NDJSON/JSON - Newline-delimited JSON (one object per line)
  • Parquet - Apache Parquet columnar format

Auto-Add Unknown Fields Feature NEW

🔧 How It Works

The "Auto-add unknown fields" checkbox (enabled by default) controls how the import handles fields not in the standard field list:

✅ When Enabled (Checked)

  • Automatically adds any custom fields from your file to the available fields list
  • Added fields appear in the field selection grid
  • Custom fields are treated as text/string data
  • Fields persist until you refresh the page
  • Perfect for importing data with company-specific or custom field names

❌ When Disabled (Unchecked)

  • Rejects files containing fields not in the standard list
  • Shows detailed error message with incompatible field names
  • Provides list of available fields for reference
  • Suggests enabling auto-add or modifying the file

Import Process

  1. Click "Choose File" in the Import Data section
  2. Select your CSV, NDJSON, JSON, or Parquet file
  3. If using custom fields, ensure "Auto-add unknown fields" is checked
  4. The file will be validated and loaded automatically
  5. Fields from the file will be auto-selected in the grid
  6. Review the data in the preview table
  7. Choose output format and download or upload to S3
💡 Example Use Case: Import a Parquet file from your data warehouse with custom fields like customer_id, order_total, and shipping_method. With auto-add enabled, these fields will be automatically added to the selection grid and you can export to CSV for Excel analysis.

Field Validation

The import system validates all field names and provides detailed feedback:

  • Compatible fields - Standard fields that exist in the default list
  • Custom fields - Fields automatically added when auto-add is enabled
  • Incompatible fields - Fields rejected when auto-add is disabled
⚠️ Important Notes:
  • Custom fields added during import will not have data generation capabilities
  • Custom fields are treated as text values from the imported data
  • Refresh the page to reset to default fields only
  • Large Parquet files (100k+ rows) may take a few seconds to parse
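
To make the validation step concrete, the sketch below shows one way imported field names could be classified. It is illustrative only: the STANDARD_FIELDS set is abbreviated and the function is not the tool's actual code.

// Illustrative sketch: classify imported field names against the standard field list.
// STANDARD_FIELDS is abbreviated here; the real tool ships 41+ field types.
const STANDARD_FIELDS = new Set([
  "id", "uuid", "name", "email", "phone", "city", "country", "date", "status",
]);

interface FieldReport {
  compatible: string[];   // already in the standard list
  custom: string[];       // added to the grid when auto-add is enabled
  incompatible: string[]; // cause the import to be rejected when auto-add is disabled
}

function classifyImportedFields(imported: string[], autoAdd: boolean): FieldReport {
  const report: FieldReport = { compatible: [], custom: [], incompatible: [] };
  for (const field of imported) {
    if (STANDARD_FIELDS.has(field)) report.compatible.push(field);
    else if (autoAdd) report.custom.push(field);   // treated as text/string data
    else report.incompatible.push(field);          // listed in the error message
  }
  return report;
}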

🎯 Field Selection

Choose from 41+ field types to customize your dataset:

Available Fields by Category

  • Identifiers - id, uuid, username
  • Personal Info - firstName, lastName, name, email, phone, age, gender
  • Location - address, city, state, country, zipCode, latitude, longitude
  • Business - company, jobTitle, department, salary, hireDate, startDate
  • Products - product, sku, category, subcategory, price, quantity, revenue
  • Dates & Times - date, timestamp, hireDate, startDate
  • Status & Ratings - status, priority, rating, boolean, isActive, score
  • Text Fields - description, notes, url, website, ipAddress

Selection Controls

  • Select All - Selects all 41+ fields
  • Deselect All - Clears all selections
  • Select Common - Selects frequently used fields (id, name, email, phone, city, country, date, status)
💡 Tip: Click a field's label, its checkbox, or the card itself to toggle selection.

⚙️ Generation Controls

Number of Rows

Set how many records to generate (1 to 1,000,000). Recommended ranges:

  • 1-10,000 - Quick testing and development
  • 10,000-100,000 - Medium-scale testing
  • 100,000-500,000 - Performance testing and analytics
  • 500,000+ - Large-scale data warehouse simulation

Filename

Specify the base filename for downloads and S3 uploads. Extensions are added automatically based on output format.

Example: Setting filename to customer_data will create customer_data.csv, customer_data.ndjson, or customer_data.parquet depending on your format selection.

📅 Date Options

Control how date fields are generated in your dataset.

Fixed Date

  • All date fields use the same date
  • Useful for daily snapshots
  • Default: Today's date

Random Dates

  • Each record gets a random date within your specified range
  • Perfect for historical data and time-series analysis
  • Set start and end dates to define the range
  • Ideal for creating partitioned data sets
⚠️ Note: If you select "Split by Date" for S3 uploads, you must use "Random Dates" to generate multiple date values for proper file splitting.
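
For reference, drawing a uniformly random date between the two bounds can be as simple as the sketch below (an assumption about the approach, not the tool's actual code).

// Pick a random timestamp between start and end, then convert it back to a Date.
function randomDateInRange(start: Date, end: Date): Date {
  const t = start.getTime() + Math.random() * (end.getTime() - start.getTime());
  return new Date(t);
}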

🔢 Deterministic IDs

Generate consistent, predictable IDs that enable SQL joins across multiple datasets.

How It Works

Deterministic IDs are created by hashing specific field combinations:

  • Basic Method - Uses: firstName + lastName
  • Standard Method - Uses: firstName + lastName + email
  • Enhanced Method - Uses: firstName + lastName + email + date
  • Auto Method - Automatically selects the best available fields
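
The sketch below illustrates the idea. The tool's exact hash function is not documented here; this example assumes a simple 32-bit string hash and the default ID range of 10,000,000.

// Illustrative sketch only: lowercase and join the chosen fields, hash the result,
// and map the hash into the configured ID range.
function deterministicId(fields: string[], maxId = 10_000_000): number {
  const key = fields.map(f => f.trim().toLowerCase()).join("|"); // case-insensitive
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep within unsigned 32 bits
  }
  return (hash % maxId) + 1;
}

// Standard method (firstName + lastName + email):
const customerId = deterministicId(["Sarah", "Johnson", "sarah.j@example.com"]);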

Use Cases

Example: Multi-Table Relationships
  1. Generate "customers.csv" with 1000 records using Standard method
  2. Generate "orders.csv" with 5000 records using same Standard method
  3. The same customer (e.g., "John Smith" with "john.smith@example.com") gets the same ID in both tables
  4. You can now JOIN the tables on the ID field in SQL

Configuration

  • Enable Deterministic IDs - Toggle this checkbox to activate
  • Method Selection - Choose which fields to use for ID generation
  • ID Range - Set the maximum ID value (default: 10,000,000)
💡 Case-Insensitive: IDs are generated with lowercase normalization, so "Philip Larson" and "philip larson" produce the same ID.

🧮 ID Calculator

Calculate deterministic IDs without needing the original dataset - perfect for reverse lookups and verification.

Using the Calculator

  1. Click "Open ID Calculator" button
  2. Select the method that was used to generate your data
  3. Fill in the required fields (form adapts to method selected)
  4. Click "Calculate ID" to get the deterministic ID

Example Scenario

Problem: You generated customer data last week but need to know what ID was assigned to "Sarah Johnson" with email "sarah.j@example.com"

Solution: Open the ID Calculator, select "Standard" method, enter "Sarah", "Johnson", and "sarah.j@example.com", and click Calculate to instantly get her ID.

📄 Output Format Options

CSV (Comma-Separated Values)

  • Universal compatibility with Excel, databases, and analytics tools
  • Human-readable text format
  • Includes header row with field names
  • Best for: General use, spreadsheet analysis, simple data exchange

NDJSON (Newline-Delimited JSON)

  • One JSON object per line
  • Streaming-friendly format
  • Perfect for log analysis and event data
  • Best for: APIs, log processing, streaming applications

Parquet (Apache Parquet)

  • Columnar storage format optimized for analytics
  • Excellent compression ratios (typically 10-100x smaller than CSV)
  • Fast query performance in Athena, Redshift, Spark
  • Preserves data types and schema information
  • Best for: Data warehouses, big data analytics, AWS Athena
💡 Format Conversion: Import any format and export to any other format. Load a CSV, export as Parquet. Load Parquet, export as NDJSON. The tool handles all conversions seamlessly.
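
To show how simple the CSV-to-NDJSON direction is, here is a minimal sketch. It assumes an unquoted, comma-only CSV with a header row; the Parquet directions require a Parquet library and are omitted.

// Reshape CSV rows into one JSON object per line (NDJSON).
function csvToNdjson(csvText: string): string {
  const [headerLine, ...rows] = csvText.trim().split("\n");
  const headers = headerLine.split(",");
  return rows
    .map(row => {
      const values = row.split(",");
      // Pair each header with its value; missing trailing values become empty strings.
      const record = Object.fromEntries(headers.map((h, i) => [h, values[i] ?? ""]));
      return JSON.stringify(record);
    })
    .join("\n");
}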

⬇️ Download Files

Save generated data to your local computer.

Download Options

  • Single File - Download all data in one file
  • Multiple Files - When file splitting is enabled, downloads a ZIP containing all split files

Process

  1. Generate your data
  2. Select output format (CSV, NDJSON, or Parquet)
  3. Click "Download" button (specific to format selected)
  4. File(s) will be saved to your browser's download folder
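
For context, browser downloads of generated text are typically triggered with a Blob and a temporary object URL, roughly as sketched below (the tool's own code may differ).

// Create a Blob from the generated content and trigger a download via a temporary link.
function downloadFile(content: string, filename: string, mimeType = "text/csv"): void {
  const blob = new Blob([content], { type: mimeType });
  const url = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = url;
  link.download = filename;   // e.g. "customer_data.csv"
  link.click();
  URL.revokeObjectURL(url);   // release the temporary URL once the download starts
}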

☁️ AWS S3 Upload

Upload generated data directly to Amazon S3 with advanced path templating and partitioning.

Required Credentials

  • S3 Bucket Name - Your bucket name (e.g., "my-data-bucket")
  • AWS Region - Select from dropdown (e.g., us-east-1)
  • Access Key ID - Your AWS access key
  • Secret Access Key - Your AWS secret key
⚠️ Security Note: Credentials are stored only in your browser's memory for the current session. They are never sent anywhere except directly to AWS S3.
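
For reference, a direct browser-to-S3 upload with static credentials looks roughly like the sketch below, using the AWS SDK for JavaScript v3. The bucket, region, and key values are placeholders, and the tool's internal upload code is not shown here.

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Upload one generated file to S3 using the credentials entered in the form.
async function uploadToS3(body: string, key: string): Promise<void> {
  const client = new S3Client({
    region: "us-east-1",                          // AWS Region
    credentials: {
      accessKeyId: "YOUR_ACCESS_KEY_ID",          // Access Key ID (placeholder)
      secretAccessKey: "YOUR_SECRET_ACCESS_KEY",  // Secret Access Key (placeholder)
    },
  });
  await client.send(new PutObjectCommand({
    Bucket: "my-data-bucket",                     // S3 Bucket Name
    Key: key,                                     // e.g. "sales/year=2024/month=11/day=02/sales.csv"
    Body: body,
    ContentType: "text/csv",
  }));
}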

S3 Directory Path (Dynamic Templating)

Use placeholders to create dynamic, organized directory structures:

Date Placeholders

sales/year=yyyy/month=mm/day=dd/
└── Hive-style partitioning: year=2024/month=11/day=02/

Field Placeholders

customers/country={{country}}/status={{status}}/
└── Dynamic grouping: country=USA/status=active/

Combined Approach

transactions/category={{category}}/year=yyyy/month=mm/
└── Both field and date partitioning for optimal query performance
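
Conceptually, each output file's path is produced by substituting the date tokens and {{field}} tokens. The sketch below shows one way this could work; it assumes field names do not themselves contain the tokens yyyy, mm, or dd, and it is not the tool's actual parser.

// Resolve date tokens from the row's date and {{field}} tokens from the row's values.
function resolvePath(template: string, row: Record<string, string>, date: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  return template
    .replace(/yyyy/g, String(date.getFullYear()))
    .replace(/mm/g, pad(date.getMonth() + 1))   // JavaScript months are zero-based
    .replace(/dd/g, pad(date.getDate()))
    .replace(/\{\{(\w+)\}\}/g, (_, field) => row[field] ?? "unknown");
}

// "transactions/category={{category}}/year=yyyy/month=mm/" with { category: "Electronics" }
// and a November 2024 date resolves to "transactions/category=Electronics/year=2024/month=11/".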

Upload Modes

  • Quick Upload - Upload all files at once
  • Step-by-Step - Upload one file at a time with manual control

Testing & Debugging

  • Test CORS - Verify your bucket's CORS configuration before uploading
  • Enable Console - See detailed upload logs and error messages

✂️ Split Files Feature

Automatically split your dataset into multiple files based on date or field values - perfect for partitioned data lakes.

Split by Date

  • Creates separate files for each unique date in your dataset
  • Requires date placeholders in S3 directory path (yyyy/mm/dd)
  • Must use "Random Dates" option to generate multiple date values
  • Perfect for daily/monthly/yearly partitions

Split by Fields

  • Creates separate files for each unique combination of field values
  • Requires field placeholders in S3 path ({{fieldname}})
  • Example: Split by country and status creates files for each country/status combo
  • Ideal for Hive-style partitioning in data warehouses
Example: Sales Analytics
Configuration:
- S3 Path: sales/category={{category}}/year=yyyy/month=mm/
- Split by Fields: ✓ Enabled
- Split by Date: ✓ Enabled
- Random Dates: 2023-01-01 to 2025-12-31

Result: 
Files organized like:
└── sales/
    ├── category=Electronics/year=2024/month=11/sales.csv
    ├── category=Electronics/year=2024/month=10/sales.csv
    ├── category=Clothing/year=2024/month=11/sales.csv
    └── category=Clothing/year=2024/month=10/sales.csv
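
Under the hood, splitting is essentially a group-by on the resolved path: every unique path becomes its own file. The sketch below builds on the resolvePath sketch from the S3 Upload section; the dateField name is an assumption for illustration.

type Row = Record<string, string>;

// Group rows by their resolved S3 path; one output file is written per map key.
function groupRowsByPath(rows: Row[], template: string, dateField = "date"): Map<string, Row[]> {
  const groups = new Map<string, Row[]>();
  for (const row of rows) {
    const path = resolvePath(template, row, new Date(row[dateField]));
    const group = groups.get(path) ?? [];
    group.push(row);
    groups.set(path, group);
  }
  return groups;
}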

Validation

The tool validates your configuration:

  • Warns if "Split by Date" is enabled but no date placeholders found
  • Warns if "Split by Fields" is enabled but no field placeholders found
  • Shows preview of files to be created before upload

🔄 Batch Processing

Automatically process multiple configurations in sequence - perfect for populating entire data warehouses.

How It Works

  1. Create and save multiple configurations (e.g., "Customers", "Orders", "Products")
  2. Fill in your S3 credentials once
  3. Click "Start Batch Upload"
  4. The tool automatically (sketched after this list):
    • Loads each configuration
    • Generates the data
    • Uploads to S3 using each config's settings
    • Moves to the next configuration
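
In outline, the batch run is a loop over saved configurations. The sketch below is illustrative only: the generate, upload, and confirm callbacks stand in for the tool's internal steps, which are not documented here.

type BatchRow = Record<string, string>;

// Process each configuration in order, recording successes and failures.
async function runBatch(
  configNames: string[],
  generate: (configName: string) => BatchRow[],
  upload: (configName: string, rows: BatchRow[]) => Promise<void>,
  confirm?: (configName: string) => Promise<boolean>  // "Pause for Confirmation" mode
): Promise<void> {
  let succeeded = 0, failed = 0;
  for (const name of configNames) {
    try {
      const rows = generate(name);
      if (confirm && !(await confirm(name))) break;   // "Stop Batch" ends the run here
      await upload(name, rows);
      succeeded++;
    } catch {
      failed++;                                        // recorded, then move to the next config
    }
  }
  console.log(`Batch finished: ${succeeded} succeeded, ${failed} failed`);
}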

Pause for Confirmation Mode

  • Enable this option to review data before each upload
  • Batch process pauses after generating each dataset
  • Click "Continue" to proceed with upload or "Stop Batch" to cancel
  • Perfect for validating data quality before pushing to S3

Progress Tracking

  • Real-time status updates for each configuration
  • Success/Failed/Skipped counters
  • Detailed console logs for troubleshooting
  • Final summary report when complete
💡 Pro Tip: Use the built-in presets as starting points. Load a preset, modify the row count and S3 path, save with a new name, and repeat for each table you need.

💾 Configuration Management

Save your field selections, settings, and S3 configurations for reuse.

Built-in Presets

12 professionally designed presets ready to use:

  • Customer Contact List (500 rows)
  • Employee Directory (1,000 rows)
  • Sales Transaction Log (100,000 rows)
  • Product Inventory (2,500 rows)
  • Marketing Campaign Leads (5,000 rows)
  • User Registration Data (50,000 rows)
  • IT Asset Management (1,000 rows)
  • E-commerce Orders (100,000 rows)
  • Customer Support Tickets (25,000 rows)
  • Financial Transactions (100,000 rows)
  • Event Registration List (10,000 rows)
  • Website Analytics Sample (500,000 rows)

Creating Custom Configurations

  1. Select your desired fields
  2. Configure generation settings (row count, dates, IDs, etc.)
  3. Set S3 path and filename (optional)
  4. Click "Save Config"
  5. Enter a name for your configuration
  6. Configuration is saved to browser localStorage
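
Because configurations live in localStorage, saving and loading is plain key-value JSON. The sketch below is illustrative only: the storage key and the shape of GeneratorConfig are assumptions, not the tool's actual schema.

// Hypothetical config shape and storage key, for illustration only.
interface GeneratorConfig {
  fields: string[];     // selected field names
  rowCount: number;
  filename: string;
  s3Path?: string;
}

function saveConfig(name: string, config: GeneratorConfig): void {
  const all = JSON.parse(localStorage.getItem("csvGeneratorConfigs") ?? "{}");
  all[name] = config;
  localStorage.setItem("csvGeneratorConfigs", JSON.stringify(all));
}

function loadConfig(name: string): GeneratorConfig | undefined {
  const all = JSON.parse(localStorage.getItem("csvGeneratorConfigs") ?? "{}");
  return all[name];
}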

Managing Configurations

  • Load - Select from dropdown and click "Load"
  • Save - Update the currently loaded configuration
  • Save As - Save current settings with a new name
  • Delete - Remove the currently selected configuration
  • Clear All - Delete all saved configurations (with confirmation)

Import/Export Configurations

  • Export All - Download all configurations as a JSON file
  • Import - Load configurations from a JSON file
  • Share configurations with team members
  • Backup and restore your presets

📋 Console Logging

Real-time visibility into all operations with a professional terminal-style console.

What Gets Logged

  • Data generation progress and statistics
  • File import and parsing details
  • Field validation and auto-add operations
  • File grouping and splitting logic
  • S3 upload preparation and execution
  • Success/failure status with detailed error messages
  • CORS configuration issues and solutions
  • Batch processing progress and results

Console Features

  • Color-coded messages - Green for success, red for errors, yellow for warnings, blue for info
  • Timestamps - Track when operations occurred
  • Clear button - Start fresh when needed
  • Auto-scroll - Follows new messages automatically
💡 Debugging Tip: Enable console logging before performing S3 uploads to see exactly what's happening at each step, including the specific error messages if uploads fail.

🔧 Troubleshooting

Import Issues

Problem: Import fails with "incompatible fields" error
Solution: Enable the "Auto-add unknown fields" checkbox before importing the file. Custom fields will be added to the selection grid automatically.
Problem: Parquet file won't import
Solution: Ensure the file is a valid Apache Parquet format. Large files may take 5-10 seconds to parse. Check the console log for specific errors.

S3 Upload Issues

Problem: "CORS policy" error when uploading to S3
Solution:
  1. Click "Test CORS" button to verify bucket configuration
  2. Add the following CORS configuration to your S3 bucket:
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["PUT", "POST", "GET"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": ["ETag"]
    }
]
Problem: "SignatureDoesNotMatch" error
Solution: Verify your Access Key ID and Secret Access Key are correct. Ensure there are no extra spaces when pasting credentials.

File Splitting Issues

Problem: Files aren't splitting by date
Solution:
  • Ensure "Random Dates" is enabled (not Fixed Date)
  • Verify your S3 path contains date placeholders (yyyy, mm, dd)
  • Check "Split by Date" checkbox is enabled
Problem: Files aren't splitting by fields
Solution:
  • Verify your S3 path contains field placeholders like {{country}}
  • Ensure the field name in {{}} matches exactly (case-sensitive)
  • Check "Split by Fields" checkbox is enabled

Performance Issues

Generating large datasets (500k+ rows):
  • May take 10-30 seconds depending on your computer
  • Browser may show "page unresponsive" warning - click "Wait"
  • Consider generating in smaller batches and combining later
  • Parquet format generates faster than CSV for large datasets

Configuration Issues

Problem: Configurations disappeared after browser refresh
Solution: Configurations are stored in browser localStorage and should persist. If they're missing:
  • Check if you accidentally clicked "Clear All"
  • Private/incognito browsing does not save localStorage between sessions
  • Export your configurations regularly as backup

General Tips

  • Enable Console Logging - Provides detailed information about what's happening
  • Test with small datasets first - Generate 100 rows before trying 100,000
  • Use built-in presets - Start with working configurations and modify as needed
  • Check browser console - Press F12 to see JavaScript errors if something fails