📊 CSV Generator Pro

Complete Help Documentation

Version 2.9.2

🎯 Overview & Key Features

CSV Generator Pro is a powerful, browser-based tool for generating realistic test data with advanced features for data engineering, analytics testing, and development workflows.

Core Capabilities

  • 41+ Field Types - Generate diverse data including names, addresses, products, transactions, and more
  • Multi-Format Support - Export as CSV, NDJSON, or Parquet files
  • Import & Convert - Load existing data files and convert between formats with auto-field detection
  • Deterministic IDs - Generate consistent IDs for SQL joins across datasets
  • AWS S3 Integration - Direct upload with dynamic path templating and Hive-style partitioning
  • File Splitting - Automatic split by date fields or custom field values
  • Batch Processing - Process multiple configurations automatically
  • Configuration Management - Save and load presets with 12 built-in templates

🆕 What's New in v2.9.2

Auto-Add Unknown Fields - A new checkbox in the Import Data section automatically adds custom fields from imported files. Enable it to add any fields not in the standard list, or disable it to reject files that contain incompatible fields.

🚀 Getting Started

Quick Start (30 seconds)

  1. Open csv-generator-pro.html in your browser
  2. Select fields you want (or click "Select Common")
  3. Set number of rows (default: 1000)
  4. Click "Generate Data"
  5. Click "Download CSV" to save the file
💡 Pro Tip: Start with one of the 12 built-in presets by clicking the configuration dropdown and selecting a template like "Customer Contact List" or "Sales Transaction Log".

📂 Import Data UPDATED v2.9.2

Import existing CSV, NDJSON, JSON, or Parquet files to convert formats, modify data, or extend datasets.

Supported Formats

  • CSV - Comma-separated values with header row
  • NDJSON/JSON - Newline-delimited JSON (one object per line)
  • Parquet - Apache Parquet columnar format

Auto-Add Unknown Fields Feature NEW

🔧 How It Works

The "Auto-add unknown fields" checkbox (enabled by default) controls how the import handles fields not in the standard field list:

✅ When Enabled (Checked)

  • Automatically adds any custom fields from your file to the available fields list
  • Added fields appear in the field selection grid
  • Custom fields are treated as text/string data
  • Fields persist until you refresh the page
  • Perfect for importing data with company-specific or custom field names

❌ When Disabled (Unchecked)

  • Rejects files containing fields not in the standard list
  • Shows detailed error message with incompatible field names
  • Provides list of available fields for reference
  • Suggests enabling auto-add or modifying the file

Import Process

  1. Click "Choose File" in the Import Data section
  2. Select your CSV, NDJSON, JSON, or Parquet file
  3. If using custom fields, ensure "Auto-add unknown fields" is checked
  4. The file will be validated and loaded automatically
  5. Fields from the file will be auto-selected in the grid
  6. Review the data in the preview table
  7. Choose output format and download or upload to S3
💡 Example Use Case: Import a Parquet file from your data warehouse with custom fields like customer_id, order_total, and shipping_method. With auto-add enabled, these fields will be automatically added to the selection grid and you can export to CSV for Excel analysis.

Field Validation

The import system validates all field names and provides detailed feedback:

  • Compatible fields - Standard fields that exist in the default list
  • Custom fields - Fields automatically added when auto-add is enabled
  • Incompatible fields - Fields rejected when auto-add is disabled
⚠️ Important Notes:
  • Custom fields added during import will not have data generation capabilities
  • Custom fields are treated as text values from the imported data
  • Refresh the page to reset to default fields only
  • Large Parquet files (100k+ rows) may take a few seconds to parse
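
To make the validation step concrete, the sketch below shows one way imported field names could be classified. It is illustrative only: the STANDARD_FIELDS set is abbreviated and the function is not the tool's actual code.

// Illustrative sketch: classify imported field names against the standard field list.
// STANDARD_FIELDS is abbreviated here; the real tool ships 41+ field types.
const STANDARD_FIELDS = new Set([
  "id", "uuid", "name", "email", "phone", "city", "country", "date", "status",
]);

interface FieldReport {
  compatible: string[];   // already in the standard list
  custom: string[];       // added to the grid when auto-add is enabled
  incompatible: string[]; // cause the import to be rejected when auto-add is disabled
}

function classifyImportedFields(imported: string[], autoAdd: boolean): FieldReport {
  const report: FieldReport = { compatible: [], custom: [], incompatible: [] };
  for (const field of imported) {
    if (STANDARD_FIELDS.has(field)) report.compatible.push(field);
    else if (autoAdd) report.custom.push(field);   // treated as text/string data
    else report.incompatible.push(field);          // listed in the error message
  }
  return report;
}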

🎯 Field Selection

Choose from 41+ field types to customize your dataset:

Available Fields by Category

  • Identifiers - id, uuid, username
  • Personal Info - firstName, lastName, name, email, phone, age, gender
  • Location - address, city, state, country, zipCode, latitude, longitude
  • Business - company, jobTitle, department, salary, hireDate, startDate
  • Products - product, sku, category, subcategory, price, quantity, revenue
  • Dates & Times - date, timestamp, hireDate, startDate
  • Status & Ratings - status, priority, rating, boolean, isActive, score
  • Text Fields - description, notes, url, website, ipAddress

Selection Controls

  • Select All - Selects all 41+ fields
  • Deselect All - Clears all selections
  • Select Common - Selects frequently used fields (id, name, email, phone, city, country, date, status)
💡 Tip: Click a field's label, its checkbox, or the card itself to toggle selection.

⚙️ Generation Controls

Number of Rows

Set how many records to generate (1 to 1,000,000). Recommended ranges:

  • 1-10,000 - Quick testing and development
  • 10,000-100,000 - Medium-scale testing
  • 100,000-500,000 - Performance testing and analytics
  • 500,000+ - Large-scale data warehouse simulation

Filename

Specify the base filename for downloads and S3 uploads. Extensions are added automatically based on output format.

Example: Setting filename to customer_data will create customer_data.csv, customer_data.ndjson, or customer_data.parquet depending on your format selection.

📅 Date Options

Control how date fields are generated in your dataset.

Fixed Date

  • All date fields use the same date
  • Useful for daily snapshots
  • Default: Today's date

Random Dates

  • Each record gets a random date within your specified range
  • Perfect for historical data and time-series analysis
  • Set start and end dates to define the range
  • Ideal for creating partitioned data sets
⚠️ Note: If you select "Split by Date" for S3 uploads, you must use "Random Dates" to generate multiple date values for proper file splitting.
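
For reference, drawing a uniformly random date between the two bounds can be as simple as the sketch below (an assumption about the approach, not the tool's actual code).

// Pick a random timestamp between start and end, then convert it back to a Date.
function randomDateInRange(start: Date, end: Date): Date {
  const t = start.getTime() + Math.random() * (end.getTime() - start.getTime());
  return new Date(t);
}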

🔢 Deterministic IDs

Generate consistent, predictable IDs that enable SQL joins across multiple datasets.

How It Works

Deterministic IDs are created by hashing specific field combinations:

  • Basic Method - Uses: firstName + lastName
  • Standard Method - Uses: firstName + lastName + email
  • Enhanced Method - Uses: firstName + lastName + email + date
  • Auto Method - Automatically selects the best available fields
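
The sketch below illustrates the idea. The tool's exact hash function is not documented here; this example assumes a simple 32-bit string hash and the default ID range of 10,000,000.

// Illustrative sketch only: lowercase and join the chosen fields, hash the result,
// and map the hash into the configured ID range.
function deterministicId(fields: string[], maxId = 10_000_000): number {
  const key = fields.map(f => f.trim().toLowerCase()).join("|"); // case-insensitive
  let hash = 0;
  for (let i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) >>> 0; // keep within unsigned 32 bits
  }
  return (hash % maxId) + 1;
}

// Standard method (firstName + lastName + email):
const customerId = deterministicId(["Sarah", "Johnson", "sarah.j@example.com"]);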

Use Cases

Example: Multi-Table Relationships
  1. Generate "customers.csv" with 1000 records using Standard method
  2. Generate "orders.csv" with 5000 records using same Standard method
  3. The same customer (e.g., "John Smith" with "john.smith@example.com") gets the same ID in both tables
  4. You can now JOIN the tables on the ID field in SQL

Configuration

  • Enable Deterministic IDs - Toggle this checkbox to activate
  • Method Selection - Choose which fields to use for ID generation
  • ID Range - Set the maximum ID value (default: 10,000,000)
💡 Case-Insensitive: IDs are generated with lowercase normalization, so "Philip Larson" and "philip larson" produce the same ID.

🧮 ID Calculator

Calculate deterministic IDs without needing the original dataset - perfect for reverse lookups and verification.

Using the Calculator

  1. Click "Open ID Calculator" button
  2. Select the method that was used to generate your data
  3. Fill in the required fields (form adapts to method selected)
  4. Click "Calculate ID" to get the deterministic ID

Example Scenario

Problem: You generated customer data last week but need to know what ID was assigned to "Sarah Johnson" with email "sarah.j@example.com"

Solution: Open the ID Calculator, select "Standard" method, enter "Sarah", "Johnson", and "sarah.j@example.com", and click Calculate to instantly get her ID.

📄 Output Format Options

CSV (Comma-Separated Values)

  • Universal compatibility with Excel, databases, and analytics tools
  • Human-readable text format
  • Includes header row with field names
  • Best for: General use, spreadsheet analysis, simple data exchange

NDJSON (Newline-Delimited JSON)

  • One JSON object per line
  • Streaming-friendly format
  • Perfect for log analysis and event data
  • Best for: APIs, log processing, streaming applications

Parquet (Apache Parquet)

  • Columnar storage format optimized for analytics
  • Excellent compression ratios (typically 10-100x smaller than CSV)
  • Fast query performance in Athena, Redshift, Spark
  • Preserves data types and schema information
  • Best for: Data warehouses, big data analytics, AWS Athena
💡 Format Conversion: Import any format and export to any other format. Load a CSV, export as Parquet. Load Parquet, export as NDJSON. The tool handles all conversions seamlessly.
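
To show how simple the CSV-to-NDJSON direction is, here is a minimal sketch. It assumes an unquoted, comma-only CSV with a header row; the Parquet directions require a Parquet library and are omitted.

// Reshape CSV rows into one JSON object per line (NDJSON).
function csvToNdjson(csvText: string): string {
  const [headerLine, ...rows] = csvText.trim().split("\n");
  const headers = headerLine.split(",");
  return rows
    .map(row => {
      const values = row.split(",");
      // Pair each header with its value; missing trailing values become empty strings.
      const record = Object.fromEntries(headers.map((h, i) => [h, values[i] ?? ""]));
      return JSON.stringify(record);
    })
    .join("\n");
}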

⬇️ Download Files

Save generated data to your local computer.

Download Options

  • Single File - Download all data in one file
  • Multiple Files - When file splitting is enabled, downloads a ZIP containing all split files

Process

  1. Generate your data
  2. Select output format (CSV, NDJSON, or Parquet)
  3. Click "Download" button (specific to format selected)
  4. File(s) will be saved to your browser's download folder
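
For context, browser downloads of generated text are typically triggered with a Blob and a temporary object URL, roughly as sketched below (the tool's own code may differ).

// Create a Blob from the generated content and trigger a download via a temporary link.
function downloadFile(content: string, filename: string, mimeType = "text/csv"): void {
  const blob = new Blob([content], { type: mimeType });
  const url = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = url;
  link.download = filename;   // e.g. "customer_data.csv"
  link.click();
  URL.revokeObjectURL(url);   // release the temporary URL once the download starts
}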

☁️ AWS S3 Upload

Upload generated data directly to Amazon S3 with advanced path templating and partitioning.

Required Credentials

  • S3 Bucket Name - Your bucket name (e.g., "my-data-bucket")
  • AWS Region - Select from dropdown (e.g., us-east-1)
  • Access Key ID - Your AWS access key
  • Secret Access Key - Your AWS secret key
⚠️ Security Note: Credentials are stored only in your browser's memory for the current session. They are never sent anywhere except directly to AWS S3.
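
For reference, a direct browser-to-S3 upload with static credentials looks roughly like the sketch below, using the AWS SDK for JavaScript v3. The bucket, region, and key values are placeholders, and the tool's internal upload code is not shown here.

import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

// Upload one generated file to S3 using the credentials entered in the form.
async function uploadToS3(body: string, key: string): Promise<void> {
  const client = new S3Client({
    region: "us-east-1",                          // AWS Region
    credentials: {
      accessKeyId: "YOUR_ACCESS_KEY_ID",          // Access Key ID (placeholder)
      secretAccessKey: "YOUR_SECRET_ACCESS_KEY",  // Secret Access Key (placeholder)
    },
  });
  await client.send(new PutObjectCommand({
    Bucket: "my-data-bucket",                     // S3 Bucket Name
    Key: key,                                     // e.g. "sales/year=2024/month=11/day=02/sales.csv"
    Body: body,
    ContentType: "text/csv",
  }));
}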

S3 Directory Path (Dynamic Templating)

Use placeholders to create dynamic, organized directory structures:

Date Placeholders

sales/year=yyyy/month=mm/day=dd/
└── Hive-style partitioning: year=2024/month=11/day=02/

Field Placeholders

customers/country={{country}}/status={{status}}/
└── Dynamic grouping: country=USA/status=active/

Combined Approach

transactions/category={{category}}/year=yyyy/month=mm/
└── Both field and date partitioning for optimal query performance
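
Conceptually, each output file's path is produced by substituting the date tokens and {{field}} tokens. The sketch below shows one way this could work; it assumes field names do not themselves contain the tokens yyyy, mm, or dd, and it is not the tool's actual parser.

// Resolve date tokens from the row's date and {{field}} tokens from the row's values.
function resolvePath(template: string, row: Record<string, string>, date: Date): string {
  const pad = (n: number) => String(n).padStart(2, "0");
  return template
    .replace(/yyyy/g, String(date.getFullYear()))
    .replace(/mm/g, pad(date.getMonth() + 1))   // JavaScript months are zero-based
    .replace(/dd/g, pad(date.getDate()))
    .replace(/\{\{(\w+)\}\}/g, (_, field) => row[field] ?? "unknown");
}

// "transactions/category={{category}}/year=yyyy/month=mm/" with { category: "Electronics" }
// and a November 2024 date resolves to "transactions/category=Electronics/year=2024/month=11/".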

Upload Modes

  • Quick Upload - Upload all files at once
  • Step-by-Step - Upload one file at a time with manual control

Testing & Debugging

  • Test CORS - Verify your bucket's CORS configuration before uploading
  • Enable Console - See detailed upload logs and error messages

✂️ Split Files Feature

Automatically split your dataset into multiple files based on date or field values - perfect for partitioned data lakes.

Split by Date

  • Creates separate files for each unique date in your dataset
  • Requires date placeholders in S3 directory path (yyyy/mm/dd)
  • Must use "Random Dates" option to generate multiple date values
  • Perfect for daily/monthly/yearly partitions

Split by Fields

  • Creates separate files for each unique combination of field values
  • Requires field placeholders in S3 path ({{fieldname}})
  • Example: Split by country and status creates files for each country/status combo
  • Ideal for Hive-style partitioning in data warehouses
Example: Sales Analytics
Configuration:
- S3 Path: sales/category={{category}}/year=yyyy/month=mm/
- Split by Fields: ✓ Enabled
- Split by Date: ✓ Enabled
- Random Dates: 2023-01-01 to 2025-12-31

Result: 
Files organized like:
└── sales/
    ├── category=Electronics/year=2024/month=11/sales.csv
    ├── category=Electronics/year=2024/month=10/sales.csv
    ├── category=Clothing/year=2024/month=11/sales.csv
    └── category=Clothing/year=2024/month=10/sales.csv
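
Under the hood, splitting is essentially a group-by on the resolved path: every unique path becomes its own file. The sketch below builds on the resolvePath sketch from the S3 Upload section; the dateField name is an assumption for illustration.

type Row = Record<string, string>;

// Group rows by their resolved S3 path; one output file is written per map key.
function groupRowsByPath(rows: Row[], template: string, dateField = "date"): Map<string, Row[]> {
  const groups = new Map<string, Row[]>();
  for (const row of rows) {
    const path = resolvePath(template, row, new Date(row[dateField]));
    const group = groups.get(path) ?? [];
    group.push(row);
    groups.set(path, group);
  }
  return groups;
}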

Validation

The tool validates your configuration:

  • Warns if "Split by Date" is enabled but no date placeholders found
  • Warns if "Split by Fields" is enabled but no field placeholders found
  • Shows preview of files to be created before upload

🔄 Batch Processing

Automatically process multiple configurations in sequence - perfect for populating entire data warehouses.

How It Works

  1. Create and save multiple configurations (e.g., "Customers", "Orders", "Products")
  2. Fill in your S3 credentials once
  3. Click "Start Batch Upload"
  4. The tool automatically (sketched after this list):
    • Loads each configuration
    • Generates the data
    • Uploads to S3 using each config's settings
    • Moves to the next configuration
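
In outline, the batch run is a loop over saved configurations. The sketch below is illustrative only: the generate, upload, and confirm callbacks stand in for the tool's internal steps, which are not documented here.

type BatchRow = Record<string, string>;

// Process each configuration in order, recording successes and failures.
async function runBatch(
  configNames: string[],
  generate: (configName: string) => BatchRow[],
  upload: (configName: string, rows: BatchRow[]) => Promise<void>,
  confirm?: (configName: string) => Promise<boolean>  // "Pause for Confirmation" mode
): Promise<void> {
  let succeeded = 0, failed = 0;
  for (const name of configNames) {
    try {
      const rows = generate(name);
      if (confirm && !(await confirm(name))) break;   // "Stop Batch" ends the run here
      await upload(name, rows);
      succeeded++;
    } catch {
      failed++;                                        // recorded, then move to the next config
    }
  }
  console.log(`Batch finished: ${succeeded} succeeded, ${failed} failed`);
}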

Pause for Confirmation Mode

  • Enable this option to review data before each upload
  • Batch process pauses after generating each dataset
  • Click "Continue" to proceed with upload or "Stop Batch" to cancel
  • Perfect for validating data quality before pushing to S3

Progress Tracking

  • Real-time status updates for each configuration
  • Success/Failed/Skipped counters
  • Detailed console logs for troubleshooting
  • Final summary report when complete
💡 Pro Tip: Use the built-in presets as starting points. Load a preset, modify the row count and S3 path, save with a new name, and repeat for each table you need.

💾 Configuration Management

Save your field selections, settings, and S3 configurations for reuse.

Built-in Presets

12 professionally designed presets ready to use:

  • Customer Contact List (500 rows)
  • Employee Directory (1,000 rows)
  • Sales Transaction Log (100,000 rows)
  • Product Inventory (2,500 rows)
  • Marketing Campaign Leads (5,000 rows)
  • User Registration Data (50,000 rows)
  • IT Asset Management (1,000 rows)
  • E-commerce Orders (100,000 rows)
  • Customer Support Tickets (25,000 rows)
  • Financial Transactions (100,000 rows)
  • Event Registration List (10,000 rows)
  • Website Analytics Sample (500,000 rows)

Creating Custom Configurations

  1. Select your desired fields
  2. Configure generation settings (row count, dates, IDs, etc.)
  3. Set S3 path and filename (optional)
  4. Click "Save Config"
  5. Enter a name for your configuration
  6. Configuration is saved to browser localStorage
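
Because configurations live in localStorage, saving and loading is plain key-value JSON. The sketch below is illustrative only: the storage key and the shape of GeneratorConfig are assumptions, not the tool's actual schema.

// Hypothetical config shape and storage key, for illustration only.
interface GeneratorConfig {
  fields: string[];     // selected field names
  rowCount: number;
  filename: string;
  s3Path?: string;
}

function saveConfig(name: string, config: GeneratorConfig): void {
  const all = JSON.parse(localStorage.getItem("csvGeneratorConfigs") ?? "{}");
  all[name] = config;
  localStorage.setItem("csvGeneratorConfigs", JSON.stringify(all));
}

function loadConfig(name: string): GeneratorConfig | undefined {
  const all = JSON.parse(localStorage.getItem("csvGeneratorConfigs") ?? "{}");
  return all[name];
}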

Managing Configurations

  • Load - Select from dropdown and click "Load"
  • Save - Update the currently loaded configuration
  • Save As - Save current settings with a new name
  • Delete - Remove the currently selected configuration
  • Clear All - Delete all saved configurations (with confirmation)

Import/Export Configurations

  • Export All - Download all configurations as a JSON file
  • Import - Load configurations from a JSON file
  • Share configurations with team members
  • Backup and restore your presets

📋 Console Logging

Real-time visibility into all operations with a professional terminal-style console.

What Gets Logged

  • Data generation progress and statistics
  • File import and parsing details
  • Field validation and auto-add operations
  • File grouping and splitting logic
  • S3 upload preparation and execution
  • Success/failure status with detailed error messages
  • CORS configuration issues and solutions
  • Batch processing progress and results

Console Features

  • Color-coded messages - Green for success, red for errors, yellow for warnings, blue for info
  • Timestamps - Track when operations occurred
  • Clear button - Start fresh when needed
  • Auto-scroll - Follows new messages automatically
💡 Debugging Tip: Enable console logging before performing S3 uploads to see exactly what's happening at each step, including the specific error messages if uploads fail.

🔧 Troubleshooting

Import Issues

Problem: Import fails with "incompatible fields" error
Solution: Enable the "Auto-add unknown fields" checkbox before importing the file. Custom fields will be added to the selection grid automatically.
Problem: Parquet file won't import
Solution: Ensure the file is a valid Apache Parquet format. Large files may take 5-10 seconds to parse. Check the console log for specific errors.

S3 Upload Issues

Problem: "CORS policy" error when uploading to S3
Solution:
  1. Click "Test CORS" button to verify bucket configuration
  2. Add the following CORS configuration to your S3 bucket:
[
    {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["PUT", "POST", "GET"],
        "AllowedOrigins": ["*"],
        "ExposeHeaders": ["ETag"]
    }
]
Problem: "SignatureDoesNotMatch" error
Solution: Verify your Access Key ID and Secret Access Key are correct. Ensure there are no extra spaces when pasting credentials.

File Splitting Issues

Problem: Files aren't splitting by date
Solution:
  • Ensure "Random Dates" is enabled (not Fixed Date)
  • Verify your S3 path contains date placeholders (yyyy, mm, dd)
  • Check "Split by Date" checkbox is enabled
Problem: Files aren't splitting by fields
Solution:
  • Verify your S3 path contains field placeholders like {{country}}
  • Ensure the field name in {{}} matches exactly (case-sensitive)
  • Check "Split by Fields" checkbox is enabled

Performance Issues

Generating large datasets (500k+ rows):
  • May take 10-30 seconds depending on your computer
  • Browser may show "page unresponsive" warning - click "Wait"
  • Consider generating in smaller batches and combining later
  • Parquet format generates faster than CSV for large datasets

Configuration Issues

Problem: Configurations disappeared after browser refresh
Solution: Configurations are stored in browser localStorage and should persist. If they're missing:
  • Check if you accidentally clicked "Clear All"
  • Private/incognito browsing does not save localStorage between sessions
  • Export your configurations regularly as backup

General Tips

  • Enable Console Logging - Provides detailed information about what's happening
  • Test with small datasets first - Generate 100 rows before trying 100,000
  • Use built-in presets - Start with working configurations and modify as needed
  • Check browser console - Press F12 to see JavaScript errors if something fails