At some point during the development lifecycle someone, be it a developer, QA, or a business owner, asks for "real" data to test against. Typically this is for two reasons, they want the volume of data to match production, or the "fake" data doesn't make any sense. A request such as this came across my desk this week for a load test. Rather than use production data, I thought this would be a good opportunity to try out Redgate's SQL Data Generator Tool.
Redgate SQL Data Generator
Redgate's SQL Data Generator was written for this very scenario, generating data for testing. How it works is it reads the schema of the database and determines the type of data needed. If the column is a date, it will generate random date data. If the column is a varchar, it will generate random strings. The same for ints, bits, and so on.
Random strings, such as AG4f3dks, are typically not very helpful. But the folks over at Redgate have thought that through. SQL Data Generator also looks at the column names and compares that against a rules engine. If the name matches a rule, it will instead generate pre-defined data. Chances are you probably have a column in your database called FirstName or NameFirst. SQL Data Generator will see that column name and populate that column with a set of predefined first names.
For each table SQL Data Generator allows you to control how much data is generated. By default, it is 1,000. For each column you can also control:
- Percentage of Null Entries (if null is allowed)
- Uniqueness of the data
- The min and max possible value (for dates and numbers)
First Try: All Random Data
I started up SQL Data Generator, pointed it at the database, told it to create a lot of random data, with the volume of data between a couple of hundred rows and a couple of million rows. SQL Data Generator kicked off and about 10 minutes later I had a database full of data. Yay!
Except for one little problem. The data was too random. If such a thing existed. The person working on the load test told me his tests weren't going to be testing as close to production as possible. In addition to that, the scope changed. Another person doing the testing was hoping to use the same data set to do some manual testing.
Second Try: Only PII Data Randomized
SQL Data Generator worked as advertised, but it didn't meet our needs. What they wanted was production data but with all the Personally Identifiable Information (PII) randomized.
Side Note: What's funny is no matter where I work this is the most requested item. People want the data to match production, but with all the PII data scrubbed. I can see their point; it helps with their testing. Random data serves no purpose; it needs context.
I fired back up SQL Data Generator with the hope it could do that. I was happy to find out SQL Data Generator supports that scenario.
Note: As I was setting up the second try, I had a feeling I would be doing this again, so I took screenshots to create a step by step guide so I could remember how to do this after a couple of months.
In a nutshell, SQL Data Generator will be configured to pull data from a source database. Any PII data will be replaced with random data.
Select the database to populate
After logging in you will be presented with the SQL Data Generator UI. I forgot to take a screenshot, so here is one from Redgate's site!
Deselect unwanted tables
On the left menu uncheck the tables you don't want to be populated.
Select Source Tables
Notice in the UI you have two options, generate data or use existing data source. Click on the radio button next to use existing data source. Then click on the browse button to select the data source.
You will be prompted to select the source database.
Next up select the table. Chances are the schema matches, which means the table names match, the UI will automatically select the table name for you.
Now the UI will look like this after you select a data source table.
Change Individual Column Data
In the bottom part of the UI, you can click on individual columns. When doing so the top part of the UI will show the default column mapping.
In this example, we will be using the NameFirst column. If you click on the text box next to the label "Generator", you will be presented with an extensive drop-down list. Because this is for first names select Personal -> First Name
Now the UI will change to show you instead of pulling from the source table it will instead use names from the "NameFirst.txt" file that came with SQL Data Generator.
For other columns such as the street number or zip code, you can select Business -> 5 Digit IDs. This creates a predefined regular expression. In the screenshot below I changed it to generate a ten digit number.
Wash, Rinse, Repeat
Repeat the previous steps for each of the tables you want to populate. Depending on the size of the database this can be rather tedious. But, this only has to be done once, just save as a SQL Data Generator project for re-use!
Click the "Generate Data..." button to kick off the process
Warning! By default all existing data will be truncated
Verify the data. SQL Data Generator includes a preview feature, which helps, but it is always good to verify the data after it finishes generating it. Often I found another column I wanted to be populated with random data. I would update the column in the SQL Data Generator UI and kick it off again. The tool will truncate and reload all the data.
One thing to note about randomizing PII data. Be sure the business doesn't require address verification using Melissa data. SQL Data Generator will generate random addresses which will fail that check. You might also have other business rules, in that case, you will need to make use of Python scripts. I didn't have to do this, so I never had the chance to capture a screenshot. Here is another one from Redgate!
In the case of this particular application, we didn't have any custom rules, so we were able to go to town on generating random PII data.
When I handed over the new dataset to the testers, they were much happier. The data met their needs. It matched real-world data, but without compromising our data security policies.
The power of SQL Data Generator is the ability to target specific columns and tables to generate random data for. All the other tables, lookup tables, tables not containing PII data, and so on, can be pulled in as is. Not only can I generate data for testers to use, but I can also generate data for developers to have for their local SQL Server instances. Being able to do that, I see SQL Data Generator as a handy tool in Software Development Lifecycle and Database Lifecycle Management processes.