How to Generate Mock Sales Data

12

Search for mock data or mockup data, and you’ll have plenty of options for items like customers or regions. Mockaroo is one popular choice among many for generating such data and will generate up to 1000 records for free. When you try try to find sales data that is worth using, that is another story altogether. The choices are much more limited. This article can show you how to generate meaningful mock sales data.

Challenge

Generating mock sales data is challenging because there are many components to it. The data itself is quite simple and often includes the transaction date, the customer number, the products and quantities of those products. Perhaps it contains the store number to, but to keep it simple, we’ll use only the core components.

The three main tables of most sales schemas are the sales (sometimes called orders), customers, and products. 

Other tables, like stores, regions, and returns, advertising and marketing, can also be integrated, but these are handled in a similar fashion to how products and customers work.

For this tutorial, I will concentrate on just the sales generation, as it’s the most involved procedure to generate records. For products and customers, you can easily find generators online or data that you can download.

Be careful on how you use any data that you find and check to make sure you have rights to use it. If you have a list of common first names and you create an Excel macro to combine them with common last names (Smith, Jones, etc.) then you should be fine. However, this article should not be misconstrued as offering any legal advice. Please seek the advice of a legal professional for this.

The proper format for sales data is a set of pointers to other tables. It is a set of keys and dates. For instance, the sales data I will show you how to generate in this tutorial will contain the following:

Order Entry

Note: due to the tool I was using for my analysis after generating data, the headers in the above sample file are different than what will be generated from the algorithm. Feel free to name them whatever names work for you.

With the exception of the date field, the other fields are numeric pointers. The first column (Order ID, which is orderNum ) is generated as a sequence. The other numeric fields are keyed to other tables that contain descriptions of those keys.

Methodology for Generating Orders

I tried as hard as I could to come up with procedures to generate the orders table completely from Excel and with no macros. I tried to use random number generation followed by IF statements. But I had to do a lot of manually data entry as a result that is proved to be not useful as an order generation table.

I started implementing VBA code (code that results from recording macros). But even that proved too difficult. Instead, I created the routines in R language, which was much easier due to its extensive tools available in the language.

Generating the code in R had its own set of challenges, but they were manageable. For instance, R does not support pass-by-reference variable access. But you can get around this problem by using lists. That is the solution I adopted (more on this later).

My first intuition was to determine the number of orders ahead of time and then try to fill in each of the columns based on a set of rules. In trying to recreate how sales would be generated in a real setting, the following had to be considered:

  • Multiple customers can buy on any given day.
  • Any customer can buy one or many products. Customers who browse and don’t buy don’t need to be considered for sales data. That would be more like foot traffic data, which is difficult to model. For online stores, traffic is easier to model through analytics data. But this is not considered here (perhaps a future enhancement?)
  • To determine the quantity ordered or purchased, customers usually buy one quantity of most products. For this model, I choose to use probabilities for purchasing 1, 2, 3, or 4 quantities of a product. One (1) has the highest probability with four (4) having the smallest. I realize that customers could buy even more than 4, but I had to draw the line somewhere. Feel free to adjust this with more numbers or different probabilities.
  • Some customers are repeat buyers. There are a few ways to handle this scenario. One of the easiest is to have more orders with less available customers – those customers would be repeated by random selection. In other words, if there are 100 orders and 100 customers, the chances of repeats are much less than 100 orders and only five customers in the customer table.
  • When generating order numbers, a customer who buys multiple items should have the same order number for all the items purchased on a given transaction date. This scenario does not take into account time of day. Theoretically, a customer can return to the store on the same day and make multiple purchases. That’s rare enough and unimportant enough not to include in a model.

One feature that won’t be included in this tutorial is associating related products. For example, while a customer may not buy high-end bikes frequently, they may return to a bike store to purchase accessories. Also, it’s likely that salespeople will upsell related products. For instance, a new bike owner may not own a helmet. The salesperson will advise the newbie on the importance of having a helmet. The salesperson may also sell water bottles that fit easily on the new bike. These purchasing habits would alter the sales data significantly. It adds a level of complexity that I will tackle as an enhancement.

Getting Started

One caveat I should mention is that this tutorial will not cover R coding. The constructs used in the tutorial are standard and can be learned from several tutorials on the web. A good place to start is with Data Camp’s R tutorial. They let you try out the first chapter before you need to upgrade. The first chapter covers quite a bit of R coding and is a great overview.

Coding Script

We’re going to create a routine called GenerateOrders that returns a data frame. When you save a data frame using the write.csv() command, it will automatically generate the appropriate columns for you into a Comma-Separated File (CSV). A CSV is one of the most popular formats to use. The file will contain the following fields:

  • OrderNum
  • TransactionDate
  • CustomerID
  • ProductID
  • Quantity

As mentioned above, these column names are different than what appears in the image of the orders table spreadsheet.

I include the R constructs that I used to create this project. This way, you can find information online if you aren’t familiar with them. You can focus your efforts on these constructs, although it won’t hurt to learn as much as you can about the complete language. Here are the constructs used:

  • Data Frames
  • Random number generators (mostly sample() with probabilities associated with them)
  • Function creation
  • Lists (for workaround on lack of pass-by-reference support in R)
  • For loops
  • The rbind() command to add new data frame objects.
  • Reading (future update) and writing CSV files.
  • Working directories

All these constructs are standard within the core R language. In fact, there aren’t any libraries included in the code that I provide. If you already have a basic knowledge of R, you should find this tutorial relatively easy to follow along. If you are new to R, you should consider taking a few tutorials on the language.

You are welcome to bypass learning the language and simply use the code I provide to create sales data for your use. However, if you want to make adjustments to suit your needs, you’ll have a difficult time knowing what to do without a basic knowledge of the R language.

The GenerateOrders() function takes the following form:

GenerateOrders <- function(startingDate, customerMax, productMax, numDates)

Here is a description of the parameters:

startingDate – you can choose whatever date you like for the startingDate parameter. You’ll want to choose a date sometime in the past. With the data I generated for demonstration, I used 11/28/2017.

customerMax – is used to enter the largest Customer Key in your schema. Depending on your requirements, you may want to have this information predetermined. If you are going to use it in a BI reporting tool, each of the Customer IDs will need to be valid.

Note: the algorithm assumes that each number from 1 to the customerMax exists in the customer table. This may not be a valid assumption. For instance, if your customer table contains 1, 2, 3, 5, 6 as valid customer numbers, the number 4 is missing. This algorithm may include (randomly) the number 4. One workaround to this problem is to re-sequence a new field in Excel for customers and use that re-sequence as the key. This would also solve the problem if your customer key contains alphanumeric characters instead of sequential numbers.

As an enhancement, I plan on reading customer tables and extracting the sequence numbers directly. This would eliminate the need to enter a value for customerMax. The same is true for productMax (next).

productMax – this field is the highest number product id, assuming that product id’s are sequential and all numeric. See the above explanations for customerMax for possible enhancements and workarounds.

numDates – the algorithm will sequence a series of dates starting with the starting date. It will use this sequence of dates as the main iterator in the GenerateCustomer routine. Bear in mind that the total number of items generated is likely to be much greater than the number you enter for this parameter. For demonstration purposes, I chose 1000 dates to be initially generated. Due to the rules specified previously, the number of items that were generated from the algorthim was over 5,000.

Algorithm

GenerateOrders <- function(startingDate, customerMax, productMax, numDates) {

  df <- data.frame(orderNum=NULL, transactionDate=NULL, customerID=NULL, productID=NULL, quantity=NULL) 

  oNum <- 1

  dateArr <- seq(as.Date(startingDate), by = “day”, length.out = numDates)

  dateArr <- as.character(dateArr)

  for(dt in dateArr) {

    dfTemp <- GenerateCustomers(dt, customerMax, productMax, oNum)

    df <- rbind(df, dfTemp[[2]])

    oNum <- dfTemp[[1]] # this is needed as R does not support by reference variables

  }

  return (df)
}

The first line initializes the main data frame that gets returned with the series of orders attached. It sets the column names and sets the values to NULL. Setting all the values to NULL sets up this data frame as a placeholder without creating a record.

oNum is short for order number. This counter will be needed to continually increase as orders are added to the data frame. As mentioned, R has no native ability to pass parameters by reference. That is why GenerateCustomers returns a list instead. The list contains the oNum and the data frame containing the records for each data. The oNum is retrieved and updated after the call so that the continuous sequence for order numbers is maintained. Without this, oNum would reset to 0 after each call to GenerateCustomers.

Next, we create a sequence of dates that will be used as the iterator of the loop. This I name dateArr. I chose to change the dates to strings as when I first created this, it was changing the values to the numeric equivalent of the dates. For some reason, it wasn’t correct, though. The day and month were right, but the year was for 1950. CSV files don’t understand date formats anyway. Making it a string (using as.character()) is a safer way to approach this.

The next step is the central focus of the algorithm, i.e., to pass each date to the GenerateCustomer() function. The GenerateCustomers() function creates a data frame using the rules described previously. More detail about the code itself will be described in the next section. The function returns a list containing the order number (oNum) and the data frame filled with orders based on the rules. This gets added to the master data frame that will be returned from GenerateOrders().

GetProductQuantity <- function() {

  x = c(1, 2, 3, 4)

  px = c(0.70, 0.17, 0.1, 0.03)

  #set.seed(123)

  sample(x, size = 1, replace = TRUE, prob = px)

}

As the name suggests, the GetProductQuantity() function is used to generate the quantity field of the orders table. The probabilities weight the purchase of one quantity per product as the highest. The purchase of four of a product occurs at the lowest probabilities. Feel free to adjust the probabilities and to add or subtract numbers as you see fit. It’s not required for the probabilities to add up to one, but that is done for most of the random generation with probabilities in this tutorial.

GetCustomerQuantity <- function() {

  x = c(1, 2, 3, 4)

  px = c(0.40, 0.30, 0.20, 0.10)

  #set.seed(123)

  sample(x, size = 1, replace = TRUE, prob = px)

}

The GetCustomerQuantity() function will create between one and four customers per order date. This satisfies the rule that multiple customers can buy on a given date. Just like the product sampling, the weighting can be adjusted on its probabilities. You can also add more customers if that satisfies your requirements.

GenerateCustomers <- function(dt, customerMax, productMax, oNum) {

  df <- data.frame(orderNum = NULL, transactionDate = NULL, customerID = NULL, productID = NULL, quantity = NULL)

   customersToGenerate <- GetCustomerQuantity() #simulates multiple customers per date

  for (cq in 1:customersToGenerate) {

    customerIDs <- sample(customerMax, customersToGenerate, FALSE)

    keepSameCustomer <- sample(c(TRUE,FALSE), prob = c(0.40, 0.60), 1)

    if (customersToGenerate > 1 && keepSameCustomer) {

      customerIDs[2:customersToGenerate] = customerIDs[1]

    }

       for (cust in 1:length(customerIDs)) {

      dfTemp <- CreateTransaction(dt, customerIDs[cust], productMax, oNum)

      df <- rbind(df, dfTemp)

      if(!keepSameCustomer & cust < length(customerIDs)) {

        oNum <- oNum + 1

      }

    }

       oNum <- oNum + 1

  }

   retVal <- list(oNum, df)

  return (retVal)

}

The GenerateCustomers() function implements the rules describe previously. The parameters are:

dt – The date generated from the iterators in GenerateOrders(). The GenerateCustomers will be called for each date contained in the iterators.

customerMax – the largest customer id from the customers table. It is used as a bounds for the customer sampling. Assumes sequential ordering of customer IDs. See above text for explanation.

productMax – the largest product id from the products table. It’s used as a bounds for the product sampling. Assumes sequential ordering of product IDs. See above text for explanation.

oNum – the order numbers associated with the date (dt).

GenerateCustomers() Algorithm Explanation

The function starts out by initializing a data frame, which will contain the series of orders associated with the date.

It determines the number of customers to generate for this date using the call GetCustomerQuantity() which is a random sample of between one and four customers. It uses weighted probabilities. This number is also used to bound the first for loop for each customer in the list.

The first step in the customer for loop is to generate a random sequence of customers based on the returned number of customers from GetCustomerQuantity(). The sample is weighted and you are free to experiment with the weights.

The keepSameCustomer variable is used to satisfy the rule that some customers may purchase multiple products within the same order on the given date. It uses a weighted probability sample returning TRUE or FALSE. If it is TRUE, all the customers in this batch will be set to the first customer in the list of customerIDs.

The batch of customers will either be different customer IDs or the same, based on the results of the keepSameCustomer. Either way, the batch is written to the data frame using the CreateTransaction.

As you’ll notice, there is a provision in the routine to keep the same oNum if the keepSameCustomer is set to TRUE. This satisfies the rule of one customer can buy multiple products. If keepSameCustomer is FALSE, the oNum is incremented each time the data frame is generated.

Once all the appropriate orders for customers (either same or different) are generated, the data frame containing those orders are placed on the list along with the oNum. This helps to maintain the value of oNum when the function returns to GenerateOrders. As mentioned previously, R does not support by reference variables, which makes this workaround necessary.

CreateTransaction <- function(dt, customerID, productMax, oNum) {

  orderNum = oNum

  transactionDate = dt

  customerID = customerID

  #set.seed(123)

  productID = sample(1:productMax, 1)

  quantity = GetProductQuantity()

  df <- data.frame(orderNum, transactionDate, customerID, productID, quantity)

   return (df)

}

The CreateTransaction() function is mostly a passthrough function. It does generate the quantity (discussed previously) value in this module. This module could probably be replaced by a direct call in the GenerateCustomer() function.

How to Use Mock Sales Data Generator

Sales Data

Now that you have the code to create sales or order data, how to you run it? You’ll need a R programming installation to make it work. However, the good news is you have several choices with this. The first is to install R on your local hard drive. You can find out everything you need to do this here (free):

https://www.r-project.org/

If you decide to go that route, it pays to also install a graphical interface. Most people choose RStudio, as it’s one of the most robust solutions. The URL for this solution is:

https://rstudio.com

It’s best to install R first before installing RStudio. It makes the configuration much easier.

If you don’t want to install a local installation, RStudio has an out-of-the-box solution. It’s cloud-based, and you can do most of what you can do with a local installation. Here is the link:

https://rstudio.cloud

You’ll need to sign up for an account. You can use social media accounts, including github.com. If you do development in any form, having a github.com account is handy. However, you can use your Google account to sign in as well. You’ll have an extra step with this option when you save the result to your workspace (on the cloud). You’ll have to export the results to your local drive to use them.

https://rdrr.io

This is a brute-force option that let’s you run your code. You can use this option to run the code, but if you are looking to retrieve the results as a CSV file (more below), I haven’t found a way to do that using this solution. For this reason, I don’t recommend this option. The code will run, but it won’t be useful if you can’t retrieve the results.

You can run the solution as follows:

orders <- GenerateOrders(‘2017-11-28’, 1000, 15, 1000)

write.csv(orders, “orders.csv”, row.names = FALSE)

Change the parameters in the GenerateOrders() function to suit your needs. If you only have 500 customers with sequential customerIDs from 1 to 500, then that is the number to enter for the customerMax parameter (500). Adjust the productMax parameter in the same manner.

The code for this project worked on all the platforms. As mentioned for rdrr.io, there didn’t seem to be any means of downloading the CSV file generated from the solution.

Conclusion

Hopefully, you found this solution for generating mock sales data to be useful. It may not do everything you need, but it does supply you with a framework to adapt however you like. When I make updates to this module (such as the enhancements mentioned), I’ll reference this article for easy access.

Let me know in the comments if you found this solution to be useful and how you used it. What changes or enhancements did you make? Also, please don’t hesitate to share this on social media.

Resources

Download complete code – the code resides on my GitHub repository.

Mockaroo – generates various files including names, which could be used as customer files.

DataCamp – the easier way to learn data science.

Find Your Use Cases – data won’t do you much good without putting it to some kind of use. The following article describes how you can find the right use cases that you need.

Super Data Science – this resource is comprehensive and will teach you several necessary and useful data science techniques. R coding is part of the curriculum.

Need some more resources to help learn data science? The following article has a bunch you can use to help with this effort.

Data Sources

https://datasciencereview.com/data-sources/