Implementing a product recommendation solution using Microsoft Azure
- Overview
- Problem statement
- Solution
- Solution architecture
- Setting up the Azure services
- Prerequisites
- Deploying the Azure landscape using PowerShell
- Publishing the retailer demo website using Visual Studio
- Testing the product recommendation site
- Verifying execution of the Azure Data Factory workflow
- Viewing product-to-product recommendations
- Deep dive into the solution
- Prepare
- Analyze
- Publish
- Consume
- Next steps
- Useful resources
- Roll back Azure changes
- Terms of use
Prepare
The raw customer usage web log files are partitioned in the data preparation step into a year/month structure for efficient querying and scalable long-term storage. (One blob account fails over to the next as the first account fills up.) The output data (labeled PartitionedProductsUsageTable) is kept for a long period as the foundational, rawest form of data in the customer's “data lake.” The input data to this pipeline would typically be discarded, because the output preserves full fidelity to the input and is simply better organized (partitioned) for subsequent use.
The raw data is partitioned using a Hive HDInsight activity in PartitionProductUsagePipeline. A year's worth of sample data generated above is partitioned by year/month to generate customer usage data partitions for each month of the year (12 partitions total).
This is achieved using the partitionproductusage.hql Hive script in the demo\scripts\productrecommendations\ folder of the downloaded zip file. The PartitionProductUsage Hive script follows.
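The shipped script is in the zip file; as a rough illustration of the technique, a dynamic-partitioning Hive script of this kind might look like the following sketch. The table names, column names, and storage paths here are assumptions for illustration only, not the contents of partitionproductusage.hql.

```sql
-- Hypothetical sketch of year/month dynamic partitioning in Hive.
-- Table, column, and path names are illustrative assumptions.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Raw, unpartitioned usage logs as landed in blob storage.
CREATE EXTERNAL TABLE IF NOT EXISTS RawProductsUsageTable (
  UserId    STRING,
  ProductId STRING,
  EventDate STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://productrec@<storageaccount>.blob.core.windows.net/rawdata/';

-- Target table, partitioned by year and month for efficient querying.
CREATE EXTERNAL TABLE IF NOT EXISTS PartitionedProductsUsageTable (
  UserId    STRING,
  ProductId STRING,
  EventDate STRING
)
PARTITIONED BY (Year INT, Month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'wasb://productrec@<storageaccount>.blob.core.windows.net/partitioneddata/';

-- Route each row into its year/month partition (assumes yyyy-MM-dd dates).
INSERT OVERWRITE TABLE PartitionedProductsUsageTable PARTITION (Year, Month)
SELECT UserId, ProductId, EventDate,
       year(EventDate)  AS Year,
       month(EventDate) AS Month
FROM RawProductsUsageTable;
```

With a year of sample data, an insert of this shape yields the 12 year/month partitions described above.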
Once the pipeline is executed successfully, the following partitions will be generated in your storage account under the productrec container. (See demo/productrec-accounts.txt file for logon details.)
After the customer usage web logs data is partitioned for manageability, performance, and availability, the data is prepared/shaped before feeding it into the machine learning model to generate recommendations. This is accomplished using the PrepareMahoutUsagePipeline in the following workflow.
The open-source Mahout engine is used in this demo to provide personalized product recommendations to customers on the retail site. The Mahout engine expects the customer usage data in a particular format, namely CustomerID, ProductID. This is achieved using the preparemahoutinput.hql Hive script in the demo\scripts\productrecommendations\ folder of the downloaded zip file.
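The shaping step described above amounts to projecting the partitioned usage data down to the two columns Mahout expects. A minimal sketch of such a script follows; the output path and table name are assumptions, not the contents of the shipped preparemahoutinput.hql.

```sql
-- Hypothetical sketch: shape usage data into Mahout's expected
-- CustomerID,ProductID input. Names and paths are illustrative assumptions.
INSERT OVERWRITE DIRECTORY 'wasb://productrec@<storageaccount>.blob.core.windows.net/mahoutinput/'
SELECT UserId, ProductId
FROM PartitionedProductsUsageTable;
```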
Analyze
The prepared customer usage web logs data is then fed to a machine learning model to generate user-to-product recommendations and product-to-product recommendations. This is done using two pipelines: the ProductsSimilarityMahoutPipeline and the ProductsRecommenderMahoutPipeline. Each pipeline uses a MapReduce HDInsight activity in Azure Data Factory to invoke the Mahout JAR file uploaded to your storage account as part of the setup script.
The ProductsSimilarityMahoutPipeline uses the ItemSimilarityJob in Mahout to generate a product similarity matrix.
The output of the pipeline is in ProductId1, ProductId2, Probability format.
The ProductsRecommenderMahoutPipeline uses the RecommenderJob in Mahout to generate a personalized product recommendation matrix.
The output of the pipeline is in UserId, Recommendations format.
The Mahout recommendation engine benefits retailers in the following ways:
1. Generates personalized product recommendations (recommendations per user based on that user's preferences).
2. Allows pre-computation of the recommendations cache. Retailer websites can look up the cache to optimize page loads, which drives greater user engagement and, ultimately, helps increase sales.
3. Provides data and computation locality. Depending on retailer size and popularity, usage data may be very large, so keeping computation close to the data limits data movement as much as possible.
Product-to-product recommendations
For the product-to-product recommendation process, a threshold of 70 percent has been selected for this demo to determine similar products based on the item similarity matrix (generated as part of the ProductsSimilarityMahoutPipeline). Retailers can set this threshold to any value by editing the Hive script used to generate the item similarity matrix. This is done using a Hive HDInsight activity in the MapSimilarProductsPipeline in ADF.
The Hive script selectsimilarproducts.hql is located in the demo\scripts\productrecommendations\ folder of the downloaded zip file.
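The thresholding step is essentially a filter over the similarity matrix. A minimal sketch of such a script, assuming hypothetical table and column names that mirror the ProductId1, ProductId2, Probability output format described above (not the contents of the shipped selectsimilarproducts.hql), follows:

```sql
-- Hypothetical sketch: keep only product pairs whose similarity score
-- clears the 70 percent threshold. Table/column names are assumptions.
INSERT OVERWRITE TABLE SimilarProductsTable
SELECT ProductId1, ProductId2, Probability
FROM ProductsSimilarityTable
WHERE Probability >= 0.7;
```

Changing the `0.7` literal is what adjusting the demo's threshold would amount to in a script of this shape.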
After the pipeline is executed successfully, it generates the item similarity matrix (product-to-product recommendations) in the following format in an Azure Blob location.
User-to-product recommendations
For user-to-product recommendations, the output generated above by the ProductsRecommenderMahoutPipeline is exploded to produce one row per recommendation, in UserId, RecommendedProductId format. This is done using a Hive HDInsight activity in the MapRecommendedProductsPipeline in ADF.
The Hive script recommendedprducts.hql is located in the demo\scripts\productrecommendations\ folder of the downloaded zip file.
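Exploding a per-user recommendation list into one row per product is a natural fit for Hive's `LATERAL VIEW explode`. A minimal sketch, assuming the recommendations arrive as a comma-separated string and using hypothetical table names (not the contents of the shipped script), follows:

```sql
-- Hypothetical sketch: explode each user's comma-separated recommendation
-- list into one (UserId, RecommendedProductId) row per recommended product.
INSERT OVERWRITE TABLE RecommendedProductsTable
SELECT UserId, RecommendedProductId
FROM ProductRecommendationsTable
LATERAL VIEW explode(split(Recommendations, ',')) recs AS RecommendedProductId;
```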
After the pipeline is executed successfully, it generates the personalized recommendation matrix (user-to-product recommendations) in the following format in an Azure Blob location.
