Thursday, January 22, 2015

Azure Point to Site VPN with SQL Server and FileTables


There are times when you have a hybrid environment, it's not feasible to join Azure to your current Active Directory, and you need a solution that allows Azure access to your on-premises environment.  While it's straightforward to access a SQL Server instance residing on an Azure virtual machine, it requires more effort for Azure to access your on-premises SQL Server.  One solution is a point-to-site VPN.
Configuring an on-premises instance of SQL Server to be accessible from an Azure VM requires the following:
  1. Azure Point to Site VPN
  2. SQL Server Configured for Remote Access
  3. FileTable Configured for Remote Access
Azure Point to Site VPN:
Microsoft gives in-depth instructions for setting up a point-to-site VPN here:
https://msdn.microsoft.com/en-us/library/azure/dn133792.aspx
Once you've set up the VPN, connect to your virtual network.  This blog will use MarinerNet.
We'll need the internal IP address of the on-premises machine.  Open Network and Sharing Center and click MarinerNet to bring up the Status dialog.
image
Click Details in the Status dialog and retrieve the internal IP address assigned by Azure.
image

Configure the On-Premises SQL Server for Remote Access:
In order to access the machine from Azure you'll need to allow remote connections for the SQL Server instance and open the port that SQL Server is listening on.
To allow remote connections, open SSMS and open Server Properties.  Click Connections and check Allow remote connections to this server.
image
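If you prefer to script that setting rather than click through SSMS, the checkbox corresponds to the remote access server configuration option; a minimal sketch:

-- Equivalent of checking "Allow remote connections to this server" in SSMS
EXEC sp_configure 'remote access', 1;
RECONFIGURE;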
Now that we've allowed remote connections, we need to let the Azure VM through the firewall.  This can be done by opening port 1433, the default SQL Server port.
Go to Control Panel\System and Security\Windows Firewall and click Advanced Settings

image

In Windows Firewall with Advanced Security, click Inbound Rules and then, under Actions, New Rule.  The dialog below will appear; for the rule type select Port.
image

Then, for Protocol and Ports, select Specific local ports and enter 1433.
image
Next, for the action, select Allow the connection when a machine attempts to connect on port 1433.
image
Leave all of the profiles checked so the rule always applies.
image
Now give it a name.  I used Sql Port Inbound.
image
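As a sanity check, you can confirm which TCP port the instance is actually listening on (this assumes you're connected over TCP and have VIEW SERVER STATE permission):

SELECT DISTINCT local_tcp_port
FROM sys.dm_exec_connections
WHERE local_tcp_port IS NOT NULL;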
Connectivity has now been set up between the Azure VM and the on-premises SQL Server.
Using the internal IP address obtained earlier and SQL Server authentication (because the Azure VM and the on-premises SQL Server are in two different domains), we can connect to the on-premises SQL Server from the Azure VM.
 image
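Because we're using SQL Server authentication, the on-premises instance needs a SQL login for the Azure VM to connect with, and the instance must allow mixed-mode authentication.  A minimal sketch with a hypothetical login name, using the AdventureWorks2012 database referenced later in this post:

-- Hypothetical login name; requires SQL Server and Windows Authentication mode
CREATE LOGIN AzureReader WITH PASSWORD = 'UseAStrongPasswordHere!1';
USE AdventureWorks2012;
CREATE USER AzureReader FOR LOGIN AzureReader;
GRANT SELECT ON SCHEMA::dbo TO AzureReader;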

Now that we're able to connect to SQL Server from Azure, let's also add access to the FileTables on the on-premises SQL Server instance.
This requires allowing remote access to FILESTREAM data and creating a local user on the on-premises SQL Server.
On the FILESTREAM tab of the SQL Server instance's properties in SQL Server Configuration Manager, check the box that allows remote clients access to FILESTREAM data.
image
Create a local user on the SQL Server machine and grant that user access to the FileTable in SQL Server.
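Here's a minimal sketch of the SQL Server side of that step, assuming a hypothetical local Windows account named FileTableUser and a hypothetical FileTable named dbo.Documents in AdventureWorks2012; the FILESTREAM access level also needs to allow full Win32 streaming access:

-- Hypothetical account and FileTable names
EXEC sp_configure 'filestream access level', 2;  -- 2 = T-SQL plus Win32 streaming access
RECONFIGURE;

USE AdventureWorks2012;
CREATE LOGIN [YOURSERVER\FileTableUser] FROM WINDOWS;
CREATE USER [YOURSERVER\FileTableUser] FOR LOGIN [YOURSERVER\FileTableUser];
GRANT SELECT, INSERT, UPDATE, DELETE ON dbo.Documents TO [YOURSERVER\FileTableUser];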
From the Azure VM, all you need to do is open the FileTable share path
\\10.0.0.2\mssqlserver\AdventureWorks2012\AdventureWorks2012FS
and enter the local user credentials when prompted.
image

The main drawback to this approach is that the internal IP address can change each time the VPN connection is reestablished.  One way around this is a virtual network per machine: for example, create MarinerNet1 for Sql Server1 with 10.0.0.2 as the only possible address and MarinerNet2 for Sql Server2 with 10.0.0.3 as the only possible address.

Friday, January 27, 2012

Using SQL Partitions and the $partition function to define Cube Measure Group Partitions

 

Recently I partitioned a fact table in SQL Server and, not surprisingly, when it came to the cube this table's measure group also needed to be partitioned.  The SQL Server partitioning was a sliding month scheme with a partition for each of the 12 months, one for the previous year, and an Archive partition.

Instead of maintaining a separate date-range query for each cube partition, I allowed SQL Server to handle the partitioning for me by creating the measure group partitions based on views that use the $PARTITION function.

SELECT *
FROM Fact
WHERE $PARTITION.SlidingMonth(TransactionDate) IN (15, 14, 13)


The first three months were partitions 15, 14, and 13, and so on for the remaining views.
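Wrapped in a view it looks something like this (the view name is hypothetical; the SlidingMonth partition function and the Fact table are from the query above), and the measure group partition is then bound to the view instead of a hand-maintained date-range query:

CREATE VIEW dbo.vwFactCurrentQuarter
AS
SELECT *
FROM Fact
WHERE $PARTITION.SlidingMonth(TransactionDate) IN (15, 14, 13);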


I've also used the $PARTITION function to keep cube measure group partitions of relatively equal size.


SELECT *
FROM Fact
WHERE $PARTITION.SlidingMonth(TransactionDate) % 4 = 2


Remember to make sure that the partitions in SQL Server and the cube are working in tandem and that you're not adding unnecessary complexity.

Tuesday, May 17, 2011

Retrieving SSIS Error Column Names

I recently coded a data load for a client, and they requested that I include the names of the columns on which the package errored.  The solution I came up with uses the lineage IDs from the package's XML.  I'll take you through the code I used.
The first thing I did was load the document and set up the namespace manager, where PackagePath is the location of the package file.
XmlDocument doc = new XmlDocument();
doc.Load(Variables.PackagePath);

XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
nsmgr.AddNamespace("DTS", "www.microsoft.com/SqlServer/Dts");


Then I retrieved each node that has both a lineageId and a name attribute using an XPath query.

foreach (XmlNode childnode in doc.SelectNodes("//*[@lineageId != '' and @name != '']"))


I then pulled out the TaskName, ColumnName, and LineageID.  I included the TaskName because LineageIDs are not unique within a package.


XmlNode ExecutableNode = childnode.SelectSingleNode("ancestor::DTS:Executable[1]", nsmgr);
TaskName = ExecutableNode.SelectSingleNode("DTS:Property[@DTS:Name='ObjectName']", nsmgr).InnerText;
ColumnName = childnode.Attributes["name"].Value;
LineageID = Convert.ToInt32(childnode.Attributes["lineageId"].Value);


I then deduplicated within the task by only inserting unique TaskName, ColumnName, and LineageID combinations, using a Hashtable.


DistinctColumnKey = LineageID + ColumnName + TaskName;
if (!DistinctColumn.ContainsKey(DistinctColumnKey))
{
    DistinctColumn.Add(DistinctColumnKey, DBNull.Value);
    ColumnNamesBuffer.AddRow();
    ColumnNamesBuffer.LineageID = LineageID;
    ColumnNamesBuffer.ColumnName = ColumnName;
    ColumnNamesBuffer.TaskName = TaskName;
}


I usually insert the column names into a cache connection manager and then use them throughout the package.  Here's the entire script.


public override void CreateNewOutputRows()
{
    Int32 LineageID;
    String ColumnName;
    String TaskName;
    XmlDocument doc = new XmlDocument();
    Hashtable DistinctColumn = new Hashtable();
    String DistinctColumnKey;

    // Load the package XML from the path supplied in the PackagePath variable.
    doc.Load(Variables.PackagePath);

    XmlNamespaceManager nsmgr = new XmlNamespaceManager(doc.NameTable);
    nsmgr.AddNamespace("DTS", "www.microsoft.com/SqlServer/Dts");

    // Every node that carries both a lineageId and a name is a candidate column.
    foreach (XmlNode childnode in doc.SelectNodes("//*[@lineageId != '' and @name != '']"))
    {
        // The nearest DTS:Executable ancestor is the task the column belongs to.
        XmlNode ExecutableNode = childnode.SelectSingleNode("ancestor::DTS:Executable[1]", nsmgr);
        TaskName = ExecutableNode.SelectSingleNode("DTS:Property[@DTS:Name='ObjectName']", nsmgr).InnerText;
        ColumnName = childnode.Attributes["name"].Value;
        LineageID = Convert.ToInt32(childnode.Attributes["lineageId"].Value);

        // Only emit each TaskName/ColumnName/LineageID combination once.
        DistinctColumnKey = LineageID + ColumnName + TaskName;
        if (!DistinctColumn.ContainsKey(DistinctColumnKey))
        {
            DistinctColumn.Add(DistinctColumnKey, DBNull.Value);
            ColumnNamesBuffer.AddRow();
            ColumnNamesBuffer.LineageID = LineageID;
            ColumnNamesBuffer.ColumnName = ColumnName;
            ColumnNamesBuffer.TaskName = TaskName;
        }
    }
}





Friday, November 5, 2010

Using a CTE in a SSRS Hidden Cascading Parameter

I had a situation where a client wanted to allow the report developer to tie each year of a report to a different dataset.  The user would choose the report year and, based on a hidden parameter set by the developer, the report would use the correct dataset.  The catch was that I couldn't create anything on the server.  The solution I came up with was a hidden cascading parameter using a CTE.
The visible parameter is @AcademicYear with the following parameter labels
20062007
20072008
20082009
20092010
20102011
For the hidden parameter I created a dataset called SnapShot with the following query.
WITH SnapShots (AcademicYear, SnapShot) AS
(
    SELECT '20062007', '1'
    UNION
    SELECT '20072008', '2'
    UNION
    SELECT '20082009', '2'
    UNION
    SELECT '20092010', '3'
    UNION
    SELECT '20102011', '4'
)
SELECT AcademicYear, SnapShot
FROM SnapShots
WHERE AcademicYear = LTRIM(RTRIM(@AcademicYearLabel))




The parameter @AcademicYearLabel is based on @AcademicYear.  To create it, choose Parameters in the Dataset Properties dialog.

image


Click the expression (fx) button to create an expression and choose the following:

Category – Parameters

Item – All

Values – AcademicYear

Under Set expression for: Value, change

=Parameters!AcademicYear.Value

to

=Parameters!AcademicYear.Label

image


Click OK.  The parameter value should be updated.

image


Now we'll create a parameter named @SnapShot to go with the dataset.  Set the Name and Prompt to SnapShot, the Data Type to Text, and the Parameter Visibility to Hidden.

image


Choose Default Values.  Select Get values from a query and choose the SnapShot dataset and the SnapShot value field.

image
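With the hidden parameter populated, the report's main dataset can branch on it.  As a purely hypothetical example (the table and column names are made up for illustration):

SELECT StudentID, AcademicYear, AssessmentScore
FROM dbo.StudentAssessment
WHERE SnapShotID = @SnapShot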





The report is now customizable by dataset.

Tuesday, October 26, 2010

SSIS Cache Transform as Source Query during For Loop

 

Recently I had a relatively slow-performing source query within a For Loop container.  The loop ran approximately 12 times, executing the query on each iteration.  I solved the problem by calling the query once, caching the results, and performing lookups instead of executing the query again.

Here’s the control flow

image

Going into DFL Cache Data

image

In order to perform a lookup that returns all of the relevant rows, the query for OLE_SRC School History Src needs a unique identifier.

SELECT ROW_NUMBER() OVER (ORDER BY RAND()) ID, *
FROM ComplexQuery



Since I'm going to use year as the parameter in the for loop, I'm placing the Cache Connection Manager index on ID and YearID.


image


Now that I've filled the cache, I'm going to loop by year over the data flow DFL Import DimSchool.


image


Here’s DFL Import DimSchool


image


Next, generate a list of numbers tied to the for loop variable.  To do this, create a variable called SQLCommand, set EvaluateAsExpression to True, and use the following expression:


"WITH Num1 (n) AS (SELECT 1 UNION ALL SELECT 1),
Num2 (n) AS (SELECT 1 FROM Num1 AS X, Num1 AS Y),
Num3 (n) AS (SELECT 1 FROM Num2 AS X, Num2 AS Y),
Num4 (n) AS (SELECT 1 FROM Num3 AS X, Num3 AS Y),
Num5 (n) AS (SELECT 1 FROM Num4 AS X, Num4 AS Y),
Num6 (n) AS (SELECT 1 FROM Num5 AS X, Num5 AS Y),
Nums (n) AS (SELECT ROW_NUMBER() OVER(ORDER BY n) FROM Num6)
SELECT n ID, " + (DT_WSTR, 4) @[User::_Year] + " YearID
FROM Nums
WHERE n <= 100000"

 

@[User::_Year] is the variable used in the for loop, so the value of YearID changes with each iteration.

 

Choose SQL command from variable as the data access mode and select SQLCommand as the variable name.  With a year of 2000 it results in the following query.

 

WITH Num1 (n) AS (SELECT 1 UNION ALL SELECT 1),
Num2 (n) AS (SELECT 1 FROM Num1 AS X, Num1 AS Y),
Num3 (n) AS (SELECT 1 FROM Num2 AS X, Num2 AS Y),
Num4 (n) AS (SELECT 1 FROM Num3 AS X, Num3 AS Y),
Num5 (n) AS (SELECT 1 FROM Num4 AS X, Num4 AS Y),
Num6 (n) AS (SELECT 1 FROM Num5 AS X, Num5 AS Y),
Nums (n) AS (SELECT ROW_NUMBER() OVER(ORDER BY n) FROM Num6)
SELECT n ID, 2000 YearID
FROM Nums
WHERE n <= 100000


and the following output

 

ID    YearID
1     2000
2     2000
3     2000

The lookup is performed on ID and YearID.

 

image

I now have the same records I would’ve gotten by executing the query using the YearID as a parameter.

Tuesday, October 19, 2010

Majority Late Arriving Fact Lookups in SSIS

Usually when I load data into a data warehouse I retrieve only the changes.  Since changes are normally applied to the most recent records, doing a lookup on the natural key of the current dimension record, with a partial-cache lookup for any rows that don't match the current record (to handle type 2 history), works out well.  I recently had a situation where I needed to reprocess the entire table on every run.  We won't go into why this was the case; needless to say, it's not good.  Consequently, performance was horrendous because 70% of the lookups were partial.
My solution was to use a Merge Join and a Conditional Split against the entire dimension table.
image
Let’s start with the dimension (OLE_SRC Dimension).  We’ll use DimStudent as the dimension.  Here’s the query I used
SELECT StudentID, StudentNaturalKey, EffectiveStartDate,
    COALESCE((SELECT MIN(EffectiveStartDate) FROM DW.DimStudent
              WHERE EffectiveStartDate > s.EffectiveStartDate
                AND StudentNaturalKey = s.StudentNaturalKey), '12/31/2099') NextEffectiveStartDate
FROM DW.DimStudent s
ORDER BY StudentNaturalKey



I'm pulling the surrogate key (StudentID), the natural key (StudentNaturalKey), and EffectiveStartDate, and deriving NextEffectiveStartDate instead of using EffectiveEndDate because the data warehouse may have gaps or overlaps in the dates.  I'm going to join on the natural key in the Merge Join, so I'm ordering by it.
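As an aside, on SQL Server 2012 or later the same NextEffectiveStartDate can be derived with the LEAD window function instead of a correlated subquery; a sketch under that assumption:

SELECT StudentID, StudentNaturalKey, EffectiveStartDate,
    COALESCE(LEAD(EffectiveStartDate) OVER (PARTITION BY StudentNaturalKey ORDER BY EffectiveStartDate), '12/31/2099') NextEffectiveStartDate
FROM DW.DimStudent
ORDER BY StudentNaturalKey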

This is the source import query

SELECT DISTINCT StudentNaturalKey, RecordDate
FROM Import.Student WITH (NOLOCK)
ORDER BY StudentNaturalKey


I'm pulling back the natural key and RecordDate from the source and ordering by StudentNaturalKey for the Merge Join.


Here's the Merge Join transformation joining on the natural key.


image


Next there’s the conditional split with the following condition to determine the correct record


ISNULL(RecordDate) || ISNULL(StudentID) || (RecordDate >= EffectiveStartDate && RecordDate < NextEffectiveStartDate)



If RecordDate is null, the source record has no date and consequently there is no corresponding record in the dimension table.  If StudentID is null, there was no corresponding record in the dimension.  Otherwise it checks whether the RecordDate is between the EffectiveStartDate and the NextEffectiveStartDate.

I then load the matching records into a cache connection manager.  This isn't the only way, but because of the complexity of the transformation data flow I would have had to use the Sort transformation for the merges, so caching and then using the Lookup transformation performed much better.

image

The cache consists of the natural key, record date, and StudentID.  I look up on the natural key and record date to get the surrogate key.  This allows me to keep the number of records to a minimum as records are often loaded in batches with the same record date.

Tuesday, October 5, 2010

Missing Indexes

I’m back from vacation.  It was wonderful.  Here’s the code I use to help me get a jump on indexes that may need to be created before I get complaints about system performance.
SELECT 
migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) AS improvement_measure, 
'CREATE INDEX [missing_index_' + CONVERT (varchar, mig.index_group_handle) + '_' + CONVERT (varchar, mid.index_handle) 
+ '_' + LEFT (PARSENAME(mid.statement, 1), 32) + ']'
+ ' ON ' + mid.statement 
+ ' (' + ISNULL (mid.equality_columns,'') 
+ CASE WHEN mid.equality_columns IS NOT NULL AND mid.inequality_columns IS NOT NULL THEN ',' ELSE '' END 
+ ISNULL (mid.inequality_columns, '')
+ ')' 
+ ISNULL (' INCLUDE (' + mid.included_columns + ')', '') AS create_index_statement, 
migs.*, mid.database_id, mid.[object_id]
FROM sys.dm_db_missing_index_groups mig
INNER JOIN sys.dm_db_missing_index_group_stats migs ON migs.group_handle = mig.index_group_handle
INNER JOIN sys.dm_db_missing_index_details mid ON mig.index_handle = mid.index_handle
WHERE migs.avg_total_user_cost * (migs.avg_user_impact / 100.0) * (migs.user_seeks + migs.user_scans) > 10
ORDER BY migs.avg_total_user_cost * migs.avg_user_impact * (migs.user_seeks + migs.user_scans) DESC


You’ll find queries like it all over the internet but not necessarily an explanation of what it’s telling you.  The SQL Server DMVs are based on the same concepts used in query plans and query optimization.

sys.dm_db_Missing_Index_Group_Stats – Updated By Every Query Execution

  1. Avg_Total_User_Cost – Average cost of the user queries that could be reduced by the index
  2. Avg_User_Impact – Percentage by which the average query cost would drop if the index were implemented
  3. User_Seeks – Number of seeks caused by queries for which this index could have been used
  4. User_Scans – Number of scans caused by queries for which this index could have been used

sys.dm_db_Missing_Index_Details – Updated Every Time Query is Optimized by the Query Optimizer

  1. Statement – Fully qualified name of the table where the index is missing
  2. Equality_Columns – Columns used in equality predicates (Column = 'a')
  3. Inequality_Columns – Columns used in any predicate other than equality, such as >
  4. Included_Columns – Columns needed to cover the query
  5. Database_ID – Database
  6. Object_ID – Table
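As a purely hypothetical example of what the create_index_statement column produces (the table and column names below are made up), a row might look like:

CREATE INDEX [missing_index_12_345_Orders] ON [SalesDb].[dbo].[Orders] ([CustomerID],[OrderDate]) INCLUDE ([TotalDue])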

The higher the improvement_measure, the greater the potential benefit.  As always with indexes, make sure you weigh all of the pros and cons before creating one.