With many companies relying on SharePoint for collaboration, content and web publishing services, there is growing concern about how to manage the service to expected (agreed-upon and documented) service levels. This becomes even more critical when third parties are involved, because the provider must be able to prove that SLAs are met and that performance penalties are avoided.
Reduced staffing and operating budgets, the introduction of outsourcing, multi-party involvement and the technical complexity of a SharePoint service offering: it’s a handful to manage. Another layer of complexity is the multiple points of view (business and technical teams) that monitoring must support. Business users want to know which sites they are being billed for, while service owners want to understand health, capacity trends and where potential risk lies. Infrastructure teams (SQL, servers, storage and network) want insight into capacity and health for trending and analysis purposes.
Many components must be monitored and recorded for trend analysis, with alerts generated when issues arise. Specifically, companies must monitor every component related to the service: provisioning of sites, growth of captured data, data deletion (disposition), page transaction time, .NET counters such as garbage collection (especially with custom code in the mix), server performance (CPU, memory, NIC, disk I/O), network appliances (switch capacity) and SAN (storage utilization and IOPS).
Here are the top 10 areas companies must address to offer a sustainable SharePoint service with minimal technical, business and financial risk:
- Budgets for tools and staffing – Budget trends are an issue in IT. SharePoint often ranks below the mission-critical applications that support billing, accounting and customer service. SharePoint service owners must work with business sponsors and the IT executive team to justify the tools that will enable them to manage technical risk and meet agreed-upon service levels. Aligning the stakeholders through a formal governance program is key to succeeding and to avoiding politics.
- Education – Beyond deploying monitoring, analysis and reporting services, it’s important to integrate the tools into day-to-day operations. For example, the problem management team must be educated about the monitoring tools and how they help with diagnosis. Business users must be educated about monitoring, what is possible with reporting and how the reports map to service levels. Change management practices can help integrate your investment in monitoring tools into day-to-day operations.
- Insight regarding counters and meaningful thresholds – There are many counters to monitor in a SharePoint environment. They cover all aspects of SharePoint, SQL Server, the application server (IIS/.NET), Windows, network components and storage. Tool vendors provide insight into which counters to monitor, as do resources such as the Microsoft PAL Toolkit (https://pal.codeplex.com/). Microsoft also publishes several helpful documents on the subject for both SharePoint and SQL Server. In general, monitor ASP.NET performance, errors (events), SharePoint jobs (indexing), Windows performance and capacity, network infrastructure performance and capacity, and storage performance. Finally, baselining your environment’s performance using load simulation will help you establish an operational norm, which is useful for setting thresholds and during troubleshooting and analysis.
- SharePoint server health, problem analysis and optimization – SharePoint collects data in its diagnostic log, enabling you to troubleshoot the environment. Default settings are generally sufficient, but if you’re troubleshooting more severe and elusive problems, such as bad code or misbehaving third-party add-ons, you must use Verbose Mode. At this level you capture as much data about the state of SharePoint as possible, but the logging introduces overhead. The logs can also grow very large, depending on how long the monitoring period is. Finally, analyzing the logs is resource-intensive and takes time. Tools that help you view, filter and find problem areas are needed so that the problem management process can operate more efficiently.
- Operational job status and duration – SharePoint operational jobs include Search crawls and profile imports, to name a few. The health of these jobs is important to the services they support, such as Search and Social. For health and capacity purposes, it’s important to monitor the start time, duration and completion time of these jobs, as well as any associated errors written to the event log. Doing so accomplishes three things: 1) confirming that jobs completed successfully, 2) proactively tracking job times and capacity increases and 3) proactively analyzing critical events such as job failures.
- Custom and/or third-party code – Custom code and third-party add-ons make monitoring, analysis and diagnosis more complex; specifically, you need to know what to monitor and how it can help with analysis. Making sure developers follow proper .NET patterns for error catching and event logging is critical, as is knowledge of and support for the third-party code (which the vendors can generally provide). When these components are introduced to your environment, it’s important to have a baseline before installation. Once they are installed, following quality assurance practices to re-baseline is critical to understanding how the components affect SharePoint’s performance and capacity; specifically, which new counters must be monitored and which thresholds must be established and/or adjusted.
- Application server (IIS/.NET) health, problem analysis and optimization – On the application server (IIS/.NET), it’s important to monitor ASP.NET performance (networking, loading and garbage collection), errors (events), and custom code and third-party components. When monitoring these components, work with the developer (for custom code) and the vendor (for third-party components) to determine what to monitor and how best to set alert thresholds.
- SQL Server health, problem analysis and optimization – SQL Server is the heart of SharePoint, and its health is critical to SharePoint performance. Having sufficient resources such as CPU, memory and disk is critical to preventing slowdowns and/or interruptions. For example, the disk subsystem must provide enough IOPS to prevent caching backlogs that can lead to slowdowns or to SQL Server disconnecting from SharePoint to protect data. Additionally, there must be an operational plan for database maintenance covering indexing and fragmentation to ensure optimal operation (clustered indexes are widely used by SharePoint). Working with the DBA and having SharePoint-specific experience are critical to ongoing management and to minimizing technical risk.
- Network health, problem analysis and optimization – With networking, understanding capacity and performance is critical; specifically, how the various networking components are performing. This becomes even more important in large shared environments hosting multiple neighboring applications (shared services environments). In such environments resources are shared, and knowing your neighboring applications will help you better monitor health, analyze problems and optimize. For example, capacity spikes caused by neighboring applications (network attacks, or large volumes of document uploads and downloads where caching isn’t effective) could lead to service performance and availability issues. Additionally, over-provisioning of networking equipment could lead to slowdowns by introducing latency between the domain controllers and SharePoint (impacting authentication), between the servers and backup services, between the Search crawler and content sources, and between clients and Web Front Ends.
- Storage health, problem analysis and optimization – When assessing storage capacity and performance, it’s important to take the same approach as with networking: understand how storage is performing in terms of IOPS and capacity growth. This becomes even more important in large shared environments hosting multiple neighboring applications (shared services environments). In such environments resources are shared, and knowing your neighboring applications will help you better monitor health, analyze problems and optimize. For example, capacity spikes caused by neighboring applications (disk-intensive operational jobs such as backups, various scans and indexing) could lead to service performance and availability issues. Additionally, over-provisioning of storage equipment could lead to slowdowns because fewer IOPS are available, which negatively impacts SQL Server performance and, ultimately, SharePoint.
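
Several of the areas above depend on turning a baseline into alert thresholds, whether for performance counters or for operational job durations. The following is a minimal sketch of one common way to do that, assuming thresholds are set at a fixed number of standard deviations above the baseline mean. The function names, the sigma multipliers and the sample crawl durations are hypothetical and chosen only for illustration; they do not come from SharePoint or from any monitoring product.

```python
from statistics import mean, stdev


def baseline_thresholds(samples, warn_sigma=2.0, crit_sigma=3.0):
    """Derive warning and critical thresholds from baseline samples.

    Assumption: higher values are worse (durations, latency, queue length)
    and thresholds sit a fixed number of standard deviations above the mean.
    """
    if len(samples) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(samples)
    sigma = stdev(samples)  # sample standard deviation
    return {
        "mean": mu,
        "warning": mu + warn_sigma * sigma,
        "critical": mu + crit_sigma * sigma,
    }


def classify(value, thresholds):
    """Return 'ok', 'warning' or 'critical' for a new observation."""
    if value >= thresholds["critical"]:
        return "critical"
    if value >= thresholds["warning"]:
        return "warning"
    return "ok"


# Hypothetical baseline: durations (in minutes) of recent Search crawls
# recorded while the environment was operating normally.
baseline_crawl_minutes = [42, 45, 40, 44, 43, 41, 46]
thresholds = baseline_thresholds(baseline_crawl_minutes)

# Classify a few new crawl durations against the baseline.
for minutes in (44, 48, 60):
    print(minutes, classify(minutes, thresholds))
```

The same two functions could be applied to any counter where a higher value signals trouble. After installing custom code or a third-party add-on, recomputing the thresholds from a fresh set of samples is one simple way to re-baseline.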