Storage requirements are a big topic for the IT infrastructure side of the business. Sizing MOSS 2007 is a challenging task in a large global organization. Both Microsoft and HP have run MOSS 2007 in their labs and come up with some interesting numbers.
The documents crawled were 10 kilobytes (KB) to 100 KB in size.
  • The Index Server configuration was as follows:
    • 4 dual-core Intel Xeon 2.66 GHz processors
    • 32 GB RAM
    • 40 GB for the operating system (RAID 5)
    • 956 GB for the content index and the operating system paging file (RAID 10)
The following is a summary of the content profile:
  • Content on SharePoint sites – 10 million items, including the following:
    • 420 site collections
    • 4,000 sites
    • 24,200 lists
    • 47,780 document libraries
  • Content on file shares – 15 million items
  • HTTP content – 15 million items
  • People profiles – 2.5 million
  • Stitch (in-memory test tool that generates documents in memory) – 7.5 million
  • Properties (metadata) – 1 million
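Summing the item counts above gives the total size of the test corpus, a useful sanity check before working with the crawl and storage figures that follow. A minimal sketch (the category names and counts are taken directly from the list above):

```python
# Item counts from the tested content profile, in millions of items
corpus = {
    "SharePoint sites": 10.0,
    "File shares": 15.0,
    "HTTP content": 15.0,
    "People profiles": 2.5,
    "Stitch-generated documents": 7.5,
    "Properties (metadata)": 1.0,
}

total_millions = sum(corpus.values())
print(f"Total corpus: {total_millions:.1f} million items")  # 51.0 million
```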
The following is a summary of disk space usage:
  • Index size on query server – 100 GB*
  • Index size on index server – 100 GB*
  • Search database size – 600 GB
* The tested index sizes are smaller than what might be observed in a production environment. In the test-generated corpus, the number of unique words is limited and often repeated.
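Dividing these measured sizes by the roughly 51 million items in the content profile gives back-of-the-envelope per-item storage figures. A hedged sketch, keeping in mind the caveat above that the test corpus underrepresents unique words, so production indexes will likely run larger:

```python
ITEMS = 51_000_000  # total items, summed from the content profile above

index_gb = 100  # index size on the query server (and on the index server)
db_gb = 600     # search database size

# Convert GB to bytes and divide by the item count
index_bytes_per_item = index_gb * 1024**3 / ITEMS
db_bytes_per_item = db_gb * 1024**3 / ITEMS

print(f"Index: ~{index_bytes_per_item:.0f} bytes per item")      # roughly 2 KB
print(f"Search DB: ~{db_bytes_per_item:.0f} bytes per item")     # roughly 12 KB
```

Multiplying figures like these against your own corpus estimate is one rough way to project index and database growth, though real-world content will shift the ratios.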
What I find interesting is the performance – a real eye-opener for some. The following is quoted from Microsoft's "Estimate performance and capacity requirements for search environments":
“The time to perform a full crawl during testing was 35 days (approximately 15 documents per second). Note that these test results were observed in a production environment where network latency and the responsiveness of the crawled repositories affected crawl speed. Crawl speed measured by documents per second might be significantly faster in a pure test environment, or in environments with greater bandwidth and greater responsiveness of crawled repositories.
If two percent of a corpus of the size used in the test environment changes, an incremental crawl to catch up with the changes takes approximately 8-12 hours, depending on latency and the responsiveness of the sites being crawled. Note that changes to metadata and outbound links take longer to process than changes to the contents of documents.”
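The quoted figures can be cross-checked with simple arithmetic. A sketch, assuming the ~51-million-item corpus from the content profile above and the stated ~15 documents per second:

```python
ITEMS = 51_000_000   # total items in the test corpus
DOCS_PER_SEC = 15    # approximate full-crawl throughput from the quote

# Full crawl: items / throughput, converted from seconds to days
full_crawl_days = ITEMS / DOCS_PER_SEC / 86_400
print(f"Full crawl: ~{full_crawl_days:.0f} days")  # ~39, in line with the quoted 35

# Incremental crawl: 2% of the corpus changed, quoted at 8-12 hours
changed = ITEMS * 0.02
for hours in (8, 12):
    rate = changed / (hours * 3600)
    print(f"{hours}h incremental -> ~{rate:.0f} docs/sec")
```

The small gap between the computed ~39 days and the quoted 35 suggests the effective crawl rate was slightly above 15 documents per second; the point is that throughput, not raw disk, is often the binding constraint.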
The long and the short of it is this: you must understand your data, use factual numbers to calculate the size of your corpus, and be prepared to size your storage accordingly. For those who take the low road, pain is sure to follow as the servers and storage systems grow exponentially – almost out of control.