Lightning FAST Enterprise
Searches in Sharepoint 2010
Gustavo Vélez
LIGHTNING FAST ENTERPRISE SEARCHES IN SHAREPOINT 2010
© 2012 KRASIS CONSULTING, S. L. www.campusmvp.net
ALL RIGHTS RESERVED, NO PART OF THIS BOOK MAY BE REPRODUCED, IN ANY
FORM OR BY ANY MEANS WITHOUT PERMISSION IN WRITING FROM THE PUBLISHER
ISBN ELECTRONIC FORMAT: 978-84-939659-1-4
Acknowledgments
In recognition and appreciation, I would like to acknowledge the people involved in
making this project possible. Juan Carlos Gonzalez Martin of the Centro de Innovación
en Integración (Integration and Innovation Center, CIIN, http://www.ciin.es, Cantabria,
Spain) and Fabian Imaz of Siderys (http://www.siderys.com, Montevideo,
Uruguay), both SharePoint MVPs, recognized SharePoint experts and amazing friends,
agreed to read the text and sift out errors and inconsistencies. And Jose Manuel
Alarcón, editor at Krasis, ensured the book's publication and had the patience to
hear my complaints throughout the writing of the book and to wait until all problems
were solved.
Gustavo Vélez
Table of Contents
ACKNOWLEDGMENTS
TABLE OF CONTENTS
PREFACE
CHAPTER 1: INTRODUCTION
  1.- Search in the IT world
  2.- Short history of FAST
  3.- Positioning of FAST in the Microsoft stack
    3.1.- Windows, SQL, SharePoint, SCOM
    3.2.- Microsoft Search Products
  4.- Some Important Documentation
CHAPTER 2: FAST IN THE CONTEXT OF SEARCH
  1.- Goals of search
  2.- Internet search vs. Enterprise search
  3.- Search terminology and concepts
  4.- FAST versions
CHAPTER 3: ARCHITECTURE AND DESIGN
  1.- Conceptual Design
  2.- Logical Design
  3.- Physical Design
    3.1.- Extra-Small Farm
    3.2.- Medium Farm
    3.3.- Large Farm
    3.4.- Extra-Large Farm
    3.5.- Virtualization of FAST farms
CHAPTER 4: FAST REQUIREMENTS AND INSTALLATION
  1.- Requirements
    1.1.- Hardware Requirements
    1.2.- Software Requirements
    1.3.- Environment preparation
  2.- Manual Installation
    2.1.- Prerequisites
    2.2.- Software Installation
    2.3.- Post-Setup Configuration
      2.3.1.- Single-server FAST Post-Setup Configuration
      2.3.2.- Farm FAST Post-Setup Configuration
  3.- Scripted Installation
    3.1.- Prerequisites
    3.2.- Software Installation
    3.3.- Post-Setup Configuration
CHAPTER 5: CONFIGURATION AND ADMINISTRATION
  1.- Configuration
    1.1.- SharePoint 2010 Content Search Service Application
    1.2.- SharePoint 2010 Query Search Service Application
    1.3.- Certificates
      1.3.1.- Create a new Content Self-Signed Certificate
      1.3.2.- Replace a Content Self-Signed Certificate with a CA Certificate
      1.3.3.- Query Certificate
      1.3.4.- Certificate for Security Trimming
    1.4.- SharePoint 2010 Site Collection Configuration
    1.5.- Content Indexing
  2.- Administration and Configuration with PowerShell
    2.1.- Administration cmdlets
    2.2.- Index Schema cmdlets
    2.3.- Installation cmdlets
    2.4.- Spell Tuning cmdlets
    2.5.- Security cmdlets
  3.- Administration
    3.1.- SharePoint 2010 Central Administration
      3.1.1.- General Administration
      3.1.2.- Crawling Administration
      3.1.3.- Query Administration
      3.1.4.- Reporting and Reporting Administration
    3.2.- Administration Command-line Tools
    3.3.- Backup and Recovery
      3.3.1.- Backup and Restore Prerequisites
      3.3.2.- Configuration Backup and Restore
      3.3.3.- Full Backup and Restore
  4.- Monitoring
    4.1.- FAST Logs
    4.2.- WMI for monitoring
    4.3.- Performance Counters for monitoring
    4.4.- Monitoring Command-line Tools
CHAPTER 6: USER INTERFACE
  1.- FAST Search Center
  2.- WebParts for SharePoint 2010
    2.1.- Search WebPart Gallery
      2.1.1.- Search Box WebPart
      2.1.2.- Core Results WebPart
      2.1.3.- Refinement Panel
    2.2.- Customizing Search WebParts
      2.2.1.- XSLT Transformations
      2.2.2.- Properties Manipulation
    2.3.- Customizing Non-sealed Search WebParts
CHAPTER 7: PROGRAMMING
  1.- Working with the Search API
    1.1.- Administrating FAST Programmatically
    1.2.- Querying FAST Programmatically
      1.2.1.- The Federated Object Model
      1.2.2.- The Query Object Model
      1.2.3.- The Query WebService
    1.3.- Content API for FAST
  2.- Customize The Content Pipeline with the Extensibility Stage
    2.1.- Crawled properties, Managed properties and Crawled property categories
    2.2.- Creating the Logic of the Pipeline Extension
    2.3.- Configuration of the Pipeline Stage
  3.- Adding custom Refinement Panels
  4.- Building Search WebParts providing FQL capabilities
INDEX
Preface
FAST is the Enterprise Search solution from Microsoft, and it is quickly taking a
very important role in the company's enterprise server offering. With its integration
in the SharePoint 2010 family, FAST offers a scalable, flexible and powerful search
server that not only contends with similar commercial software but can pick
up the gauntlet and easily surpass any other product.
This book is aimed at technical audiences that need to design, install, configure
and customize a FAST Search implementation. More general themes are handled in the
first chapters: wide-ranging information about search, the past and future of search, a
short history of FAST and explanations of the very specific definitions and concepts
used by search engines. Because search is intimately related to human linguistics and
how people organize information, special attention has been given to how the internal
algorithms can be interpreted from an information technology perspective, not from a
purely technical point of view.
Installation and configuration are covered in the following chapters. Although the
installation procedure follows the traditional friendly installation routines of all
Microsoft products, there are some important aspects that must be taken into
consideration, especially for an enterprise FAST farm. The different configuration options
(SharePoint Central Administration, SharePoint Site Collection Administration, FAST
Object Model and FAST PowerShell console) are reviewed to explain the several
available ways to adapt the system to the enterprise requirements.
Finally, the default Search User Interface is assessed. The SharePoint Search
WebParts can be used by both SharePoint Enterprise Search and FAST; the
different WebParts are analyzed and their configuration and customization possibilities
are described because they are the main components that day-to-day users will
experience.
Customization of the core search engine is one of the points that make FAST
different from the SharePoint Enterprise Search engine. In the current FAST version
most customizations take place by modifying XML files, but some
programming is allowed and sometimes indispensable to ensure FAST behaves as
required. The last chapter of the book deals with programming and customizing the
engine and is mainly oriented to developers.
All in all, the book offers a 360-degree view of FAST and is intended to be a
reference work for people who are curious about FAST and for those who must
deal with the server for the first time.
And remember: if you cannot find it, it doesn't exist...
Gustavo Vélez
CHAPTER 1: INTRODUCTION
Wikipedia defines "Search" as "software for finding information": that is a
short, concise definition of something that is becoming indispensable in our
information-driven society; namely, how to discover the necessary data and distinguish
relevant from irrelevant material.
Search is at this moment one of the most important components of
any information system. Because computer systems are able to generate huge amounts
of information, every day it becomes more difficult (and expensive) to reach the
appropriate information. Search technologies enable us to work in a smarter way,
reusing the data that already exists.
On the other hand, our society is also becoming more and more addicted to and
dependent on information search technologies, making the knowledge society reliant
on search services and their quality to work correctly; in other words, if you
cannot find it, it doesn't exist.
1.- SEARCH IN THE IT WORLD
Search is not a new issue in the IT world. Ever since computers began storing data
electronically, it has been necessary to get the correct records back. Theoretical work
started as early as 1945, when Vannevar Bush, Director of the Office of Scientific
Research and Development in the USA, stressed at the end of the Second World War the
necessity of creating an information device (which he called a "Memex") to provide a
memory storage and retrieval system that would be flexible, associative and without limits.
Gerard Salton from Harvard University is considered the father of modern
search technologies. With the publication of his book "A Theory of Indexing", where
base concepts such as Document Frequency, Term Frequency, Term Discrimination and
Relevancy were defined, the mathematical and theoretical foundations for search
algorithms found their place.
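The base concepts mentioned above can be illustrated with a minimal sketch. It assumes
the common textbook TF-IDF weighting (term frequency multiplied by the logarithm of the
inverse document frequency) and naive whitespace tokenization; the function name tf_idf
and the sample documents are hypothetical, and nothing here is meant to describe the
formulas actually used by FAST or any other engine.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF, an assumption)."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document Frequency: the number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    weights = []
    for tokens in tokenized:
        # Term Frequency: how often each term appears in this document.
        tf = Counter(tokens)
        weights.append({
            # Terms that are frequent here but rare elsewhere get a high weight,
            # which is the idea behind Term Discrimination.
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights


# Hypothetical sample documents.
docs = ["fast search server", "fast food", "enterprise search engine"]
weights = tf_idf(docs)
```

In the second sample document, "food" receives a higher weight than "fast" because
"fast" also appears in another document and therefore discriminates less.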
With the creation of ARPANet in 1969 and the start of the Internet as we currently
know it in 1993, the need for a search mechanism became urgent. In 1990 the first
search engine was created: Archie, written by Alan Emtage, a student at McGill
University in Montreal. Archie was merely a script-based data collector that used regular
expressions to retrieve file names matching the user queries.
Because Archie was a big success, new search systems started to appear to fill the
gaps it left. Veronica, created at the University of Nevada, served the same purpose
as Archie but was also able to index the content of plain text files. A short time later
Jughead, a clone of Veronica, was created with a more advanced user interface. At this
time Gopher and FTP were the main transfer protocols in use and ARPANet was principally
an academic initiative.
On August 6 1991 Tim Berners-Lee at CERN published the first page using the
WWW protocol; at the same time, the Virtual Library (http://vlib.org/, still in existence),
the first and oldest catalog of sites, went online. Very soon the first crawlers were
implemented, and in June 1993 Matthew Gray presented the "World Wide Web
Wanderer", initially to measure the number of active web servers, but soon giving rise
to "Wandex", the first database created to capture URLs.
By the beginning of 1994 the Internet had three search engines: "World Wide Web
Worm", JumpStation and RBSE (the "Repository Based Software Engineering"
spider). The only one that had a ranking mechanism was RBSE. The other two listed
their results as they were found, without any discrimination, making them impossible to
use when the WWW grew exponentially. In 1993 Excite was also born, the first search
engine that used statistical analysis and word relationships to improve the search
mechanism. Excite was a huge success and was sold in 1999 for $6.5 billion (and sold
again in 2001 for $10 million, after the Internet crash).
1994 also saw the birth of Yahoo! (David Filo and Jerry Yang), initially as a
collection of web pages; shortly after its creation it made the jump to
commercialization in the model that we know today. Lycos, Hotbot and Altavista
went online around the same time, marking the change from web page catalogues to
crawled search mechanisms and allowing new technologies such as natural language
queries. All of these search engines eventually became irrelevant for technical,
financial and management reasons.
Finally, in 1998 Larry Page and Sergey Brin launched Google at Stanford
University, based on their earlier work BackRub. The same year Microsoft put MSN
Search online, and in 2009 Microsoft announced Bing, built on its own search
technologies (MSN Search was based mainly on Yahoo!, Overture, Looksmart and
Inktomi).
Although web search is very important, enterprise search occupies a prominent
role in the search market. Currently all the big software companies (IBM, Google,
EMC, SAP and, of course, Microsoft) have one or more enterprise search offerings.
The similarities and differences between web and enterprise search are analyzed in
more technical detail in the second chapter.
Search seems to be a static world from the perspective of the users, but it is a
very dynamic world from the technical and business perspective. On the web search
front, the battle between Google and Bing is becoming legendary: the
underdog against the huge establishment. On the enterprise search front, the roles are
more equally distributed, with FAST gaining momentum especially because of
the growing influence of Microsoft in the business world.
Technically the future is completely open. Currently search is still essentially about
finding primary topics or noun phrases: a person's name, a city, a product and so on.
The future of search should be about finding verbs, what Microsoft calls the "decision
engine" (as opposed to "search engine"): search will try to give the user the knowledge
to complete tasks, doing the initial computational discerning automatically.
New classes of information are also becoming more important: social network
data, for example, or location data, and the interconnection between all layers of
information. Currently a user normally searches for a term or a number of terms, such as
"fast search", and the result is a mash-up of information that has something to do with
"fast" (any kind of fast: fast food, fast cars, FAST search) and "search". As search engines
become more "intelligent", they should add other layers of information, for example
the kind of user ("user is an IT pro") and his current geographical location ("user is at
the office"), and filter the results to show a much more consistent and usable set of
information. Additionally, the search engine could prepare the information in report
form, delivering it directly in Word format, for example. The search engine should become
part intelligent software and part assistant, and less of a mere information reader.
The progression of search is from mere data to useful information to knowledge that
answers questions.
2.- SHORT HISTORY OF FAST
Until Google provoked a landslide in the search world in 2002, Microsoft was not
really aware of the importance of search for the IT industry. Until then, Microsoft used
different third-party technologies for web search and had one "enterprise" search
engine used locally for Windows and for some of its servers (namely Search Server for
SharePoint). Gartner's Magic Quadrant for Information Access reflects this position in
its 2006 report, as Figure 01 shows: Microsoft is nowhere to be found in the diagram.
Figure 01.- Gartner Magic Quadrant for Information Access 2006
In 2006 Microsoft stepped up its Search strategy for the following
years, announcing that search would be of vital importance for the company and all its
servers. Three years later, the Gartner Magic Quadrant showed a very
different panorama, as seen in Figure 02: Microsoft is in the most important part of
the diagram, the "Leaders" quadrant. That was possible thanks to the acquisition of
FAST in 2008.
Figure 02.- Gartner Magic Quadrant for Information Access 2009
From that date until the present day, Google and Microsoft FAST have remained
in approximately the same position in the quadrants. Google remains the leading player in
the web search market, and Microsoft is busy converting the base
technologies originally used by FAST to Microsoft technologies and integrating FAST
in the Microsoft Stack, principally SharePoint.
FAST was originally a Norwegian company focused on enterprise data search
technologies and their application. Microsoft bought the company on April 24 2008.
FAST was born at the desks of the Department of Computer and Information Science
of the Norwegian University of Science and Technology (NTNU) in 1997 and
launched the first version of the engine in 1999. Initially FAST had versions for web
and enterprise search, but in 2003 the company decided to focus exclusively on
enterprise search.
At the beginning of 2004 FAST launched the FAST Enterprise Search Platform
(FAST ESP). The next year FAST established its reputation in the enterprise search world
as probably the best and technologically most advanced engine in the market, and FAST
appeared in the "Leaders" quadrant of the Gartner Magic Quadrant for Information
Access Technologies for a number of years in a row. Nevertheless, FAST was almost
never financially profitable and legal problems troubled the company continuously,
culminating in the suspension of trading of FAST shares on the Oslo Stock Exchange in
December 2007.
On January 8, 2008 Microsoft announced the acquisition of FAST Search & Transfer
for $1.2 billion, creating a separate division in the company to house FAST.
FAST ESP was probably the technological leader of enterprise search engines,
offering Contextual Insight (a group of technologies that add linguistic and statistical
analytics to improve search precision), semantic indexes (to recognize and retain the
inherent structure of documents), entity metadata, taxonomic navigation, faceted
browsing and entity discovery (to extract textual entities from the results of previous
searches), among other advances.
Originally, FAST ESP was a platform-agnostic system: it could be installed on
Windows, Unix and Linux systems, 32-bit and 64-bit, and it was written in Java, PHP and
Python. It had its own proprietary administration interface, user interface, alerting
system, connector mechanism and various other subsystems, but it was possible to integrate
the query and results in SharePoint 2007 using WebParts. Since FAST was bought by
Microsoft, the main change in the server has been the attempt to integrate its code base
with the Microsoft toolset and make it work smoothly with the rest of the Microsoft
Stack. That means Java and Python code has been changed to Microsoft .NET
compatible technologies, SQL is used extensively and SharePoint 2010 is becoming the
default interface.
3.- POSITIONING OF FAST IN THE MICROSOFT STACK
Although Microsoft currently has different search engines and versions, FAST is its
most powerful engine and the preferred enterprise offering. As with each of its enterprise
servers, FAST is part of an ecosystem and cannot work as a stand-alone product.
FAST relies on Windows as Operating System, SQL Server as its repository
mechanism and SharePoint as user and administration web interface. Besides that,
products such as Microsoft System Center Operations Manager (SCOM) can be used to
monitor the availability, performance, configuration and security of FAST, and
Microsoft Forefront Threat Management Gateway (Forefront TMG) would be
necessary to protect FAST from outside threats. Other Microsoft products such as IIS
could be necessary as underlying services for one or more of the servers.
3.1.- Windows, SQL, SharePoint, SCOM
Originally, FAST was developed as an agnostic system that could be installed on
Windows, Unix or Linux systems. Being the key Microsoft search technology means
that it is now specifically targeted at Windows, namely
Windows Server 2008 (64-bit) and up. FAST 2010 can be installed only under
Windows as Operating System (FAST ESP is considered legacy software and is no longer
supported). FAST doesn't demand special conditions from the Operating System; the
requirements are more hardware oriented, as will be explained in the design chapter.
An SQL Server is required by FAST to maintain the configuration information.
SQL Server 2008 and up (64-bit) can be used, and FAST should require only a modest part
of the database server's performance and capacity. The data necessary for indexing is
not stored in the database.
SharePoint 2010 is the User Interface and Administration Interface of FAST and is
therefore necessary to run FAST properly; but FAST is independent of SharePoint and
could (and should) be installed on separate servers. Both standalone and farm
installations of SharePoint 2010 can be used, and an Enterprise license of SharePoint
2010 is indispensable. If document preview is desired, to see thumbnails of Microsoft
Office Word and PowerPoint documents in the search results from FAST, Microsoft Office
Web Apps must be installed on the SharePoint servers.
SCOM is not required for the normal working of FAST, SQL or SharePoint, but as
Microsoft's strategic operations management product, SCOM is the recommended
monitoring solution for FAST. FAST supports a number of monitoring services that
provide data using standardized Windows interfaces; SCOM can consume this data,
giving the required protection from the operations perspective.
3.2.- Microsoft Search Products
Currently as of anno 2011, Microsoft have a variety of search offers, varying from
the low-cost/low-functionality of Search Express to the high-end FAST:
Microsoft SharePoint Foundation 2010 Search. Integrated in SharePoint
Foundation 2010 allows search scoped to single SharePoint Site
Collections and it cannot crawl external data sources. It has no
Introduction 17
administration User Interface and all the configurations happen
automatically. Scales to approximately 10 million items (using SQL
Server) for each search server
Microsoft Search Server Express. Free product that allows search over
enterprise content. Can crawl external data sources (web sites, file shares,
Exchange, Lotus Notes) and can federate query results from any
OpenSearch system. Deployment is limited to one server and can use SQL
Server Express (300.000 search items) or SQL Server (10 million search
items)
Microsoft Search Server 2010. Provides almost the same search
functionality of Microsoft SharePoint Server 2010 and can be deployed
across multiple servers for redundancy and increase of capacity and
performance. Supports multiple crawl servers and query servers and scales
to approximately 100 million items
Microsoft SharePoint Server 2010. The search engine embedded in
SharePoint Server 2010, making use of all social networking and managed
taxonomy features of SharePoint: indexing of people Profile database,
search in MySites, takes advantage of user-generated tags, managed
taxonomy to influence ranking, etc. Scales to approximately 100 million
items and can be installed in multiple servers and be used in multi-tenant
hosting environments
Microsoft FAST Search Server 2010 for SharePoint. Includes all the search
features of the other Microsoft search systems (except the Social features
of SharePoint 2010) adding almost unlimited scalability and performance.
Content processing is much more flexible and customizable. FAST consists
of three different versions:
o FS4SP: FAST Search for SharePoint 2010, the version packaged
for the SharePoint 2010 environment
o FSIA: FAST Search for Internal Applications aimed at
organizations that must crawl internal content
o FSIS: FAST Search for Internet Sites that allows crawling of
online information
It is important to note that the differences between the versions are
merely a licensing issue: the kernel engine and functionality are the same
in all versions. FAST ESP is considered legacy software and is no longer
available, but customers who currently have Maintenance & Support
contracts can upgrade to either FSIA or FSIS.
Although little is known about the technology behind Bing, the web search engine
of Microsoft, it is indisputable that some aspects of Bing are directly related to FAST.
MSN Search (and later Windows Live Search), the web search engine before Bing,
used a mix of technologies from AltaVista, Yahoo! and Inktomi. Bing offers query
suggestions and related searches based on semantic technology from Powerset, a
company purchased by Microsoft in 2008, but its search algorithms are proprietary
and closely guarded.
Which is the right choice, SharePoint Search or FAST Search, is always a difficult
question. The answer depends on the data landscape of the organization and on the
functional needs of its users. Two criteria quickly differentiate the products: the
required search capacity and the need for customization. SharePoint Search has a
theoretical limit of 100 million search items, but the practical limit is likely to be
much lower. The theoretical limit of FAST is 500 million items, but with the right
hardware and topology it can go beyond this figure.
Customization is the second criterion, and it may be the most important. The
search engine of SharePoint cannot be modified or adapted, which means that the
ranking mechanism and the indexing and querying tools cannot be changed. FAST
allows many customizations, making it much more flexible and adaptable to
enterprise requirements.
In any case, choosing between SharePoint Search and FAST must follow the
indispensable steps of any design: gain a full understanding of the business
requirements, understand the data in terms of quality, format and volume,
and capture the search needs of the customer. The analysis of these factors should
indicate the right technology to use and provide estimates of the costs, the risks
and the cost of ownership.
4.- SOME IMPORTANT DOCUMENTATION
Information about FAST is becoming better and more accessible. Because of the
closed nature of FAST ESP, the FAST version that existed before Microsoft bought
the company, it was nearly impossible to find any kind of information about
installation, configuration, programming or use. Since the release of the latest
version together with SharePoint 2010, the flow of information from Microsoft has
improved considerably. The following list of documents is short, but it represents
the most important information that Microsoft has published about FAST for
SharePoint 2010. The list is limited to official Microsoft information, although more
and more independent information is slowly appearing on the Internet.
FAST Software Development Kit (SDK) – Probably the most important technical
document about FAST. Several parts of this book are based on the information
in the SDK, especially the configuration chapter. The SDK can be found
online in the TechNet Library
(http://technet.microsoft.com/en-us/library/ee781286.aspx)
Microsoft FAST Search Server 2010 for SharePoint Enterprise Search Evaluation Guide - This evaluation guide is designed to give business decision makers
and IT professionals an understanding of the design goals and the details of the
enterprise search features provided by Microsoft FAST Search Server 2010 for
SharePoint
(http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=24972)
FAST Search Server 2010 for SharePoint Capacity Planning - This white paper
describes the performance and capacity characteristics of FAST Search Server
2010 for SharePoint and how they were tested by Microsoft
(http://www.microsoft.com/downloads/details.aspx?FamilyID=65B799E3-825C-4398-8CD7-3311D3297997&displaylang=e&displaylang=en)
Download Microsoft FAST Search Server 2010 for SharePoint Trial – 120 days
Trial version of FAST (http://technet.microsoft.com/en-us/evalcenter/ee424282)
CHAPTER 2: FAST IN THE CONTEXT OF SEARCH
Search is intrinsic to human nature: humans are searching continuously and, as a
consequence, the concept of search is intuitively recognized. The term search is related
to the process of finding solutions to as yet unsolved problems. In computing, search is
used almost as generally as in the human context: each algorithm searches for the
completion of a given task.
1.- GOALS OF SEARCH
Search has been an important part of computing since its very beginning, as the core
technique to solve problems. In general, search can be applied to many problems, from
playing games to industrial route planning. Chess has an estimated search space of
about 10^44 possibilities, yet the chess program "Deep Fritz" was able to find the moves
needed to win against the human world champion in 2006, evaluating about 10 million
options per second. Many industrial route planning systems use search to answer
shortest- and quickest-route queries in a fraction of the time that other algorithms
need.
Search algorithms can be used to solve sequence alignment problems in biology
optimally, to guide industrial robots in unknown environments or to find bugs in
software (using search patterns very similar to those used to find the most successful
strategy to win in chess). In summary, search and search algorithms are used
extensively in the real world, although the scope of this book is limited to searching
for information in data stored on computers.
2.- INTERNET SEARCH VS. ENTERPRISE SEARCH
For the purpose of searching information, a division is traditionally made between
Internet search and Database search. The difference lies mainly between unstructured
information (information whose items have no formal relationship to each other) and
information that has a more relational character. Lately, however, it has become clear
that both concepts are merging into something more general: Enterprise search.
Internet search is designed to go across web pages and documents, looking for new
and changed information and indexing everything it can find. The engines for this
kind of search follow a process (a "Pipeline") that goes from crawling to discover the
sources of information, through indexing their content in a structured way (in a
database, XML files or any other form), to the mechanism that resolves the user
queries and delivers the results.
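The three steps of such a pipeline can be illustrated with a minimal sketch. It assumes a tiny in-memory "web" (the PAGES dictionary, its addresses and its links are invented for the example) and hypothetical function names; it is not a description of how any particular engine is implemented.

```python
from collections import defaultdict, deque

# Hypothetical in-memory "web": address -> (text, outgoing links).
PAGES = {
    "home": ("enterprise search overview", ["docs", "news"]),
    "docs": ("search configuration guide", ["home"]),
    "news": ("product news and updates", []),
}

def crawl(start):
    """Step 1: discover the sources by following links from the start address."""
    seen, queue, found = {start}, deque([start]), {}
    while queue:
        address = queue.popleft()
        text, links = PAGES[address]
        found[address] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return found

def build_index(documents):
    """Step 2: store the content in a structured form (term -> addresses)."""
    index = defaultdict(set)
    for address, text in documents.items():
        for term in text.lower().split():
            index[term].add(address)
    return index

def query(index, terms):
    """Step 3: resolve a user query (addresses that contain every term)."""
    hits = [index.get(term.lower(), set()) for term in terms.split()]
    return sorted(set.intersection(*hits)) if hits else []

index = build_index(crawl("home"))
print(query(index, "search"))
print(query(index, "search guide"))
```

A real engine adds many stages between these steps (format conversion, language detection, ranking), but the order crawl, index, query stays the same.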
Database search is a mechanism integrated in most modern databases (see for
example the "Full-Text Search" functionality of Microsoft SQL Server). Technically
speaking, this search mechanism also needs an index, built on one or more columns
of a table. The databases that allow this type of search also provide
language-specific linguistic components, including word breakers, stemmers and
thesaurus files, and allow queries with full-text predicates such as "CONTAINS"
or "FREETEXT", so that the user can perform a variety of searches (for example, for a
single word or for a phrase).
Most of the search products currently in use are large Internet search engines,
optimized to crawl web pages and documents, that also use the capabilities of
database search engines; this combination is what makes them Enterprise search
engines. They can search both structured and unstructured data sources: web pages
and documents are discovered, crawled and indexed in separate indexes. Search
results are generated on the fly by querying the indexes in parallel and organizing
the results according to predetermined rules.
One further difference between web page search engines (like Bing or Google)
and document-oriented search engines lies in the capabilities and speed of document
indexing: the second type is built and optimized to open and understand the
structure of documents natively (using iFilters, for example). On the other hand, the
crawlers of Internet search engines differ considerably from those of document
search engines, because they must travel across a completely different
environment (IP addresses and web technologies).
A main point of distinction is the Relevance of search results. The relevance
factors available to Enterprise search are very different from those applied in
Internet search. Enterprise search cannot take advantage of the very rich link
structure found in the hyperlinked content of the web. Algorithms that exploit the
hyperlink structure to build rankings are better suited to that task than the
query-independent factors used by Enterprise search, such as document date or
popularity.
3.- SEARCH TERMINOLOGY AND CONCEPTS
Search engines have their own concepts and terminology. Because search has a
very strong language component, almost all the terminology is borrowed from the
fields of linguistics and philology.
Authoritative Page – A page designated as more relevant than other pages (for
example the home page of the intranet of an organization). The higher the
authority level assigned to a page, the higher its ranking in the search results.
Best Bets – Hand-created list of keywords for common queries that can
dramatically improve the search experience, particularly on information-rich sites such
as intranets. Best Bets are presented prominently at the beginning of the search results,
followed by the rest of the matching pages. Implementing Best Bets is an effective way
to improve the quality of search results.
Content Source – A set of options that describes specific content to be crawled,
including its start addresses. A Content Source for SharePoint can contain up to 500
start addresses.
Crawl – The methodical and automated process that search engines use to find
information. Crawlers are the computer programs in charge of the crawling. Crawlers
mainly create a copy of all the visited web pages and documents for later processing
by the search engine, which indexes them to make them searchable. Crawlers are
also often used to automate other tasks, such as link and source code (HTML)
validation, and to gather specific types of information, such as e-mail addresses. The
crawlers are responsible for the freshness of the information that the search engine
can use. Because crawling a huge amount of information can take weeks or months,
many events (creation, update or deletion of information) may have happened by the
time a crawler has finished its crawl; for the search engine there is a cost associated
with not detecting these events and keeping outdated copies of the information.
Crawlers can also have an impact on the servers that hold the information: if the
crawlers request huge numbers of web pages or documents from a system, they can
cripple the performance of its servers.
Generally speaking, the architecture of web search crawlers (Yahoo! Slurp from
Yahoo Search, Googlebot from Google or Bingbot from Bing) is largely unknown,
but the crawlers for Enterprise search are well documented (the FAST crawler, for
example). Crawl Queue refers to the data structure that stores the list of items to be
crawled. A Crawl Rule is a set of preferences that applies to a specific Content Source
and is used to include or exclude items in a crawl. A Crawled Property is a type of
metadata that can be discovered during a crawl, applied to one or more items and
promoted to a Managed Property. A Managed Property is a specific property in
the metadata schema that can be made available for queries.
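The interplay between a Crawl Queue and Crawl Rules can be sketched as follows. The rule format (wildcard pattern plus an include flag, first match wins), the addresses and the function names are assumptions made for the illustration and do not reproduce the configuration model of SharePoint or FAST.

```python
from collections import deque
from fnmatch import fnmatchcase

# Hypothetical crawl rules: (pattern, include?) pairs, first match wins.
CRAWL_RULES = [
    ("http://intranet/private/*", False),  # exclude this branch
    ("http://intranet/*", True),           # include the rest of the intranet
]

def is_allowed(address, rules=CRAWL_RULES):
    """Apply the first matching rule; items that match no rule are skipped."""
    for pattern, include in rules:
        if fnmatchcase(address, pattern):
            return include
    return False

def run_crawl(start_addresses, links):
    """Process the crawl queue and return the crawled addresses in order."""
    queue = deque(a for a in start_addresses if is_allowed(a))
    crawled = []
    while queue:
        address = queue.popleft()
        if address in crawled:
            continue  # already visited through another link
        crawled.append(address)
        # Newly discovered items join the queue only if the rules allow them.
        queue.extend(l for l in links.get(address, []) if is_allowed(l))
    return crawled

# Invented link structure standing in for real content.
links = {
    "http://intranet/home": ["http://intranet/hr", "http://intranet/private/pay"],
    "http://intranet/hr": ["http://intranet/home", "http://external/site"],
}
print(run_crawl(["http://intranet/home"], links))
```

The start addresses play the role of the Content Source; the excluded and external addresses never enter the queue, so they are never requested from the servers.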
Duplicate and Duplicate Result Removal – Refers to identical or near identical
content that should be removed from the search results.
Entity Extraction – Seeks to locate and classify elements in text into predefined
categories such as people's names, organization, location, expressions of times,
quantities, monetary values, percentages, etc.
Faceted Search – It is a filtering technique to access collections of information
represented using classifications with some common significance. Allows users to
narrow down the search results. Also known as Navigators or Refiners.
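A small sketch can show what a Refiner does with a result set. The result records, field names and function names below are invented for the illustration.

```python
from collections import Counter

# Hypothetical search results with classification metadata.
results = [
    {"title": "Budget 2011", "type": "xlsx", "author": "Ana"},
    {"title": "Budget notes", "type": "docx", "author": "Ana"},
    {"title": "Budget review", "type": "docx", "author": "Luis"},
]

def facets(items, field):
    """Count how many results fall under each value of a field."""
    return Counter(item[field] for item in items)

def refine(items, field, value):
    """Narrow the result set to the items with the chosen facet value."""
    return [item for item in items if item[field] == value]

print(facets(results, "type"))
narrowed = refine(results, "type", "docx")
print([item["title"] for item in narrowed])
```

The counts are what a user sees next to each Refiner value; choosing a value applies the filter to the current results without running a new keyword query.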
Federation – Allows simultaneous search of multiple searchable resources. A
Federation establishes a collaborative link between different search systems,
allowing one system to query other search engines without the need to maintain
indexes of the external systems, arranging the results from the various sources into a
useful form and presenting them to the user. When the search data model of the search
system is different from the data model of the foreign target system, the query must
first be translated and the credentials of the user must be passed on to maintain the
appropriate security. On the return side, the results need to be mapped back from the
foreign system to the form of the search engine before they are rendered to the user.
Scalability and performance are always a source of concern in Federation: the query
performance and the quality of the results are totally dependent on the foreign search
engine.
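The translation and mapping steps of a Federation can be sketched as follows. Both "engines" are stand-in functions with invented data and formats, and the passing of user credentials is left out; the sketch only shows the query going out in the foreign format and the results coming back in a common form.

```python
from concurrent.futures import ThreadPoolExecutor

# Two hypothetical search systems with different query and result formats.
def local_engine(query):
    data = {"fast": [{"title": "FAST overview", "score": 0.9}]}
    return data.get(query, [])

def foreign_engine(request):
    # The foreign system expects {"q": ...} and returns (name, rank) tuples.
    data = {"fast": [("FAST pricing", 2), ("FAST history", 1)]}
    return data.get(request["q"], [])

def query_local(query):
    return [dict(hit, source="local") for hit in local_engine(query)]

def query_foreign(query):
    """Translate the query, call the foreign system, map the results back."""
    raw = foreign_engine({"q": query})
    # Assumption for the example: rank 1 is best, so score = 1 / rank.
    return [{"title": name, "score": 1.0 / rank, "source": "foreign"}
            for name, rank in raw]

def federated_search(query):
    """Query both systems at the same time and merge into one result list."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(f, query) for f in (query_local, query_foreign)]
        merged = [hit for future in futures for hit in future.result()]
    return sorted(merged, key=lambda hit: hit["score"], reverse=True)

for hit in federated_search("fast"):
    print(hit["source"], hit["title"], hit["score"])
```

No index of the foreign content is kept locally, which is why the response time and the quality of the merged list depend on the slowest and weakest source.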
High Confidence Property – A Managed Property identified as a good indicator of a
highly relevant item.
iFilter – An iFilter is a translator that teaches the search engine the structure of
documents to be indexed. Without an appropriate iFilter, contents of a file cannot be
parsed and indexed by the search engine. Windows Indexing Service, MSN Desktop
Search, Internet Information Server, SharePoint Server, Site Server, Exchange Server,
SQL Server and all other products based on Microsoft Search technology support
indexing technology based on iFilters.
Index – Indexing is the process of extracting information from the original data
source and saving it in a format that the search engine can understand. The index is
structured in such a way that the engine can quickly find the information that contains
a particular term. Indexing can be a complex process that uses a lot of the resources of
the search servers. During indexing, not only are the constituent words of the source
extracted, but the language, the boundaries of sentences and paragraphs, changes in
case and the stems of the words are also determined. Normally the indexing process is
continuous, so that the complete index is refreshed frequently. For Internet search it is
usual to limit the amount of information indexed for each page, with an algorithm
deciding which sections of the page are relevant enough to be indexed (to prevent
overload of the web servers that host very large pages, such as technical manuals).
For Document search, by contrast, it is important to index as much information as
possible, and the limit (if one exists) is normally very high.
Information Extraction – The study that attempts to identify semantic structures in
order to excerpt relevant data. It describes the techniques to develop systems to index
and search vast amounts of data effectively. The goal is to automatically extract
structured information from unstructured documents.
Inverse Document Frequency (IDF) – A measure of how rare a term is in a
collection of documents, calculated by dividing the total collection size by the number
of documents containing the term (in practice the logarithm of this ratio is normally
used). Common terms ("the", "and" etc.) have a very low IDF and are often excluded
from searches. These low IDF words are commonly referred to as "stop words".
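The calculation can be illustrated with a short sketch that uses the common logarithmic form of IDF. The four sample documents are invented, and returning zero for a term that appears nowhere is an assumption of the example, not part of the definition.

```python
import math

# Invented sample collection.
documents = [
    "the crawler reads the page",
    "the index stores the terms",
    "the ranking uses the index",
    "entity extraction finds names",
]

def idf(term, docs):
    """log(collection size / number of documents containing the term)."""
    containing = sum(1 for doc in docs if term in doc.split())
    if containing == 0:
        return 0.0  # assumption: a term seen nowhere gets no weight
    return math.log(len(docs) / containing)

for term in ("the", "index", "crawler"):
    print(term, round(idf(term, documents), 3))
```

The common word "the" receives the lowest value and the rare word "crawler" the highest, which is why a low IDF is a practical signal for identifying stop words.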
Keyword – A word used in a query. In web search, Keywords are targeted based on
what users are looking for in the HTML of the pages. In Enterprise search, Keywords
can be configured to target terms that are relevant to the specific company.
Keyword Density – The percentage of the total number of words in a document that
are keywords. The ranking is based on (amongst many things) the percentage of words
on a page that are similar to the words used in the query.
Latent Semantic Indexing (LSI) – Also known as Latent Semantic Analysis. It is
an indexing technique that moves a search engine from purely lexical matching to
semantic matching. It uses a mathematical technique (Singular Value Decomposition)
to identify patterns and relations between the terms contained in a text. In this way a
query can return results that do not contain the keywords searched. Search engines
are moving towards LSI to deliver results that are closer to human judgment.
Lemmatization – The process of grouping the different forms of a word so that they
can be analyzed as a single item. A lemmatization algorithm determines the "lemma"
of a word; to do so it takes into account the context of the word and its role in the
sentence. Following the example given under Stemming, "playing", "player" and "play"
should be lemmatized to the lemma "play" as well. The difference with Stemming is
that the stemmer has no knowledge of the context of the word in the sentence and
therefore cannot discriminate between words that have different meanings depending
on their position or use in the sentence. To take a different example, the word
"better" has "good" as its lemma but a completely different stem.
Lemmatization can be very difficult to implement because it is not only
language-dependent but also culture-dependent (the lemma of a word can differ
between countries that share the same language).
Link Map – A Link Map is a graph structure of the nodes connected by links in
Internet search. The map facilitates the fast access to the data, the popularity score of
the page and the ranking algorithm.
Natural Language Processing (NLP) – A system that allows search engine users
to type a question rather than keywords. At the simplest level, this can be achieved
by making the search engine remove the stop words from the question, leaving
keywords that are then processed as if they were a regular query. At the other end of
the scale are advanced systems that use statistics and linguistic analysis to accurately
match the available indexes to the user's question.
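The simplest level described above, removing the stop words from a question, can be sketched in a few lines. The stop word list is a small invented sample; real lists are language-specific and much longer.

```python
# Hypothetical, very small stop word list for English questions.
STOP_WORDS = {"what", "is", "the", "of", "a", "an", "in", "how", "do", "i", "to"}

def question_to_keywords(question):
    """Strip the end punctuation and drop the stop words from a question."""
    words = question.lower().strip("?!. ").split()
    return [word for word in words if word not in STOP_WORDS]

print(question_to_keywords("What is the capacity of FAST?"))
```

The remaining keywords can then be passed to the engine like any regular keyword query.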
Partial Word Matching – Some search engines will consider not only exact
matches, but also partial matches. This means that if the search term is contained
within a word in a document in its index, the search engine considers the document a
match. Strongly related to lemmatization and Stemming.
Phrase Search – A type of search that allows users to search for documents
containing an exact sentence or phrase, rather than single keywords. The important
point is that in a phrase search the words have to appear side by side in the document
(exactly as in the query) to be considered a match. If the words appear dispersed, or
appear side by side but in the wrong sequence, the document is not considered a
match. Phrase searching can be done on most search engines by simply enclosing the
phrase in quotation marks. Anti-phrasing refers to phrases for which there is no value
in indexing (for example “What xxx means”).
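The matching rule for a phrase search (same words, side by side, same order) can be sketched as follows. The function names are hypothetical and the tokenization is reduced to splitting on spaces; production engines normally use word positions stored in the index instead of scanning the text.

```python
def tokenize(text):
    return text.lower().split()

def phrase_match(document, phrase):
    """True only if the phrase words appear side by side and in order."""
    words, target = tokenize(document), tokenize(phrase)
    if not target:
        return False
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

text = "FAST Search Server for SharePoint"
print(phrase_match(text, "search server"))   # adjacent and in order
print(phrase_match(text, "server search"))   # adjacent, wrong sequence
print(phrase_match(text, "fast server"))     # dispersed
```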
Pipeline – Specially tailored FAST architecture to address the challenges of
flexibility versus the inherent shortcomings of any search engine. The FAST Pipeline
(format conversion, language detection, stemming, entity extraction, lemmatization)
allows the introduction of custom plug-ins (stages) to enrich the data to be indexed; for
example, the entity extractor can be programmed to recognize entities that are
important to an organization.
Polysemy – One word can have several meanings. Language-dependent and very
difficult to address in algorithms.
Precision and Recall – Two metrics strongly related to search accuracy (the
fraction of queries for which the correct result is returned). Search engines often
consider a document a match to a query when that document is not really relevant to
the query. These mistakes happen because search engines must guess what the user
means. Search engines must find a balance between recall (the ability to find all
relevant documents) and precision (the ability to find only relevant documents).
The aim is to retrieve all relevant documents and nothing else. Precision is calculated
by dividing the number of relevant pages found by the total number of pages found.
For example, in a collection of 1000 documents, if 100 documents are found and 60 of
them are relevant, the precision of the search engine is 60%. In the same example, if
the document collection contains 70 relevant documents but only 60 were found, the
Recall is 60/70 = 86% (rounded).
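The arithmetic of the example can be checked with a short sketch. It takes precision as the relevant documents found divided by all documents found, and recall as the relevant documents found divided by all relevant documents in the collection; the function names are chosen for the illustration.

```python
def precision(relevant_found, total_found):
    """Share of the returned documents that are relevant."""
    return relevant_found / total_found

def recall(relevant_found, total_relevant):
    """Share of the relevant documents that were returned."""
    return relevant_found / total_relevant

# Figures from the example: 100 documents found, 60 of them relevant,
# 70 relevant documents in the whole collection.
p = precision(60, 100)
r = recall(60, 70)
print(f"precision = {p:.0%}, recall = {r:.1%}")
```

Returning every document in the collection would push recall to 100% while precision collapses, which is the balance mentioned above.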
Promotion / Demotion – Promotion means moving a search result to the top of the
results ranking; Demotion is the opposite. Enterprise search engines always offer a
configuration for the Promotion and Demotion of terms. Internet search engines have
many different security mechanisms to prevent users from promoting or demoting
sites in an illegitimate way.
Property Extraction – Allows the extraction of language-specific properties for
names (locations, company, people).
Ranking – The ordering of the search results by Relevance, so that the most
relevant ones come first. Relevance ranking mainly refers to the different features and
algorithms used to estimate the weight of documents and to sort them appropriately.
The most basic retrieval function is a Boolean query on the occurrence of terms in the
information. For a query “word1 word2”, the Boolean AND query returns all
documents that contain both word1 and word2 at least once. These documents
represent the set of potentially relevant documents: all documents not in this set can
be considered irrelevant and ignored. This step usually reduces the number of
documents to be considered for ranking, but it does not order the documents in the
result set. After that, each document needs to be “scored”: the relevance of the
document must be estimated as a function of its relevance features. Contemporary
search engines use hundreds of features as parameters to estimate the Ranking.
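The two steps, Boolean AND retrieval followed by scoring, can be sketched as follows. The documents are invented and the single scoring feature (the share of the words in a document that are query terms) is a deliberately simple stand-in for the hundreds of features that real engines combine.

```python
# Invented sample documents.
documents = {
    "doc1": "word1 appears here with word2 and word1 again",
    "doc2": "only word1 appears here",
    "doc3": "word2 then word1",
}

def boolean_and(docs, query):
    """Keep only the documents that contain every query term at least once."""
    terms = query.lower().split()
    return {name: text for name, text in docs.items()
            if all(term in text.lower().split() for term in terms)}

def score(text, query):
    """Hypothetical relevance feature: share of words that are query terms."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split()) / len(words)

def rank(docs, query):
    """Filter with Boolean AND, then order the candidates by their score."""
    candidates = boolean_and(docs, query)
    return sorted(candidates,
                  key=lambda name: score(candidates[name], query),
                  reverse=True)

print(rank(documents, "word1 word2"))
```

Here doc2 is discarded by the Boolean step because it lacks word2, and only the two remaining documents are scored and ordered.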
Relevance – How closely the search results that are returned to the user match what
the user wanted to find. Ideally, the results that are returned at the top are the most
relevant: the user does not have to look through several pages of results to find the best
matches for their search. In other words, Relevance describes how well a given search
satisfies a user’s information needs. The problem that search relevance attacks is to
estimate how pertinent a result is to a query. Commercial search engines combine
hundreds of features to estimate relevance. The specific features and their mode of
combination are often kept secret to prevent users from manipulating the results.
Nevertheless, the main types of features in use, as well as the methods for their
combination, are publicly known and are the subject of scientific investigation.
Spelling Suggestions – ("Did you mean"). Typing mistakes are very common when
users enter search terms. The linguistic capabilities of modern search engines
allow the detection of these mistakes and the suggestion of related terms, improving
the quality of searches. Spell checking exceptions can also be defined in FAST: words
that are not found in the default spell checking dictionary but that are still valid.
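A "Did you mean" suggestion can be sketched with the approximate string matching of the Python standard library. The dictionary, the exception list and the similarity cutoff are invented for the example and are unrelated to the dictionaries shipped with FAST.

```python
from difflib import get_close_matches

# Hypothetical dictionary and exception list.
DICTIONARY = ["search", "server", "sharepoint", "crawler", "index"]
EXCEPTIONS = {"fs4sp"}  # valid terms that are absent from the dictionary

def did_you_mean(term):
    """Return a suggestion, or None when the term needs no correction."""
    term = term.lower()
    if term in DICTIONARY or term in EXCEPTIONS:
        return None
    matches = get_close_matches(term, DICTIONARY, n=1, cutoff=0.7)
    return matches[0] if matches else None

for typed in ("sharepont", "serach", "FS4SP", "search"):
    print(typed, "->", did_you_mean(typed))
```

Without the exception list the product abbreviation would be treated as a candidate for correction, which is the situation that spell checking exceptions are meant to avoid.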
Stemming – The process of reducing words to their stem or root form. An English
stemming algorithm should reduce the words "playing", "player" and "play" to the root
word "play". Stemming is a challenging task for algorithms and is considered
a difficult field of linguistic research. Each language needs its own stemming
algorithms; some of them are simpler than others, but the more complicated the
morphology and orthography of the language are, the more complex the stemming
becomes. Stemming is closely related to Lemmatization. FAST maps one form of a
word to its variants to enrich the query results.
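A naive suffix-stripping stemmer shows both the idea and its limits. The suffix list is a made-up minimum for the example; it is not the algorithm used by FAST or by established stemmers, which apply far more rules.

```python
# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ing", "er", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print([naive_stem(w) for w in ("playing", "player", "play")])
print(naive_stem("running"))  # the doubled consonant is not handled
```

The second call shows why real stemmers need language-specific rules: simple stripping leaves "runn" instead of "run".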
Synonyms – Synonyms are different words with almost identical or similar
meanings. Depending on language, geographical origin and socio-cultural status,
synonyms can differ in meaning because of etymology, orthography, phonic
qualities, ambiguous meanings, usage, etc.; these differences make Synonyms
difficult for search engines to process. Normally Synonyms are presented to
the search engines as Thesauruses, lists of related words.
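How a Thesaurus can be used at query time is shown in the following sketch. The thesaurus entries and the function name are invented; the expansion simply adds the listed synonyms after each query term.

```python
# Hypothetical thesaurus: term -> related words.
THESAURUS = {
    "car": {"automobile", "vehicle"},
    "invoice": {"bill"},
}

def expand_query(query):
    """Add the synonyms of each term so that related documents also match."""
    expanded = []
    for term in query.lower().split():
        expanded.append(term)
        expanded.extend(sorted(THESAURUS.get(term, ())))
    return expanded

print(expand_query("car invoice"))
```

Terms without an entry pass through unchanged, so the expansion never removes anything from the original query.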
Term Frequency (TF) – A measure of how often a term is found in a collection of
documents. TF is combined with Inverse Document Frequency (IDF) as a means of
determining which documents are most relevant to a query. TF is also used to measure
how often a word appears in a specific document.
Tokenization – The process of splitting a text into individual words or tokens to be
indexed. All separation characters (spaces, commas, dashes, periods, etc.) are
considered delimiting characters and are excluded from the indexes. Tokenization is
dependent on the language and very important for Relevance.
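A minimal tokenizer along these lines can be written with a regular expression. Treating every character that is not a letter, digit or underscore as a delimiter is an assumption that works for simple English text only; languages without spaces between words need dedicated word breakers.

```python
import re

def tokenize(text):
    """Split on runs of delimiting characters and lowercase the tokens."""
    return [token.lower() for token in re.split(r"\W+", text) if token]

print(tokenize("FAST-Search, for SharePoint 2010."))
```

The delimiters themselves (dash, comma, spaces and period) do not appear in the output, so they never reach the index.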
4.- FAST VERSIONS
FS4SP: FAST Search for SharePoint 2010 is the FAST version packaged for the
SharePoint 2010 environment. Licensing is per Client Access License (CAL)/server.
FSIA: FAST Search for Internal Applications is aimed at organizations looking for
a standalone FAST implementation (not integrated with SharePoint) for internal use. It
is generally sold on a CAL/server basis.
FSIS: FAST Search for Internet Sites is aimed at online search applications. FSIS is
licensed per server.
FAST ESP: Release 5.3, the current release when Microsoft bought the product, is
the last version of FAST before its integration in the Microsoft server stack. FAST
customers who currently have FAST ESP Maintenance & Support can upgrade to
either FSIS or FSIA.
Microsoft divides the FAST family into two groups:
Search solutions for Business Productivity:
o Microsoft FAST Search Server 2010 for SharePoint (“FS4SP”)
o Microsoft FAST Search Server 2010 for Internal Applications
(“FSIA”).
Search solutions for Internet Sites:
o Microsoft FAST Search Server 2010 for Internet Sites (“FSIS”).
o Microsoft SharePoint Server 2010 for Internet Sites, Enterprise
(“FIS-E”). This product includes rights to Microsoft FAST Search Server
2010 for SharePoint Internet Sites (“FS4FIS”).
From the Microsoft sales information about FAST:
“FSIS and FSIA must be purchased from FAST or FAST resellers. They are offered
only from the FAST price list, under a FAST EULA, and FAST maintenance and
support options are available. FS4SP and FIS-E will be available through Microsoft VL
only. FAST maintenance and support are not available for these VL products, but
Microsoft support and SA are available.”
“All servers need license coverage, just like for SharePoint or ESP for SharePoint.
The appropriate way to achieve license coverage depends on what the server is used
for.
Production (includes active and fault-tolerance servers), staging, admin, and hot
and warm stand-by servers all require product licensing
Cold stand-by servers used for disaster recovery do not require product licenses
as long as the customer is current on M or SA. This is a benefit of M/SA and
customers who drop M/SA lose this benefit.
Development and testing servers can be covered in a few ways. Under
Microsoft VL, customers can choose to cover them with product licenses
(server/CAL for FS4SP; server for FIS-E) or via MSDN subscriptions. Under
FAST, these rights will be included in the base licenses for FSIS and FSIA.
Each user of FSIA must be covered by a CAL.
Each virtual machine on a physical server counts as a server and requires a
separate license. This matches Microsoft licensing for server technology hosted
in a virtual environment.”
(http://www.microsoft.com/pathways/fast/FAST%20License%20Grants.htm)