Categories
Internet linux Microsoft Web Trends

Solr

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world’s largest internet sites.
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr’s powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

Solr is a layer of code on top of Lucene that transforms Lucene into a search platform for building search applications. Solr was created by Yonik Seeley while at CNET and contributed to Apache by CNET. Solr provides the following capabilities:

1. Web service: Solr places Lucene over HTTP, allowing programs written in any language to invoke Lucene
2. XML-based schema for managing indexed fields and their characteristics
3. System administration tools for configuration, data loading, index replication, statistics, logging and cache management
4. Large scale distributed search
5. Fixed/paid result list placement
6. Faceting — the dynamic clustering of items or search results into categories that lets users drill into search results (or even skip searching entirely) by any value in any field, as seen on popular ecommerce sites such as Amazon

Most users building Lucene-based search applications will find they can do so more quickly if they start with Solr since it contains many of the capabilities needed to turn a core search capability into a full-fledged search application. Most of the more recent large Lucene-based installations mentioned above use Solr, including AOL, Comcast Interactive Media and Netflix, and of course CNET. However, as in any open layered environment, users can still choose to work directly with the underlying Lucene library, perhaps to manipulate or exploit lower level Lucene capabilities.

Feature List of Solr

1) Faceted search
2) Full-text search
3) Hit highlighting
4) Dynamic clustering
5) Sorting
6) Filtering
7) Spell checking
8) Elevation
9) Boosting at index and query time
10) “Did you mean” spell checking
11) Finding Documents that are “More like this”
12) Overriding search results based on editorial input (also known as paid placement)
13) Term
14) Term Frequency
15) Position (based on analysis)
16) Offset (character based)
17) IDF – Inverse Document Frequency
18) CopyField functionality allows indexing a single field multiple ways, or combining multiple fields into a single searchable field

Query

1 HTTP interface with configurable response formats (XML/XSLT, JSON, Python, Ruby)
2 Sort by any number of fields
3 Advanced DisMax query parser for high relevancy results from user-entered queries
4 Highlighted context snippets
5 Faceted Searching based on unique field values and explicit queries
6 Spelling suggestions for user queries
7 More Like This suggestions for given document
8 Constant scoring range and prefix queries – no idf, coord, or lengthNorm factors, and no restriction on the number of terms the query matches.
9 Function Query – influence the score by a function of a field’s numeric value or ordinal
10 Date Math – specify dates relative to “NOW” in queries and updates
11 Performance Optimizations
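
As a rough illustration of how several of the query features above combine in one HTTP request, the sketch below builds a select URL in Python. The host, port, core name ("products"), and field names (name, description, price, category, added) are assumptions made up for this example; only the parameter names are standard Solr request parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical Solr endpoint: host, port, and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def build_solr_query(user_query: str) -> str:
    """Return a select URL that combines several of the query features above."""
    params = [
        ("q", user_query),
        ("defType", "dismax"),               # DisMax parser for user-entered queries
        ("qf", "name description"),          # hypothetical fields to search
        ("wt", "json"),                      # response format
        ("sort", "score desc,price asc"),    # sort by more than one field
        ("hl", "true"),                      # highlighted context snippets
        ("hl.fl", "description"),
        ("facet", "true"),                   # facet on unique field values
        ("facet.field", "category"),
        ("fq", "added:[NOW-7DAYS TO NOW]"),  # date math relative to NOW
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_solr_query("ice cream maker")
print(url)
```

The URL can then be fetched with any HTTP client; nothing here contacts a server, so the sketch runs without a Solr installation.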

Cache in Solr

Solr caches are associated with an Index Searcher — a particular ‘view’ of the index that doesn’t change. So as long as that Index Searcher is being used, any items in the cache will be valid and available for reuse. Caching in Solr is unlike ordinary caches in that Solr cached objects will not expire after a certain period of time; rather, cached objects will be valid as long as the Index Searcher is valid.

The current Index Searcher serves requests. When a new searcher is opened, it is auto-warmed while the current one continues to serve external requests. When the new searcher is ready, it is registered as the current searcher and handles any new search requests. The old searcher is closed after all requests it was servicing have finished. The current searcher is the source of auto-warming: when a new searcher is opened, its caches may be prepopulated, or “autowarmed”, using data from the caches in the old searcher.
There are currently two cache implementations — solr.search.LRUCache (LRU = Least Recently Used in memory), and solr.search.FastLRUCache.
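
The idea of a cache tied to one searcher, plus autowarming of its successor, can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not Solr's implementation: the class SearcherCache, the function autowarm, and the recompute callback are hypothetical names invented for this example.

```python
from collections import OrderedDict


class SearcherCache:
    """A tiny LRU cache that lives as long as one (hypothetical) searcher."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._items = OrderedDict()  # ordered from least to most recently used

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.max_size:
            self._items.popitem(last=False)  # evict the least recently used entry

    def keys_most_recent_first(self):
        return list(reversed(self._items))


def autowarm(old_cache, new_cache, autowarm_count, recompute):
    """Prepopulate new_cache from the most recently used keys of old_cache.

    Values are recomputed for the new index view instead of being copied,
    since results cached for the old view may no longer be correct.
    """
    hottest = old_cache.keys_most_recent_first()[:autowarm_count]
    for key in reversed(hottest):  # insert oldest first to preserve recency order
        new_cache.put(key, recompute(key))


old = SearcherCache(max_size=3)
for q in ["solr", "lucene", "facets"]:
    old.put(q, f"results for {q} (old index)")
old.get("solr")  # "solr" becomes the most recently used key

new = SearcherCache(max_size=3)
autowarm(old, new, autowarm_count=2,
         recompute=lambda q: f"results for {q} (new index)")
print(new.keys_most_recent_first())  # ['solr', 'facets']
```

Entries never expire on a timer in this model; they simply disappear when the cache object, and the searcher it belongs to, is discarded.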

Admin Interface
1 Comprehensive statistics on cache utilization, updates, and queries
2 Interactive schema browser that includes index statistics
3 Replication monitoring
4 Full logging control
5 Text analysis debugger, showing result of every stage in an analyzer
6 Web Query Interface w/ debugging output
o parsed query output
o Lucene explain() document score detailing
o explain score for documents outside of the requested range to debug why a given document wasn’t ranked higher.

To summarize, Solr is not meant to be a replacement for your RDBMS. Rather, Solr should be used to build the search service. Solr does a good job of searching and finding relevant items for a query. In truth, all search engines can and should be tuned, and Solr is no exception.



Most Common Terms in Internet Industry


Absolute Path

An absolute path or full path is a unique location of a file or directory name within a computer or filesystem, and usually starts with the root directory or drive letter. Directories and subdirectories listed in a path are usually separated by a slash /.

Example: /Users/Matt/www/blog/images/icecream.jpg
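
As a small illustration in Python, the standard posixpath module can tell an absolute path from a relative one and resolve a relative path against a base directory. The base directory and the relative path below are assumptions chosen to match the example above.

```python
import posixpath

path = "/Users/Matt/www/blog/images/icecream.jpg"
print(posixpath.isabs(path))                   # True
print(posixpath.isabs("images/icecream.jpg"))  # False

# Resolve a relative path against a base directory to get an absolute path.
base_dir = "/Users/Matt/www/blog"
resolved = posixpath.normpath(
    posixpath.join(base_dir, "images/../images/icecream.jpg")
)
print(resolved)  # /Users/Matt/www/blog/images/icecream.jpg
```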

To find the absolute path of a page, copy the text below into a new text file, save the file as path.php. Then open it in a Web browser (for example, http://www.example.com/images/path.php).


Absolute URI

A full URI.

http://www.example.com/blog/images/icecream.jpg
 ftp://ftp.example.com/users/h/harriet/www/
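
A simplified way to check for an absolute URI in Python is to look for a scheme such as http or ftp. The helper is_absolute_uri is a hypothetical name for this sketch, and the check is deliberately looser than the formal definition in the URI specification.

```python
from urllib.parse import urlsplit, urljoin


def is_absolute_uri(uri: str) -> bool:
    """Treat a URI as absolute when it carries a scheme such as http or ftp."""
    return bool(urlsplit(uri).scheme)


print(is_absolute_uri("http://www.example.com/blog/images/icecream.jpg"))  # True
print(is_absolute_uri("images/icecream.jpg"))                              # False

# A relative reference becomes absolute once it is resolved against a base URI.
full = urljoin("http://www.example.com/blog/", "images/icecream.jpg")
print(full)  # http://www.example.com/blog/images/icecream.jpg
```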

Apache

Apache is short for Apache HTTP Server Project, a robust, commercial-grade, featureful, and freely-available open source HTTP Web Server software produced by the Apache Software Foundation. It is the most commonly used web server on the internet, and is available on many platforms, including Windows, Unix/Linux, and Mac OS X. Apache serves as a great foundation for publishing WordPress-powered sites.

Array

An array is one of the basic data structures used in computer programming. An array contains a list (or vector) of items such as numeric or string values. Arrays allow programmers to randomly access data. Data can be stored in either one-dimensional or multi-dimensional arrays.

A one-dimensional array of seven (7) elements would be:

105 200 54 53 102 13 405

The Template Tag wp_list_categories() uses a one-dimensional array for the ‘exclude’ parameter.

An example of a two-dimensional array, 7 by 3 elements in size, would be:

105 200 54 53 102 13 405
15 210 14 513 2 2313 4512
501 500 499 488 552 75 1952
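
For illustration, the same values can be written as Python lists, which are one common way to model arrays. The variable names are assumptions, and indexes start at 0.

```python
one_d = [105, 200, 54, 53, 102, 13, 405]

two_d = [
    [105, 200, 54, 53, 102, 13, 405],
    [15, 210, 14, 513, 2, 2313, 4512],
    [501, 500, 499, 488, 552, 75, 1952],
]

print(one_d[2])                   # 54 (third element)
print(two_d[1][3])                # 513 (second row, fourth column)
print(len(two_d), len(two_d[0]))  # 3 7 (three rows of seven elements)
```

Random access means any element can be read directly by its index, without walking through the elements before it.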

ASCII

ASCII is short for American Standard Code for Information Interchange. Pronounced as “ask ee”, it is a standard set of codes used to represent numbers, letters, symbols, and punctuation marks.
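
A short Python sketch shows the mapping between characters and their ASCII codes:

```python
print(ord("A"))                # 65
print(chr(97))                 # a
codes = list("Hi!".encode("ascii"))
print(codes)                   # [72, 105, 33]
print(bytes(codes).decode("ascii"))  # Hi!
```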

Atom

A format for syndicating content on news-like sites, viewable by Atom-aware programs called news readers or aggregators.

Avatar

An avatar is a graphic image or picture that represents a user.

Binaries

Binaries refer to compiled computer programs, or executables. Many open source projects, which can be re-compiled from source code, offer pre-compiled binaries for the most popular platforms and operating systems.

Blog

A blog, or weblog, is an online journal, diary, or serial published by a person or group of people.

Blogs are typically used by individuals or peer groups, but are occasionally used by companies or organizations as well. In the corporate arena, the only adopters of the blog format so far have tended to be design firms, web media companies, and other “bleeding edge” tech firms.

Blogs often contain public as well as private content. Depending on the functionality of the CMS software that is used, some authors may restrict access — through the use of accounts or passwords — to content that is too personal to be published publicly.

Blogging

Blogging is the act of writing in one’s blog. To blog something is to write about something in one’s blog. This sometimes involves linking to something the author finds interesting on the internet.

Blogosphere

The blogosphere is the subset of internet web sites which are, or relate to, blogs.

Blogroll

A blogroll is a list of links to various blogs or news sites. Often a blogroll is “rolled” by a service which tracks updates (using feeds) to each site in the list, and provides the list in a form which aggregates update information.

Bookmarklet

A bookmarklet (or favelet) is a “faux” bookmark containing scripting code, usually written in JavaScript, that allows the user to perform a function.

Boolean

A variable or expression which evaluates to either true or false.

Category

Each post in WordPress is filed under a category. Thoughtful categorization allows posts to be grouped with others of similar content and aids in the navigation of a site. Please note, the post category should not be confused with the Link Categories used to classify and manage Links.

Capabilities

A term related to user authentication and access control. It is an adoption of permissions in RBAC (role-based access control). There are about thirty capabilities in WordPress. See Roles and Capabilities for a description of the concept and a list of capabilities.

CGI

CGI (Common Gateway Interface) is a specification for server-side communication scripts designed to transfer information between a Web server and a web-client (browser). Typically, HTML pages that collect data via forms use CGI programming to process the form data once the client submits it.
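
The flow can be sketched in Python. Under CGI, the web server hands request details to the script through environment variables such as QUERY_STRING, and the script writes headers, a blank line, and the body to standard output. The function handle_request and the form field "name" are assumptions for this example; a real script would pass os.environ and print the result.

```python
import html
from urllib.parse import parse_qs


def handle_request(environ: dict) -> str:
    """Build a CGI-style response (headers, blank line, body) from form data."""
    form = parse_qs(environ.get("QUERY_STRING", ""))
    name = form.get("name", ["stranger"])[0]
    body = f"<p>Hello, {html.escape(name)}!</p>"
    return "Content-Type: text/html\r\n\r\n" + body


# Simulate a form submitted with GET; a real CGI script would use os.environ.
response = handle_request({"QUERY_STRING": "name=Harriet&topic=blogs"})
print(response)
```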

Character Entity

A character entity is a method used to display special characters normally reserved for use in HTML. For example, the less than (<) and greater than (>) symbols are used as part of the HTML tag structure, so both symbols are reserved for that use. But if you need to display those symbols on your site, you can use character entities. For example:

use &lt; for the less than (<) symbol
use &gt; for the greater than (>) symbol
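
In Python, the standard html module converts between reserved characters and their character entities, as in this short sketch:

```python
import html

escaped = html.escape("if a < b and b > c")
print(escaped)                     # if a &lt; b and b &gt; c

restored = html.unescape("&lt;p&gt;")
print(restored)                    # <p>
```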

Character Set

A character set is a collection of symbols (letters, numbers, punctuation, and special characters) that, when used together, represent meaningful words in a language. Computers use an encoding scheme so that members of a character set are stored with a numeric value (e.g. 0=A, 1=B, 2=C, 3=D). In addition, a collation determines the order (e.g. alphabetic) to use when sorting the character set.

By default, WordPress uses the Unicode UTF-8 (utf8) character set for the WordPress MySQL database tables created during the installation process. Beginning with Version 2.2, the database character set (and collation) is defined in the wp-config.php file. Also note, the character set used for syndication feeds is set in the Administration > Settings > Reading panel.
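
To make the idea of an encoding concrete, the Python sketch below stores a short string as UTF-8 bytes. The sample word is an arbitrary choice; note that the accented letter takes two bytes.

```python
text = "café"
encoded = text.encode("utf-8")

print(encoded)                          # b'caf\xc3\xa9'
print(len(text), len(encoded))          # 4 5
print(encoded.decode("utf-8") == text)  # True
```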

chmod

chmod is a Unix/Linux shell command used to change permissions on files. Its name is a contraction of “change mode.”
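
The same operation is available from Python through os.chmod. This sketch assumes a Unix-like system and works on a throwaway file in a temporary directory; the file name is made up for the example.

```python
import os
import stat
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.txt")
    with open(path, "w") as f:
        f.write("hello\n")

    os.chmod(path, 0o644)  # equivalent to: chmod 644 example.txt

    st_mode = os.stat(path).st_mode
    mode = stat.S_IMODE(st_mode)   # permission bits only
    listing = stat.filemode(st_mode)
    print(oct(mode), listing)      # on Unix-like systems: 0o644 -rw-r--r--
```

Mode 644 gives the owner read and write access and everyone else read-only access.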

Class

Classes are groupings of CSS styles which can be applied to any HTML element.
For classes in PHP, see the Class (Computing) article at Wikipedia and PHP Manual: Classes and Objects.

Collation

Collation refers to the order used to sort the letters, numbers, and symbols of a given character set. For example, because WordPress uses the UTF-8 (utf8) character set by default, MySQL assigns the utf8_general_ci collation to the WordPress database tables when they are created during the installation process. Beginning with Version 2.2, the collation (and character set) used by WordPress is defined in the wp-config.php file.
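
The effect of a collation can be illustrated in Python by sorting the same words under two different ordering rules. This is only an analogy: the "_ci" suffix in utf8_general_ci means case-insensitive, which is approximated here with str.casefold, not with MySQL's actual comparison rules.

```python
words = ["banana", "apple", "Cherry"]

by_code_point = sorted(words)                       # uppercase sorts before lowercase
case_insensitive = sorted(words, key=str.casefold)  # ignores letter case

print(by_code_point)     # ['Cherry', 'apple', 'banana']
print(case_insensitive)  # ['apple', 'banana', 'Cherry']
```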

Comments

Comments are a feature of blogs which allow readers to respond to posts. Typically readers simply provide their own thoughts regarding the content of the post, but users may also provide links to other resources, generate discussion, or simply compliment the author for a well-written post.

You can control and regulate comments by filters for language and content. Comments can be queued for approval before they are visible on the web site. This is useful in dealing with comment spam.

Content

Content consists of text, images, or other information shared in posts. This is separate from the structural design of a web site, which provides a framework into which the content is inserted, and the presentation of a site, which involves graphic design. A Content Management System changes and updates content, rather than the structural or graphic design of a web site.

Content Management System

A Content Management System, or CMS, is software for facilitating the maintenance of content, but not design, on a web site. A blogging tool is an example of a Content Management System.

cPanel

cPanel is a popular web-based administration tool that many hosting providers provide to allow users to configure their own accounts using an easy-to-use interface.

CSS

CSS, or Cascading Style Sheets, is a W3C open standard style sheet language for specifying how a web page is presented. It allows web site designers to create formatting and layout for a web site independently of its content.

Database

A database in computing terms is software used to manage information in an organized fashion. WordPress uses the MySQL relational database management system for storing and retrieving the content of your blog, such as posts, comments, and so on.

Default theme

Every installation of WordPress has a default theme. The default theme is sometimes called the fallback theme, because if the active theme is for some reason lost or deleted, WordPress will fall back to using the default theme.

Up to Version 2.9.2 the default theme was the WordPress Default theme (sometimes called Kubrick) and was housed in the wp-content/themes/default folder. Starting with Version 3.0, the Twenty Ten theme became the default (and fallback) theme.

Deprecated

Deprecated functions or template tags are no longer supported, and will soon be obsolete.

Developer

A developer, or dev, is a computer programmer who is active in creating, modifying, and updating a software product.

DIV

A DIV element in HTML marks a section of text. DIVs are used extensively in WordPress to apply CSS stylings to particular blog elements.

DOM

DOM (Document Object Model) is a standard, platform-independent interface that allows programmers to dynamically access HTML and XML to control the content and structure of documents. DOM connects programming scripts to web pages.
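
As a minimal sketch of DOM access from a script, Python's standard xml.dom.minidom module can parse a document, change its content, and add to its structure. The tiny document and its element names are made up for this example.

```python
from xml.dom import minidom

doc = minidom.parseString("<post><title>Hello</title></post>")

# Change content: replace the text inside the existing <title> element.
title = doc.getElementsByTagName("title")[0]
title.firstChild.data = "Hello, DOM"

# Change structure: create a new <author> element and attach it to the root.
author = doc.createElement("author")
author.appendChild(doc.createTextNode("Matt"))
doc.documentElement.appendChild(author)

result = doc.documentElement.toxml()
print(result)  # <post><title>Hello, DOM</title><author>Matt</author></post>
```

Browsers expose the same kind of interface to JavaScript, with similarly named methods such as getElementsByTagName and createElement.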

Draft

The draft post status is for WordPress posts which are saved, but as yet unpublished. A draft post can only be edited through the Administration Panel, Write Post SubPanel by users of equal or greater User Level than the post’s author.

Excerpt

An excerpt is a condensed description of your blog post and refers to the summary entered in the Excerpt field of the Administration > Posts > Add New SubPanel. The excerpt is used to describe your post in RSS feeds and is typically used in displaying search results. The excerpt is sometimes used in displaying the Archives and Category views of your posts. Use the Template Tag the_excerpt() to display the contents of this field. Note that if you do not enter information into the Excerpt field when writing a post, and you use the_excerpt() in your theme template files, WordPress will automatically display the first 55 words of the post’s content.
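
The 55-word fallback can be approximated in a few lines of Python. This is a sketch of the idea only, not WordPress's own code: the function auto_excerpt is a hypothetical name, and it ignores details such as markup in the post content.

```python
def auto_excerpt(content: str, word_limit: int = 55) -> str:
    """Return at most the first word_limit whitespace-separated words."""
    words = content.split()
    return " ".join(words[:word_limit])


long_post = " ".join(f"word{i}" for i in range(1, 101))  # a 100-word post
excerpt = auto_excerpt(long_post)

print(len(excerpt.split()))   # 55
print(excerpt.split()[-1])    # word55
```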

An excerpt should not be confused with the teaser, which refers to words before the
