Sphinx
======

This is the documentation for the full-text search extension for Newznab.
It is built on top of the very powerful Sphinx full-text indexer.  To learn
more about Sphinx go to: http://sphinxsearch.com/


Install
-------

To install the search indexing system for Newznab, follow the steps below.  If
you follow these directions carefully you shouldn't have any issues.  Read
through them at least once before actually doing anything to make sure you know
what's going to happen.

1)  Download and/or install Sphinx.  Make sure to get version 2.0.2 or higher:
    http://sphinxsearch.com/

2)  Create the necessary directories. By default the directory ``sphinxdata`` in
    newznab's db directory will be used.  If you don't need to specify a
    different location, you can skip this step and go to step 3.
    
    The first directory is where Sphinx will write the indexes to.  Make sure it
    exists and is writeable::
 
        mkdir /path/to/index/storage/dir
        chmod 755 /path/to/index/storage/dir

    The second directory is where Sphinx will hold its binary log files.  This
    directory needs to exist inside of the first directory.
    Again, make sure the directory exists and is writeable::
 
        mkdir /path/to/index/storage/dir/binlog
        chmod 755 /path/to/index/storage/dir/binlog

3)  Generate a ``sphinx.conf`` file.  The ``nnindexer.php generate`` command
    takes a single optional argument: the path to the first directory you
    created in step 2 (the index storage directory).  If you didn't create a
    custom directory in step 2, then type::

        ./nnindexer.php generate
    
    If you created a custom storage directory in step 2, pass the **first**
    directory you created as the first argument to the ``generate`` command::
    
        ./nnindexer.php generate /path/to/index/storage/dir
    
    The command will generate a ``sphinx.conf`` for you and will print out
    where it saved it to.  It will also update the "Sphinx Configuration Path"
    setting in the database.

4)  Log in to the admin area of your Newznab install and set the Sphinx
    settings as desired.  Two very important options that you **must** have
    filled in correctly before proceeding are:
 
    a) "Sphinx Configuration Path" - The full path to the ``sphinx.conf``
       file that you just generated in step 3.  Verify that this matches the
       value printed out by the ``nnindexer.php generate`` command that you
       ran in step 3.
    
    b) "Sphinx Binaries Path" - The full path to the location where you
       installed the Sphinx binaries to in Step 1.  If you're not sure where
       this is, you want the **directory** returned by the command
       ``which searchd`` or equivalent.  If you leave this blank, then it is
       imperative that the Sphinx binaries be installed to a location within
       your system's ``$PATH`` variable (or the Windows equivalent if not on a
       \*nix platform).
    
    It's worth mentioning that if you want the following commands to work, you
    need to make sure that "Use Sphinx" is set to "Yes".
    
    This is also a good time to decide which indexes to enable.  The default
    ``releases`` index is enabled by default and cannot be disabled (unless you
    disable Sphinx entirely).  For more information on the indexes and how much
    "effort" it takes if they are enabled, see the section "Available Indexes"
    below.  As a general rule, if you just want to speed up searching releases,
    leave the extra indexes disabled.  You can always enable them later if you
    want.

5)  Start the Sphinx search daemon (``searchd``)::

        ./nnindexer.php daemon

    Don't worry about any errors mentioning missing indexes, preload or
    "No such file or directory; NOT SERVING"--this is normal because we haven't
    indexed anything yet (that's the next step).
    
    .. important::
    
        You **must** run the daemon as the same user that runs
        ``update_releases.php``.  If you don't, things will almost
        certainly not work correctly!

6)  Generate the initial indexes::

        ./nnindexer.php index full all
        ./nnindexer.php index delta all

    Depending on which items you've enabled for indexing, this step could take
    a while.

7)  Restart the search daemon now that all of the indexes exist.  This
    restart is only needed once, because the initial indexes did not exist
    when the daemon was first started.  Future updates need no restart and
    cause zero downtime, because we take advantage of Sphinx's ability to
    "rotate" indexes::
    
        # Stop the search daemon...
        ./nnindexer.php daemon --stop
        
        # ...and restart it
        ./nnindexer.php daemon

8)  You're done!
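
As a minimal sketch of the directory preparation in step 2 and the binaries
lookup in step 4, the script below creates the storage directory together with
its ``binlog`` subdirectory, checks that both are writable, and prints the
directory that contains ``searchd``.  The ``INDEX_DIR`` variable and its
scratch default are assumptions made for illustration; point it at your real
storage location before using it.

```shell
#!/bin/sh
# Sketch of step 2: prepare a custom index storage directory.
# INDEX_DIR is a placeholder; it defaults to a scratch location so the
# sketch can be tried without touching a real install.
set -eu

INDEX_DIR="${INDEX_DIR:-${TMPDIR:-/tmp}/sphinxdata-example}"

# -p creates missing parents and does not fail if the directory exists.
mkdir -p "$INDEX_DIR/binlog"
chmod 755 "$INDEX_DIR" "$INDEX_DIR/binlog"

# Confirm that the current user can actually write to both directories.
for dir in "$INDEX_DIR" "$INDEX_DIR/binlog"; do
    if [ -w "$dir" ]; then
        echo "writable: $dir"
    else
        echo "NOT writable: $dir" >&2
        exit 1
    fi
done

# Step 4b: the "Sphinx Binaries Path" is the directory containing searchd.
if searchd_path="$(command -v searchd)"; then
    echo "Sphinx Binaries Path: $(dirname "$searchd_path")"
else
    echo "searchd is not on \$PATH; enter its directory manually"
fi
```

Run it as the same user that will run the daemon, then pass the same
directory to ``./nnindexer.php generate`` as described in step 3.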


Overview
--------

Below you'll find some useful information for understanding how Sphinx works
and how it is integrated into Newznab.

Full vs. Delta
~~~~~~~~~~~~~~

Sphinx is designed in such a way that every time you "index", you actually
have to "re-index"; you can't simply update the index with only the new
data.  However, we obviously don't want to re-index such a large dataset
every time a new record is added.  To get around this, we use a "delta index"
update scheme.  The way this works is fairly simple: for every index we
actually have two indexes, a "main" or "full" index and a "delta" index.  The
"main" index holds most of the indexed data, while the "delta" index only
holds the data that has been added or modified since we last updated the
"main" index.  Fortunately, Sphinx also provides a way to merge indexes, so
every so often (say once a day) we merge the "delta" into the "main".  You can
control this merge frequency via the "Merge Frequency" setting on the site
settings page.

For more information about how this works see the Sphinx website:
http://sphinxsearch.com/docs/2.0.2/delta-updates.html
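
Newznab handles this cycle for you.  Purely to illustrate the idea, the sketch
below prints the kind of raw Sphinx ``indexer`` invocations that a main+delta
scheme typically relies on.  The index names ``releases`` and
``releases_delta`` and the configuration path are assumptions, not values
taken from the generated ``sphinx.conf``, and the script only echoes the
commands rather than running them.

```shell
#!/bin/sh
# Illustration only: print the raw Sphinx commands behind a main+delta cycle.
# CONF, MAIN and DELTA are hypothetical; check your generated sphinx.conf.
CONF="${CONF:-/path/to/sphinx.conf}"
MAIN="${MAIN:-releases}"
DELTA="${DELTA:-releases_delta}"

# Frequent: rebuild only the small delta index and hot-swap it into searchd.
reindex_delta() {
    echo indexer --config "$CONF" --rotate "$DELTA"
}

# Occasional (say once a day): fold the delta into the main index.
merge_delta() {
    echo indexer --config "$CONF" --merge "$MAIN" "$DELTA" --rotate
}

reindex_delta
merge_delta
```

A real setup also has to track which rows the main index already covers, so
that the delta only picks up newer ones; that bookkeeping is omitted here.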

Fields vs. Attributes
~~~~~~~~~~~~~~~~~~~~~

An important concept in Sphinx is the difference between "fields" and
"attributes".  "Fields" hold the text that a search string is matched against;
this is the data that makes the index a "full-text" index.  "Attributes"
contain data that gets attached to each record in the full-text index.  While
not directly searchable, "attributes" can be used to filter, group and sort
the results returned from the search.
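
To make the distinction concrete, here is a hedged sketch that builds a
SphinxQL query: ``MATCH()`` searches the ``searchname`` field, while the
``categoryID`` and ``postdate`` attributes filter and sort the matches.  It
assumes the index is queried under the name ``releases`` and that ``searchd``
exposes a SphinxQL listener on port 9306; neither is confirmed by the
generated configuration, and the category value is an arbitrary example.

```shell
#!/bin/sh
# Illustration only: build a SphinxQL query that searches a field and
# filters/sorts on attributes. Host and port are assumptions (9306 is the
# port commonly used for SphinxQL listeners).
SPHINX_HOST="${SPHINX_HOST:-127.0.0.1}"
SPHINX_PORT="${SPHINX_PORT:-9306}"

build_query() {
    # $1 = search terms (plain words, no quote characters)
    # $2 = numeric categoryID to filter on
    # MATCH() runs the full-text search against fields (here: searchname);
    # categoryID and postdate are attributes used to filter and sort.
    printf "SELECT * FROM releases WHERE MATCH('@searchname %s') AND categoryID = %d ORDER BY postdate DESC LIMIT 10" "$1" "$2"
}

query="$(build_query "ubuntu" 5000)"
echo "$query"

# To run it against a live searchd with a SphinxQL listener, uncomment:
# mysql -h "$SPHINX_HOST" -P "$SPHINX_PORT" -e "$query"
```

Swapping ``categoryID`` for any other attribute listed under "Available
Indexes" changes the filter without touching the full-text part of the query.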

Deleting Releases
~~~~~~~~~~~~~~~~~

Removing individual items from a Sphinx index is not straightforward.  As a
simple remedy, there is the "Rebuild Frequency" setting on the site settings
page, which controls how often we do a full rebuild of the main index.  When
the main index is rebuilt, deleted items are no longer present in it.  Note
that even though your index may contain items that have since been deleted
from MySQL, this has no visible effect on the search results on Newznab's
frontend.  We rebuild so that the performance and integrity of the main index
do not degrade over time.


Available Indexes
-----------------

Currently there are 5 supported indexes.  You can enable or disable any of them
except for the main "releases" index.  They are, listed in order of difficulty
to index:

    * releases (main)
    * releasefiles
    * releasenfo
    * nzbs
    * predb

As stated, you can choose to enable or disable any of the indexes except for
"releases".  In order to decide which ones to enable/disable, below you will
find some information about each index which might help you make your decision.


Index: releases
~~~~~~~~~~~~~~~

This is the main index.  It indexes nearly all of the data contained within the
"releases", "bookinfo", "consoleinfo", "episodeinfo", "musicinfo" and
"movieinfo" tables in Newznab.  The "delta" index contains all releases that
have been added or modified since the last time the "main" index was updated.
This ensures that not just new releases are indexed, but also ones that were
updated as well.

The searchable fields are:

  * name
  * searchname
  * fromname
  * tvtitle
  * season
  * episode
  * bookinfo_title
  * bookinfo_author
  * bookinfo_publisher
  * bookinfo_review
  * consoleinfo_title
  * consoleinfo_publisher
  * consoleinfo_review
  * episodeinfo_showtitle
  * episodeinfo_eptitle
  * episodeinfo_director
  * episodeinfo_writer
  * episodeinfo_gueststars
  * episodeinfo_overview
  * musicinfo_title
  * musicinfo_review
  * musicinfo_artist
  * musicinfo_publisher
  * musicinfo_tracks
  * movieinfo_title
  * movieinfo_tagline
  * movieinfo_plot
  * movieinfo_director
  * movieinfo_actors
  * movieinfo_genre
  * predb_dirname
  * predb_filename
 
The attributes are:

  ======================= ==============
  size                      ``bigint``
  groupID                   ``uint``
  categoryID                ``uint``
  totalpart                 ``uint``
  grabs                     ``uint``
  completion                ``uint``
  regexID                   ``uint``
  rageID                    ``uint``
  tvdbID                    ``uint``
  imdbID                    ``uint``
  episodeinfoID             ``uint``
  musicinfoID               ``uint``
  consoleinfoID             ``uint``
  bookinfoID                ``uint``
  preID                     ``uint``
  anidbID                   ``uint``
  reqID                     ``uint``
  comments                  ``uint``
  passwordstatus            ``uint``
  rarinnerfilecount         ``uint``
  haspreview                ``uint``
  guid                      ``string``
  seriesfull                ``string``
  postdate                  ``timestamp``
  adddate                   ``timestamp``
  tvairdate                 ``timestamp``
  bookinfo_genreID          ``uint``
  bookinfo_pages            ``uint``
  bookinfo_cover            ``uint``
  bookinfo_asin             ``string``
  bookinfo_url              ``string``
  bookinfo_dewey            ``string``
  bookinfo_ean              ``string``
  bookinfo_isbn             ``string``
  bookinfo_publishdate      ``timestamp``
  consoleinfo_asin          ``string``
  consoleinfo_url           ``string``
  consoleinfo_salesrank     ``uint``
  consoleinfo_platform      ``string``
  consoleinfo_genreID       ``uint``
  consoleinfo_esrb          ``string``
  consoleinfo_releasedate   ``timestamp``
  consoleinfo_cover         ``uint``
  episodeinfo_rageID        ``uint``
  episodeinfo_tvdbID        ``uint``
  episodeinfo_imdbID        ``uint``
  episodeinfo_epabsolute    ``uint``
  episodeinfo_rating        ``float``
  episodeinfo_fullep        ``string``
  episodeinfo_link          ``string``
  episodeinfo_airdate       ``timestamp``
  musicinfo_salesrank       ``uint``
  musicinfo_genreID         ``uint``
  musicinfo_cover           ``uint``
  musicinfo_asin            ``string``
  musicinfo_year            ``string``
  musicinfo_releasedate     ``timestamp``
  movieinfo_imdbID          ``uint``
  movieinfo_tmdbID          ``uint``
  movieinfo_year            ``uint``
  movieinfo_cover           ``uint``
  movieinfo_backdrop        ``uint``
  movieinfo_rating          ``float``
  movieinfo_language        ``string``
  predb_ctime               ``uint``
  predb_nuketime            ``uint``
  predb_filesize            ``float``
  predb_filecount           ``uint``
  predb_category            ``string``
  predb_nuketype            ``string``
  predb_nukereason          ``string``
  ======================= ==============


Index: releasefiles
~~~~~~~~~~~~~~~~~~~

**Optional**

This indexes everything in the "releasefiles" table within Newznab.  An
important thing to note here is that due to the nature of the query needed
for this index, all the results need to be obtained in a single query.  As a
result, your "releasefiles" table might become locked for an extended period
of time as this index is built.  However, depending on your database and
hardware, this might be a non-issue for you, so it is best to test it and see
what works.  A solution for this might be implemented in future versions.

The searchable fields are:

  * name    (a concatenated list of all the file names for a given release)

There are no attributes associated with this index.


Index: releasenfo
~~~~~~~~~~~~~~~~~

**Optional**

This indexes everything in the "releasenfo" table within Newznab.  Since the
NFOs can be fairly large documents of text, this index takes considerably longer
to index than the others listed above and also requires more disk space.

The searchable fields are:

  * nfo

There are no attributes associated with this index.

Index: nzbs
~~~~~~~~~~~

**Optional**

This indexes the contents of all the NZBs.  You should think very carefully
about whether or not your machine is capable of dealing with this index as it
requires 2-3 orders of magnitude more disk space and processing time than all
of the other indexes **combined**.  With that said, this index also uses
Sphinx's "real-time" indexing functionality.  What that really means for you is
that once you have the data indexed, you won't ever really have to re-index it
(unlike the other indexes which **do not** work this way).

The searchable fields are:

  * file_names (a space-concatenated string of the file names)

The attributes are:

  * file_count (int)

Index: predb
~~~~~~~~~~~~

**Optional**

If you use the ``nzpre`` feature and you frequently search PreDB, then this
might be a worthwhile index for you.  Since the ``predb`` table can contain
many rows (3-5x as many as ``releases``), this might strain your system a bit.

The searchable fields are:
  
  * dirname
  * category
  * nuketype

The attributes are:

  * ctime (uint)
  * guid (string)
  * nfoID (uint)


API
---

.. toctree::
    :glob:

    ../dev/misc/sphinx/*