
All About Indexing and Basic Data Operations – Part 1- Ultimate Solr Guide

Hello Everyone! Today I would like to discuss a very important aspect of Apache Solr. To perform any basic search operation, we need an index. A Solr index is organized around documents, where each document is essentially a set of fields with values. Indexing is therefore a crucial operation when building an application on Solr. Today, we will discuss the process of indexing: adding content to a Solr index and, if necessary, modifying or deleting that content.

A Solr index can accept data from many different sources, including XML files, comma-separated value (CSV) files, data extracted from tables in a database, and files in common file formats such as Microsoft Word or PDF.

Here are the three most common ways of loading data into a Solr index:

  • Using the Solr Cell framework built on Apache Tika for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
  • Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated.
  • Writing a custom Java application to ingest data through Solr’s Java Client API (which is described in more detail in Client APIs). Using the Java API may be the best choice if you’re working with an application, such as a Content Management System (CMS), that offers a Java API.

Regardless of the method used to ingest data, there is a common basic data structure for data being fed into a Solr index: a document containing multiple fields, each with a name and containing content, which may be empty. One of the fields is usually designated as a unique ID field (analogous to a primary key in a database), although the use of a unique ID field is not strictly required by Solr. If the field name is defined in the Schema that is associated with the index, then the analysis steps associated with that field will be applied to its content when the content is tokenized. Fields that are not explicitly defined in the Schema will either be ignored or mapped to a dynamic field definition (see Documents, Fields, and Schema Design) if one matching the field name exists.


Using POST Tool

Solr includes a simple command-line tool for POSTing various types of content to a Solr server.

To run it, open a window and enter:
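For example, to index one of the sample files that ships with Solr (the file path below is illustrative; any supported file works):

    # post a sample file to the test_col collection on localhost:8983
    bin/post -c test_col example/films/films.json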

This will contact the server at localhost:8983. Specifying the collection/core name is mandatory.

Using the bin/post tool

The basic usage of bin/post is as below:
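In outline, it looks like this (run bin/post -h for the full, authoritative option list; the options below are only a selection):

    bin/post -c <collection> [options] <files|directories|urls|-d "args">

    # Commonly used options:
    #   -c <collection>   collection/core to index into (mandatory)
    #   -p <port>         Solr port (default: 8983)
    #   -commit yes|no    commit at the end of the request (default: yes)
    #   -type <type>      content type, e.g. text/csv
    #   -params "..."     additional URL-encoded request parameters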

Basic indexing operations

Indexing XML

Add all documents with the file extension .xml to the collection or core named test_col:
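For example, from the directory holding the XML files:

    # index every .xml file in the current directory
    bin/post -c test_col *.xml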

Send XML arguments to delete a document from test_col.
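Here the document ID is illustrative:

    # delete the document whose unique ID is 1
    bin/post -c test_col -d '<delete><id>1</id></delete>'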

Indexing CSV

Index all CSV files into test_col
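For example:

    # index every .csv file in the current directory
    bin/post -c test_col *.csv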

Index a tab-separated file into test_col:
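The file name data.tsv below is illustrative; %09 is the URL-encoded tab character:

    # treat a tab-separated file as CSV with a tab separator
    bin/post -c test_col -params "separator=%09" -type text/csv data.tsv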

The content type (-type) parameter is required so the file is treated as the proper type; otherwise it will be ignored and a WARNING logged, since the tool does not know what type of content a .tsv file is. The CSV handler supports the separator parameter, which is passed through using the -params setting.

Indexing JSON

Index all JSON files into test_col
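For example:

    # index every .json file in the current directory
    bin/post -c test_col *.json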

Indexing Rich Documents (PDF, Word, HTML, etc.)

Index a PDF file into test_col
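For example (a.pdf stands in for your own file):

    bin/post -c test_col a.pdf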

Automatically detect content types in a folder, and recursively scan it for documents to index into test_col:
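For example (afolder/ stands in for your own directory):

    # recursively index everything under afolder/, auto-detecting content types
    bin/post -c test_col afolder/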

Automatically detect content types in a folder, but limit it to PPT and HTML files and index them into test_col:
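For example:

    # restrict the crawl to .ppt and .html files
    bin/post -c test_col -filetypes ppt,html afolder/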

Indexing to a Password-Protected Solr (Basic Auth)


Index a PDF as the user “user1” with password “password”:
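The credentials are passed with the tool's -u option:

    # user1:password are the Basic Auth credentials; a.pdf is illustrative
    bin/post -u user1:password -c test_col a.pdf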

SimplePostTool

The bin/post script currently delegates to a standalone Java program called SimplePostTool.

This tool, bundled into an executable JAR, can be run directly using java -jar example/exampledocs/post.jar.
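For example (the collection name and file glob are illustrative; system properties such as -Dc take the place of bin/post's flags):

    # index every .xml file in the current directory into test_col
    java -Dc=test_col -jar example/exampledocs/post.jar *.xml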

Uploading Data with Index Handlers

Index Handlers are Request Handlers designed to add, delete, and update documents in the index. In addition to plugins for importing rich documents using Tika and for loading from structured data sources using the Data Import Handler, Solr natively supports indexing structured documents in XML, CSV, and JSON.

The recommended way to configure and use request handlers is with path-based names that map to paths in the request URL. However, request handlers can also be specified with the qt (query type) parameter if the requestDispatcher is appropriately configured. It is possible to access the same handler using more than one name, which can be useful if you wish to specify different sets of default options.

A single unified update request handler supports XML, CSV, JSON, and javabin update requests, delegating to the appropriate ContentStreamLoader based on the Content-Type of the ContentStream.

UpdateRequestHandler Configuration

The update request handler is configured out of the box in the default configuration file.
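In solrconfig.xml, the relevant entry looks like this:

    <requestHandler name="/update" class="solr.UpdateRequestHandler" />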

Adding Documents

Adding XML documents

The XML schema recognized by the update handler for adding documents is very straightforward:

  • The <add> element introduces one or more documents to be added.
  • The <doc> element introduces the fields making up a document.
  • The <field> element presents the content for a specific field.

For example:
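(The field names below are illustrative; they must be defined in your schema or match a dynamic field.)

    <add>
      <doc>
        <!-- illustrative fields; replace with fields from your schema -->
        <field name="id">book-1</field>
        <field name="title">Solr Indexing Basics</field>
        <field name="author">Jane Doe</field>
      </doc>
    </add>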

 

Delete Operations

Documents can be deleted from the index in two ways. “Delete by ID” deletes the document with the specified ID, and can be used only if a UniqueID field has been defined in the schema. “Delete by Query” deletes all documents matching a specified query, although commitWithin is ignored for a Delete by Query. A single delete message can contain multiple delete operations.
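A minimal delete message combining both forms might look like this (the ID and query values are illustrative):

    <delete>
      <!-- delete by ID and by query in one message -->
      <id>book-1</id>
      <query>author:"Jane Doe"</query>
    </delete>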

Rollback Operations

The rollback command rolls back all adds and deletes made to the index since the last commit. It neither calls any event listeners nor creates a new searcher. Its syntax is simple: <rollback/>.

Grouping Operations

You can post several commands in a single XML file by grouping them with the surrounding <update> element.
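For example (the document IDs are illustrative):

    <update>
      <!-- add one document and delete another in a single message -->
      <add>
        <doc>
          <field name="id">book-2</field>
        </doc>
      </add>
      <delete>
        <id>book-1</id>
      </delete>
    </update>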

Using curl to Perform Updates

You can use the curl utility to perform any of the above commands, using its --data-binary option to append the XML message to the curl command, generating an HTTP POST request. For example:
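(The collection name and payload are illustrative.)

    # POST an inline XML add message to the update handler
    curl http://localhost:8983/solr/test_col/update -H "Content-Type: text/xml" \
      --data-binary '<add><doc><field name="id">book-1</field></doc></add>'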

For posting XML messages contained in a file, you can use the alternative form:
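Here myfile.xml stands in for your own file:

    # POST the contents of a local XML file
    curl http://localhost:8983/solr/test_col/update -H "Content-Type: text/xml" \
      --data-binary @myfile.xml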

Short requests can also be sent using an HTTP GET, if enabled via the requestDispatcher element of solrconfig.xml, by URL-encoding the request, as in the following example. Note the escaping of “<” and “>”:
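This sends a <commit/> command; %3C and %3E are the URL-encoded angle brackets:

    # commit via an URL-encoded GET request
    curl "http://localhost:8983/solr/test_col/update?stream.body=%3Ccommit/%3E"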

JSON Formatted Index Updates

Solr can accept JSON that conforms to a defined structure, or can accept arbitrary JSON-formatted documents. If sending arbitrarily formatted JSON, there are some additional parameters that need to be sent with the update request, described below in the section Transforming and Indexing Custom JSON.

Solr-Style JSON

JSON formatted update requests may be sent to Solr’s /update handler using Content-Type: application/json or Content-Type: text/json.

JSON formatted updates can take three basic forms, described in depth below:

Adding a Single JSON Document

The simplest way to add documents via JSON is to send each document individually as a JSON object, using the /update/json/docs path:
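For example, with curl (the document fields are illustrative):

    # send one JSON document to the dedicated /update/json/docs path
    curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/test_col/update/json/docs' \
      --data-binary '{ "id": "1", "title": "Doc 1" }'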

Adding Multiple JSON Documents

Adding multiple documents at one time via JSON can be done via a JSON Array of JSON Objects, where each object represents a document:
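For example, again with illustrative documents; note the plain /update path rather than /update/json/docs:

    # send a JSON array of documents to the update handler
    curl -X POST -H 'Content-Type: application/json' \
      'http://localhost:8983/solr/test_col/update' \
      --data-binary '[
        { "id": "1", "title": "Doc 1" },
        { "id": "2", "title": "Doc 2" }
      ]'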

A sample JSON file is provided at example/exampledocs/books.json in the Solr distribution; it contains an array of objects that you can add to the test_col collection:
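One way to load it (commit=true makes the documents searchable immediately):

    # post the bundled sample file and commit in one request
    curl 'http://localhost:8983/solr/test_col/update?commit=true' \
      -H 'Content-Type: application/json' \
      --data-binary @example/exampledocs/books.json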

So, this is it for today about indexing operations in Solr. More on this will be covered in the next post.