Cognitive Search
What’s Cognitive Search?
We are still at an early stage of an evolving market, where many players try to position their offering as “cognitive”. According to the Cognitive Computing Consortium, a cognitive search information system is capable of extracting relevant information from large and diverse data sets for users in their work context.
In this era of cognitive computing, new search solutions combine powerful indexing technology with advanced natural language processing capabilities, machine learning algorithms and other cognitive skills in order to build an increasingly deep corpus of knowledge from which to feed relevant information to users in real time. These cognitively enabled platforms interact with users in a more natural fashion, learn and improve as they gain more experience with data and user behavior, and proactively establish links between related data from various sources, both internal and external.
The following Forrester diagram shows how search is evolving from keyword search to cognitive search:
A new generation of search solutions that employ AI technologies can ingest, understand, organize, and query digital content from multiple data sources.
Key features of cognitive search to date:
- Handling a multitude of data sources and types: search is no longer just about unstructured text contained in documents and web pages. Cognitive search solutions can also accommodate structured data contained in databases and even nontraditional enterprise data like images, video, audio, and machine data.
- Using AI: the distinguishing characteristic of cognitive search solutions is that they use natural language processing and machine learning to understand and organize data, predict the intent of the search query, and improve the relevancy of results, tuning it automatically over time.
- Adding value to consumer products: cognitive search can be embedded in consumer applications and in business process applications. Virtual assistants would be useless without powerful search behind the scenes, and enterprises wishing to build similar applications for their customers will also benefit from cognitive search solutions. These solutions provide APIs that allow developers to embed the power of the search engine in other applications.
Cognitive search is developing extremely quickly, driven by multi-billion-dollar investments from major players such as Microsoft and Google.
Microsoft announced Cognitive Search, an AI-first approach to content understanding. Cognitive Search is a preview feature of Azure Search with built-in Cognitive Services. It pulls data from a variety of Azure data sources and applies a set of composable cognitive skills that extract knowledge. This knowledge is then organized and stored in a search index, enabling new experiences for exploring the data.
Real-world data is messy. It often spans media types such as text documents, PDF files, images and databases, it changes constantly, and it carries valuable knowledge in ways that are not readily usable. The typical solution pattern for this is a data ingestion, enrichment and exploration model. Each of these stages brings its own challenges, from large-scale change tracking to file format support, and even composition of multiple AI models. Developers can do this today, but it takes a huge amount of effort, requires branching into multiple unrelated domains (from cracking PDFs to handling AI model composition), and distracts from the primary goal. This is where Cognitive Search comes in.
Cognitive search adds AI to indexing workloads: data extraction, natural language processing, and image processing during indexing find the latent information in unstructured or non-searchable content and make it searchable in Azure Search.
AI is integrated through cognitive skills, which enrich source documents in a sequence of steps en route to a search index.
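To make the idea of sequential enrichment concrete, here is a minimal sketch in plain Python. It is not how Azure Search is implemented: the two toy skills, the field names `language` and `candidate_entities`, and the `run_pipeline` helper are all assumptions made up for illustration. Each skill takes a document, adds a field, and hands the result to the next skill.

```python
import re
from typing import Callable, Dict, List

# A document is a plain dict of fields; a skill returns an enriched copy.
Document = Dict[str, object]
Skill = Callable[[Document], Document]


def language_skill(doc: Document) -> Document:
    """Toy stand-in for language detection (hypothetical, not an Azure skill)."""
    text = str(doc.get("content", ""))
    enriched = dict(doc)
    is_english = re.search(r"\b(the|and|of)\b", text, re.IGNORECASE)
    enriched["language"] = "en" if is_english else "unknown"
    return enriched


def candidate_entities_skill(doc: Document) -> Document:
    """Toy stand-in for entity extraction: runs of capitalized words."""
    text = str(doc.get("content", ""))
    enriched = dict(doc)
    enriched["candidate_entities"] = re.findall(
        r"\b[A-Z][a-z]+(?:\s[A-Z][a-z]+)*\b", text
    )
    return enriched


def run_pipeline(doc: Document, skills: List[Skill]) -> Document:
    """Apply each skill in order; later skills see earlier enrichments."""
    for skill in skills:
        doc = skill(doc)
    return doc


source = {"id": "1", "content": "Alice Smith met the team at Central Station."}
enriched = run_pipeline(source, [language_skill, candidate_entities_skill])
print(enriched)
```

The enriched dictionary is what would be written to the index, so fields such as `candidate_entities` become searchable alongside the original content.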
I tried applying the Microsoft cognitive search service to one of my work document folders. The folder is quite messy, with PowerPoint presentations, images, and PDFs. I was interested in searching across documents of different formats, understanding which images relate to certain topics, and finding entities such as the people and organizations mentioned in the documents.
As a first step I set up an Azure Blob storage service and loaded my data with Azure Storage Explorer, a very useful tool that runs on your PC and connects your local data to cloud storage:
The enrichment pipeline of cognitive search pulls from my data source, which can be any data source supported by Azure Search. I then set up an Azure Search service and connected it to my blob storage:
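I did this step in the portal, but the same connection can also be described as a JSON document. The sketch below builds such a definition in Python. It follows the shape of an Azure Search blob data source as I understand it, so check it against the current REST reference before relying on it. The names `work-docs-source` and `work-docs`, and the connection string placeholder, are hypothetical.

```python
import json


def blob_data_source(name: str, connection_string: str, container: str) -> dict:
    """Build a data source definition that points Azure Search at a blob container."""
    return {
        "name": name,
        "type": "azureblob",  # data source type for Azure Blob storage
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


# Hypothetical names; replace the placeholder with a real connection string.
payload = blob_data_source(
    name="work-docs-source",
    connection_string="<storage-connection-string>",
    container="work-docs",
)
print(json.dumps(payload, indent=2))
```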
The most important step is to set up the cognitive skillset to apply during indexing:
Cognitive skills enrich the indexing pipeline. The portal offers predefined cognitive skills for image analysis and text analysis, and a skillset created in the portal operates over a single source field. If you want to extract a textual representation from files that consist mostly of scanned images, such as a PDF generated by a scanner, the cognitive search service can automatically extract content from the images embedded in the document. To do that, enable the OCR option. This automatically creates a merged_content field that contains both the text extracted from the document and the textual representation of the images embedded in it.
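For readers who prefer definitions over portal clicks, the following sketch shows roughly what the OCR option amounts to: an OCR skill over the extracted images, a merge skill that produces merged_content, and an entity recognition skill on top. This is my reconstruction, not an export from the portal. The skill type identifiers and input/output names follow the Azure Search skill reference as I recall it, they have changed between API versions, and the skillset and indexer names are hypothetical, so treat everything here as an assumption to verify.

```python
import json


def ocr_merge_skillset(name: str) -> dict:
    """Skillset sketch: OCR the images, merge text, then extract entities."""
    return {
        "name": name,
        "description": "OCR embedded images, merge with document text, find entities",
        "skills": [
            {   # Reads text from each image extracted from the document.
                "@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
                "context": "/document/normalized_images/*",
                "inputs": [{"name": "image", "source": "/document/normalized_images/*"}],
                "outputs": [{"name": "text", "targetName": "text"}],
            },
            {   # Inserts the OCR text into the document text at the image offsets.
                "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
                "context": "/document",
                "inputs": [
                    {"name": "text", "source": "/document/content"},
                    {"name": "itemsToInsert", "source": "/document/normalized_images/*/text"},
                    {"name": "offsets", "source": "/document/normalized_images/*/contentOffset"},
                ],
                "outputs": [{"name": "mergedText", "targetName": "merged_content"}],
            },
            {   # Assumed identifier; older API versions used different type names.
                "@odata.type": "#Microsoft.Skills.Text.V3.EntityRecognitionSkill",
                "context": "/document",
                "categories": ["Person", "Organization"],
                "inputs": [{"name": "text", "source": "/document/merged_content"}],
                "outputs": [
                    {"name": "persons", "targetName": "people"},
                    {"name": "organizations", "targetName": "organizations"},
                ],
            },
        ],
    }


def indexer_for_skillset(name: str, data_source: str, index: str, skillset: str) -> dict:
    """Indexer sketch: image extraction must be switched on for OCR to have input."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "skillsetName": skillset,
        "parameters": {
            "configuration": {
                "dataToExtract": "contentAndMetadata",
                "imageAction": "generateNormalizedImages",
            }
        },
        "outputFieldMappings": [
            {"sourceFieldName": "/document/merged_content", "targetFieldName": "merged_content"},
            {"sourceFieldName": "/document/people", "targetFieldName": "people"},
            {"sourceFieldName": "/document/organizations", "targetFieldName": "organizations"},
        ],
    }


skillset = ocr_merge_skillset("work-docs-skillset")  # hypothetical name
indexer = indexer_for_skillset(
    "work-docs-indexer", "work-docs-source", "work-docs-index", "work-docs-skillset"
)
print(json.dumps(skillset, indent=2))
print(json.dumps(indexer, indent=2))
```

The order of the skills in the list is for readability only; what matters is that each skill's inputs name the outputs of the skill it depends on.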
My first search was for “stations”, and I got the following raw results (I was content with JSON for my first experiment):
The result is very interesting: beyond the usual Azure Search results, the service identified all the people and organizations mentioned in the documents. In addition, merged_content gave me a list of all the images containing “stations”, together with their descriptions.
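As a rough sketch of how such a query and its JSON response could be handled in code, the snippet below builds the search URL and collects the people and organizations across the hits. The service name, index name and API version are placeholders, the field names `people` and `organizations` are assumptions about the index schema, and the sample response is invented purely to show the shape. It is not my actual result.

```python
from urllib.parse import urlencode


def build_search_url(service: str, index: str, query: str, api_version: str) -> str:
    """Build the URL of a simple search request against an Azure Search index.

    The real request also needs an `api-key` header, which is left out here.
    """
    params = urlencode({"api-version": api_version, "search": query})
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{params}"


def summarize_hits(response: dict) -> dict:
    """Collect the distinct people and organizations found across all hits."""
    people, organizations = set(), set()
    hits = response.get("value", [])
    for hit in hits:
        people.update(hit.get("people", []))
        organizations.update(hit.get("organizations", []))
    return {
        "documents": len(hits),
        "people": sorted(people),
        "organizations": sorted(organizations),
    }


# Placeholder service and index names; use the API version your service supports.
url = build_search_url("my-service", "work-docs-index", "stations", "2020-06-30")
print(url)

# Invented response with placeholder values, only to illustrate the structure.
sample_response = {
    "value": [
        {"people": ["Person A", "Person B"], "organizations": ["Org X"]},
        {"people": ["Person B"], "organizations": []},
    ]
}
print(summarize_hits(sample_response))
```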
This information is the basic building block of a solution like the “JFK Files”, in which cognitive search was applied to the JFK files. Late last year the US government released more than 34,000 pages related to the assassination of JFK. The files consisted of a mixture of typed and handwritten pages, 60 years old, which were scanned into PDF files, as well as scanned evidence photos. It was incredible to see what emerged: answers and relationships could be seen in context with the original documents. Take a look here.
Cognitive search is a journey, and the first steps have been taken.