

Artificial Intelligence is the future. If you follow the machine learning (ML) literature, new and incredible algorithms are developed daily, and the results speak for themselves. Over the past 10 years, the number of papers listed under the ML category on arXiv (an open pre-print service) has been growing at a rate faster than Moore’s law, with roughly 90 ML papers published per day (as of Oct. 2019). A new Medium article about AI (I’m using AI and ML interchangeably) is written once every 4 minutes (according to my non-scientific methodology). On any given day, one can find a new GitHub repo detailing a model that bests the score of the previous winner. Companies and investors alike are pouring billions of dollars into AI technologies, as they realize that AI can help solve common and fundamental problems: predictive medicine, gaming, playing doctor, self-driving cars, and protein folding, to highlight a select few (along with some applications without a clear human benefit). Google is now infusing an AI model into its core search algorithm: BERT, a neural network designed to understand natural language, which Google claims will improve the interpretation of query intent in roughly 10% of searches.

The realm of AI reminds me of Arthur C. Clarke’s observation:

“Any sufficiently advanced technology is indistinguishable from magic.” 

Here at Ombud, we’re busy building Ombrain, the AI behind our intelligent content collaboration platform. Ombrain is the machine-learning-powered tool suite that makes Ombud feel magical to our users. We believe users should focus only on high-value activities, such as creating and reviewing quality new content, and offload the mundane, low-value but resource-intensive tasks to Ombrain. Ombrain solves several real-world problems that every organization faces: surfacing the highest-quality, most appropriate content when it’s needed, understanding users’ domain expertise, ingesting documents, and suggesting improvements to content, to name a few. Building Ombrain requires developing various deep learning models, including ones that understand domain-specific language, user behavior, and document structure. We are actively applying proven deep learning techniques, such as transfer learning (i.e., standing on the shoulders of (typically tech) giants), to create a more magical product.
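To make “transfer learning” concrete, here is a generic sketch (not Ombrain’s actual architecture): a pretrained sentence encoder is reused as a frozen feature extractor, and only a small task-specific head is trained on top. The relevance-scoring head here is a hypothetical example.

```python
# A generic transfer-learning sketch (not Ombrain's actual code): reuse a
# pretrained sentence encoder and train only a small task-specific head.
import tensorflow as tf
import tensorflow_hub as hub

model = tf.keras.Sequential([
    # Pretrained Universal Sentence Encoder, frozen so its weights are
    # reused as-is; it maps a raw string to a 512-dimensional embedding.
    hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                   input_shape=[], dtype=tf.string, trainable=False),
    # Small trainable head for a downstream task, e.g. a relevance score.
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

Because the encoder’s weights stay frozen, the head can be trained on a relatively small labeled dataset while still benefiting from everything the encoder learned during pre-training.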

One use case for Ombrain is information retrieval (IR). A tried-and-tested method for IR involves matching common text (typically full words) between the query and candidate documents in a corpus, where term frequency and rarity contribute to a higher score, i.e., a more probable match. This method is called tf-idf, and it is baked into powerful search engines like Solr and Elasticsearch. However, it doesn’t take into account all of the data available (e.g., document importance à la PageRank), and it cannot understand query semantics (e.g., we soccer fanatics know that “Barcelona is at the top of the league” and “European soccer, especially Messi, is a joy to watch” relate semantically, even though they share no keywords). Enter transfer learning: open-source NLP models, such as Google’s Universal Sentence Encoder, have been trained on text-rich sources such as Wikipedia, open question-answer forums, news sites, and academic datasets, so they understand language semantics. The model converts words, sentences, and paragraphs (pretty much any text you throw at it) into a fixed-dimensional vector (512-dimensional in this case). We can write a simple Python script to show where deep-learning-based sentence encoding understands semantics in a way that tf-idf cannot.
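Here is a minimal sketch of that script. The four sentences are illustrative stand-ins for those in Figure 1, so the exact scores will differ slightly from the figure depending on preprocessing choices:

```python
# Compare tf-idf similarity with semantic (sentence-embedding) similarity.
import numpy as np
import tensorflow_hub as hub
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Barcelona is at the top of the league.",
    "European soccer, especially Messi, is a joy to watch.",
    "I especially enjoy training neural networks.",
    "Deep learning models are fun to build.",
]

# tf-idf: similarity comes only from overlapping (non-stop-word) tokens,
# e.g. "especially" shared by sentences 2 and 3.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
print(np.round(cosine_similarity(tfidf), 2))

# Universal Sentence Encoder: each sentence becomes a 512-dimensional
# vector whose geometry reflects meaning, so sentences 1 & 2 (and 3 & 4)
# score high even without shared keywords.
encode = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = encode(sentences).numpy()
print(np.round(cosine_similarity(embeddings), 2))
```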


Figure 1. On a 0-to-1 scale, where 0 represents no similarity and 1 represents perfect similarity (e.g., every sentence compared with itself), we compute tf-idf (above) and semantic (below) similarities between 4 sentences. (Above) The keyword “especially” overlaps between sentences 2 and 3, giving a score of 0.11. Stop words (e.g., “the”, “is”, “at”) are removed and do not contribute to the score; no other sentences share keywords, so all other pairs receive a score of 0. (Below) Semantic similarity is high between sentences 1 and 2 (0.48), as well as between sentences 3 and 4 (0.51), because the semantics within these pairs are much more similar than across pairs; for example, sentences 2 and 3 have almost no similarity (0.05).


Ombrain uses modules like the Universal Sentence Encoder to encode more information per candidate document than a tf-idf-based score alone. Coupling that information with user behavior, Ombrain has learned to re-rank the search engine’s top results using a more resource-intensive but more accurate methodology. This popular approach to reranking search results is called Learning to Rank, and it is one of Ombrain’s “magical” services. Stay tuned for more!
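In the meantime, here is a toy illustration of the reranking step (not Ombrain’s trained model, which also learns from user-behavior signals): take a keyword engine’s top candidates and reorder them by semantic similarity to the query.

```python
# Toy semantic reranker: reorder a search engine's top-k candidates by
# cosine similarity between the query and each candidate document.
# A real learning-to-rank model would instead be trained on many such
# features plus user-behavior signals.
import numpy as np
import tensorflow_hub as hub

encode = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def rerank(query, candidates):
    """Return candidates sorted by semantic similarity to the query."""
    vectors = encode([query] + list(candidates)).numpy()
    q, docs = vectors[0], vectors[1:]
    # Cosine similarity between the query vector and each document vector.
    scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    order = np.argsort(-scores)  # highest similarity first
    return [candidates[i] for i in order]
```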

If this sounds like fun to you, please apply – we’re hiring.
