Watson: How Does It Work?

Watson is a cognitive technology that processes information much more like a smart human than a smart computer. Rather than fearing that humans will be replaced by a computer, you should recognize that this is, in fact, a huge opportunity. In 2011, you may recall, Watson summarily bested its human competitors on Jeopardy. No other technology on the market today possesses these combined capabilities.

Unlike typical computers, Watson can unlock the vast world of unstructured data that makes up as much as 80 percent of existing information today. Watson knows that all data is not created equal. It culls relevant data from disparate sources, and it creates hypotheses and continually tests them in order to narrow in on the most reliable and accurate results.

Because Watson can read, analyze, and learn from natural language, just as humans can, it can make the sorts of informed, context-specific decisions we would expect from a person rather than a search engine. Such as what to cook for dinner: once it understood the principles of taste and cuisine style, as well as the intricate mechanics of flavor combinations, it was able to generate new, creative dishes.

In primary search, the goal is to find as much potentially answer-bearing content as possible based on the results of question analysis. The focus is squarely on recall, with the expectation that the host of deeper content analytics will extract answer candidates and score this content, plus whatever evidence can be found in support or refutation of candidates, to drive up the precision.

Over the course of the project we continued to conduct empirical studies designed to balance speed, recall, and precision. These studies allowed us to regularly tune the system to find the number of search results and candidates that produced the best balance of accuracy and computational resources.

The operative goal for primary search eventually stabilized at about 85 percent binary recall for the top candidates; that is, the system generates the correct answer as a candidate for 85 percent of the questions somewhere within the top-ranked candidates. A variety of search techniques are used, including the use of multiple text search engines with different underlying approaches (for example, Indri and Lucene), document search as well as passage search, knowledge base search using SPARQL on triple stores, the generation of multiple search queries for a single question, and backfilling hit lists to satisfy key constraints identified in the question.
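
As a rough illustration of this recall-first search stage, the sketch below fans several query formulations out over two stub search backends (stand-ins for engines such as Lucene and Indri, not their real APIs) and simply unions the hits, leaving precision to later stages.

```python
# A minimal sketch of recall-oriented primary search, not Watson's implementation:
# fan several query formulations out over multiple hypothetical search backends and
# union the hits, leaving precision to the downstream scorers. Backends are stubs.

def lucene_search(query):
    # Stand-in for a keyword-style engine such as Lucene.
    return [f"lucene-doc-{i}: {query}" for i in range(3)]

def indri_search(query):
    # Stand-in for a language-model-based engine such as Indri.
    return [f"indri-passage-{i}: {query}" for i in range(3)]

def primary_search(queries):
    """Union results from all engines for all query formulations (recall first)."""
    hits, seen = [], set()
    for query in queries:
        for engine in (lucene_search, indri_search):
            for hit in engine(query):
                if hit not in seen:          # de-duplicate, but keep everything else
                    seen.add(hit)
                    hits.append(hit)
    return hits

if __name__ == "__main__":
    # Several query formulations may be generated from a single question.
    print(primary_search(["chile longest land border", "country bordering chile"]))
```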

Triple store queries in primary search are based on named entities in the clue (for example, find all database entities related to the clue entities) or on more focused queries in cases where a semantic relation was detected.
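
The following sketch shows the general shape such a triple store query might take; the graph, predicate pattern, and entity URIs are hypothetical placeholders rather than Watson's actual schema.

```python
# Illustrative construction of a SPARQL query that finds entities related to the
# named entities detected in a clue. The entity URIs and predicate pattern are
# hypothetical; a real system would bind them to its own triple store schema.

def related_entities_query(clue_entities):
    values = " ".join(f"<http://example.org/entity/{e}>" for e in clue_entities)
    return f"""
SELECT DISTINCT ?candidate WHERE {{
  VALUES ?clueEntity {{ {values} }}
  {{ ?clueEntity ?p ?candidate }} UNION {{ ?candidate ?p ?clueEntity }}
  FILTER(isIRI(?candidate))
}}
LIMIT 200
"""

if __name__ == "__main__":
    print(related_entities_query(["Chile", "Argentina"]))
```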

The search results feed into candidate generation, where techniques appropriate to the kind of search results are applied to generate candidate answers.

The system may generate a number of candidate answer variants from the same title based on substring analysis or link analysis if the underlying source contains hyperlinks. Passage search results require more detailed analysis of the passage text to identify candidate answers. For example, named entity detection may be used to extract candidate answers from the passage. Some sources, such as a triple store and reverse dictionary lookup, produce candidate answers directly as their search result.
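
A toy version of this candidate-generation step might look like the following; the title-variant and capitalized-phrase heuristics are invented stand-ins for the substring, link, and named entity analysis described above.

```python
import re

# A rough sketch of candidate generation, with invented helpers: document titles
# become candidates (plus substring variants), and passages are mined with a crude
# capitalized-phrase matcher standing in for real named entity detection.

def title_candidates(title):
    base = title.split("(")[0].strip()    # "Abraham Lincoln (politician)" -> "Abraham Lincoln"
    variants = {title.strip(), base}
    variants.update(base.split())         # substring variants such as "Lincoln"
    return variants

def passage_candidates(passage):
    # Capitalized word sequences as a stand-in for named entity detection.
    return {m.strip() for m in re.findall(r"(?:[A-Z][a-z]+\s?)+", passage)}

if __name__ == "__main__":
    print(title_candidates("Abraham Lincoln (politician)"))
    print(passage_candidates("Ford pardoned Nixon on September 8, 1974."))
```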

If the correct answer(s) are not generated at this stage as candidates, the system has no hope of answering the question. This step therefore significantly favors recall over precision, with the expectation that the rest of the processing pipeline will tease out the correct answer, even if the set of candidates is quite large.

One of the goals of the system design, therefore, is to tolerate noise in the early stages of the pipeline and drive up precision downstream.

A key step in managing the resource-versus-precision trade-off is the application of lightweight (less resource-intensive) scoring algorithms to a larger set of initial candidates to prune them down to a smaller set before the more intensive scoring components see them. For example, a lightweight scorer may compute the likelihood of a candidate answer being an instance of the LAT (lexical answer type). We call this step soft filtering. The system combines these lightweight analysis scores into a soft filtering score.

Candidate answers that pass the soft filtering threshold proceed to hypothesis and evidence scoring, while those candidates that do not pass the filtering threshold are routed directly to the final merging stage.

The soft filtering scoring model and filtering threshold are determined based on machine learning over training data. Watson currently lets roughly 100 candidates pass the soft filter, but this is a parameterizable function.
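
A minimal sketch of soft filtering, assuming hand-set weights in place of the learned model, might look like this; the feature names and cutoff are illustrative only.

```python
# A minimal sketch of soft filtering under assumed, hand-set weights (in Watson the
# model and threshold are learned from training data): combine a few lightweight
# scores, such as a cheap LAT type-match likelihood, and let only the top-scoring
# candidates through to the expensive deep-scoring components.

WEIGHTS = {"lat_type_match": 2.0, "search_rank": -0.05, "popularity": 0.5}
BIAS = -1.0

def soft_filter_score(features):
    return BIAS + sum(WEIGHTS[name] * value for name, value in features.items())

def soft_filter(candidates, keep=100):
    """candidates: list of (answer, features). Keep the 'keep' best-scoring ones."""
    ranked = sorted(candidates, key=lambda c: soft_filter_score(c[1]), reverse=True)
    passed, deferred = ranked[:keep], ranked[keep:]
    return passed, deferred   # deferred candidates skip deep scoring and go to final merging

if __name__ == "__main__":
    cands = [("Argentina", {"lat_type_match": 0.9, "search_rank": 3, "popularity": 0.4}),
             ("Bolivia",   {"lat_type_match": 0.9, "search_rank": 1, "popularity": 0.9})]
    print(soft_filter(cands, keep=1))
```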

Candidate answers that pass the soft filtering threshold undergo a rigorous evaluation process that involves gathering additional supporting evidence for each candidate answer, or hypothesis, and applying a wide variety of deep scoring analytics to evaluate the supporting evidence.

To better evaluate each candidate answer that passes the soft filter, the system gathers additional supporting evidence. The architecture supports the integration of a variety of evidence-gathering techniques.

One particularly effective technique is passage search where the candidate answer is added as a required term to the primary search query derived from the question. This will retrieve passages that contain the candidate answer used in the context of the original question terms. Supporting evidence may also come from other sources like triple stores. The retrieved supporting evidence is routed to the deep evidence scoring components, which evaluate the candidate answer in the context of the supporting evidence.
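
The sketch below illustrates the idea of adding the candidate as a required term to the question-derived query; the "+" query syntax and the passage_search stub are assumptions, not a real engine's API.

```python
# A sketch of supporting-evidence retrieval with an invented query syntax and a stub
# search backend: the candidate answer is appended as a required term to the primary
# search query, so retrieved passages show the candidate in the context of the
# original question terms.

def evidence_query(question_terms, candidate):
    return " ".join(question_terms) + f' +"{candidate}"'   # '+' marks a required term here

def passage_search(query, k=10):
    # Stand-in for a real passage-retrieval engine.
    return [f"passage {i} matching: {query}" for i in range(min(k, 2))]

if __name__ == "__main__":
    terms = ["Chile", "longest", "land", "border"]
    for candidate in ("Argentina", "Bolivia"):
        print(passage_search(evidence_query(terms, candidate)))
```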

The scoring step is where the bulk of the deep content analysis is performed. Scoring algorithms determine the degree of certainty that retrieved evidence supports the candidate answers. The DeepQA framework supports and encourages the inclusion of many different components, or scorers, that consider different dimensions of the evidence and produce a score that corresponds to how well evidence supports a candidate answer for a given question.

DeepQA provides a common format for the scorers to register hypotheses (for example, candidate answers) and confidence scores, while imposing few restrictions on the semantics of the scores themselves; this enables DeepQA developers to rapidly deploy, mix, and tune components to support each other.
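
A toy version of such a scorer contract might look like the following; the two scorers and their feature names are invented for illustration.

```python
from typing import Callable, Dict, List

# A toy illustration of a common scorer contract of the kind described: each scorer
# inspects a candidate plus its supporting passages and emits named feature scores,
# with no constraint on what the scores mean. Scorer names and features are invented.

Scorer = Callable[[str, str, List[str]], Dict[str, float]]
SCORERS: List[Scorer] = []

def register(scorer: Scorer) -> Scorer:
    SCORERS.append(scorer)
    return scorer

@register
def term_overlap_scorer(question, candidate, passages):
    q_terms = set(question.lower().split())
    overlap = [len(q_terms & set(p.lower().split())) / max(len(q_terms), 1) for p in passages]
    return {"passage_term_overlap": max(overlap, default=0.0)}

@register
def mention_count_scorer(question, candidate, passages):
    return {"candidate_mention_count": float(sum(candidate.lower() in p.lower() for p in passages))}

def score_candidate(question, candidate, passages):
    features = {}
    for scorer in SCORERS:               # the real system runs dozens of such scorers
        features.update(scorer(question, candidate, passages))
    return features

if __name__ == "__main__":
    passages = ["Chile shares a long land border with Argentina."]
    print(score_candidate("Chile shares its longest land border with this country",
                          "Argentina", passages))
```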

For example, Watson employs more than 50 scoring components that produce scores ranging from formal probabilities to counts to categorical features, based on evidence from different types of sources including unstructured text, semistructured text, and triple stores. One type of passage scoring measures the alignment of the logical forms of the question and passage. A logical form is a graphical abstraction of text in which nodes are terms in the text and edges represent grammatical relationships (for example, Hermjakob, Hovy, and Lin; Moldovan et al.), deep semantic relationships, or both.

The logical form alignment identifies Nixon as the object of the pardoning in the passage and recognizes that the question is asking for the object of a pardoning.

Another type of scorer uses knowledge in triple stores and simple reasoning such as subsumption and disjointness in type taxonomies, as well as geospatial and temporal reasoning. Geospatial reasoning is used in Watson to detect the presence or absence of spatial relations such as directionality, borders, and containment between geoentities. For example, if a question asks for an Asian city, then spatial containment provides evidence that Beijing is a suitable candidate, whereas Sydney is not.

Similarly, geocoordinate information associated with entities is used to compute relative directionality (for example, California is SW of Montana; the GW Bridge is N of the Lincoln Tunnel; and so on).
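
A small sketch of this kind of geocoordinate reasoning is shown below: it computes the standard initial bearing between two (approximate) latitude/longitude pairs and maps it to a coarse compass label.

```python
import math

# Sketch of geocoordinate-based relative directionality: turn two (lat, lon) pairs
# into a compass bearing and a coarse direction label. The coordinates below are
# approximate centroids; the formula is the standard initial great-circle bearing.

def initial_bearing(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    x = math.sin(dlon) * math.cos(phi2)
    y = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return (math.degrees(math.atan2(x, y)) + 360) % 360

def compass_label(bearing):
    labels = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
    return labels[int((bearing + 22.5) // 45) % 8]

if __name__ == "__main__":
    montana = (46.9, -110.4)
    california = (36.8, -119.4)
    bearing = initial_bearing(*montana, *california)   # direction of California from Montana
    print(compass_label(bearing))                       # prints roughly "SW"
```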

Temporal reasoning is used in Watson to detect inconsistencies between dates in the clue and those associated with a candidate answer. We cannot do this work justice here. It is important to note, however, that at this point no one algorithm dominates.

To help developers and users get a sense of how Watson uses evidence to decide between competing candidate answers, scores are combined into an overall evidence profile. The evidence profile groups individual features into aggregate evidence dimensions that provide a more intuitive view of the feature group.

Each aggregate dimension is a combination of related feature scores produced by the specific algorithms that fired on the gathered evidence.
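
A simplified sketch of this grouping, with invented feature names and an unweighted sum standing in for the learned combination, might look like this (the dimension names anticipate the Chile example discussed next):

```python
# A small sketch of aggregating individual feature scores into evidence-profile
# dimensions. The feature names and the unweighted sum are invented; the real
# system combines learned feature scores produced by its scoring components.

DIMENSIONS = {
    "Popularity":         ["search_engine_score", "candidate_mention_count"],
    "Location":           ["geospatial_containment", "directionality_match"],
    "Passage support":    ["passage_term_overlap", "logical_form_alignment"],
    "Source reliability": ["source_reliability_score"],
}

def evidence_profile(features):
    return {dim: sum(features.get(name, 0.0) for name in names)
            for dim, names in DIMENSIONS.items()}

if __name__ == "__main__":
    argentina = {"search_engine_score": 0.4, "geospatial_containment": 0.9,
                 "passage_term_overlap": 0.8, "logical_form_alignment": 0.7,
                 "source_reliability_score": 0.9}
    bolivia = {"search_engine_score": 0.9, "candidate_mention_count": 0.8,
               "passage_term_overlap": 0.3, "source_reliability_score": 0.5}
    print("Argentina:", evidence_profile(argentina))
    print("Bolivia:  ", evidence_profile(bolivia))
```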

Consider the following question: Chile shares its longest land border with this country. In figure 8 we see a comparison of the evidence profiles for two candidate answers produced by the system for this question: Argentina and Bolivia. Simple search engine scores favor Bolivia as an answer, due to a popular border dispute that was frequently reported in the news.

Watson prefers Argentina (the correct answer) over Bolivia, and the evidence profile shows why. Although Bolivia does have strong popularity scores, Argentina has strong support in the geospatial, passage support (for example, alignment and logical form graph matching of various text passages), and source reliability dimensions.

Figure 8. Evidence Profiles for Two Candidate Answers. Dimensions are on the x-axis and relative strength is on the y-axis.

It is one thing to return documents that contain key words from the question. It is quite another, however, to analyze the question and the content deeply enough to identify the precise answer, and yet another to determine an accurate enough confidence in its correctness to bet on it.

Winning at Jeopardy requires exactly that ability. The goal of final ranking and merging is to evaluate the hundreds of hypotheses based on potentially hundreds of thousands of scores to identify the single best-supported hypothesis given the evidence and to estimate its confidence — the likelihood it is correct.

Multiple candidate answers for a question may be equivalent despite very different surface forms. This is particularly confusing to ranking techniques that make use of relative differences between candidates. Without merging, ranking algorithms would be comparing multiple surface forms that represent the same answer and trying to discriminate among them.

While one line of research proposes boosting confidence in similar candidates (Ko, Nyberg, and Luo), our approach is inspired by the observation that different surface forms are often disparately supported in the evidence and result in radically different, though potentially complementary, scores. This motivates an approach that merges answer scores before ranking and confidence estimation. Using an ensemble of matching, normalization, and coreference resolution algorithms, Watson identifies equivalent and related hypotheses (for example, Abraham Lincoln and Honest Abe) and then enables custom merging per feature to combine scores.
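
A minimal sketch of this merging step, with a hand-written alias table and a per-feature max standing in for the custom merging logic, might look like this:

```python
# Sketch of answer merging: map equivalent surface forms (for example "Abraham
# Lincoln" and "Honest Abe") to one canonical answer and combine their scores per
# feature. The alias table is hand-written and max() is only a stand-in for the
# per-feature custom merging the text describes.

ALIASES = {"honest abe": "Abraham Lincoln", "abe lincoln": "Abraham Lincoln"}

def canonical(answer):
    return ALIASES.get(answer.lower().strip(), answer.strip())

def merge_candidates(candidates):
    """candidates: list of (surface_form, features) -> dict of canonical answer -> merged features."""
    merged = {}
    for surface, features in candidates:
        target = merged.setdefault(canonical(surface), {})
        for name, value in features.items():
            target[name] = max(target.get(name, float("-inf")), value)
    return merged

if __name__ == "__main__":
    cands = [("Honest Abe", {"passage_term_overlap": 0.7, "popularity": 0.2}),
             ("Abraham Lincoln", {"passage_term_overlap": 0.5, "popularity": 0.9})]
    print(merge_candidates(cands))
```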

After merging, the system must rank the hypotheses and estimate confidence based on their merged scores. We adopted a machine-learning approach that requires running the system over a set of training questions with known answers and training a model based on the scores.

One could assume a very flat model and apply existing ranking algorithms (for example, Herbrich, Graepel, and Obermayer; Joachims) directly to these score profiles and use the ranking score for confidence.

For more intelligent ranking, however, ranking and confidence estimation may be separated into two phases. In both phases, sets of scores may be grouped according to their domain (for example, type matching, passage scoring, and so on). Motivated by hierarchical techniques such as mixture of experts (Jacobs et al.), intermediate models are trained over these groups, and the system uses them to produce an ensemble of intermediate scores. This approach allows for iteratively enhancing the system with more sophisticated and deeper hierarchical models while retaining flexibility for robustness and experimentation as scorers are modified and added to the system.
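
As a rough sketch of the "flat model" option mentioned above (not Watson's actual learner), one could fit a logistic regression over labeled candidate score vectors and read the predicted probability off as the confidence; the toy data below assumes scikit-learn and numpy are available.

```python
# A minimal sketch, not Watson's learner: train a logistic regression over merged
# candidate feature vectors labeled correct/incorrect, then rank candidates by the
# predicted probability and use that probability as the confidence estimate.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data: each row is one candidate's merged scores; labels mark correctness.
X_train = np.array([[0.9, 0.8, 0.9],
                    [0.7, 0.2, 0.4],
                    [0.8, 0.9, 0.8],
                    [0.3, 0.6, 0.2]])
y_train = np.array([1, 0, 1, 0])

model = LogisticRegression().fit(X_train, y_train)

def rank_with_confidence(candidates):
    """candidates: dict answer -> feature vector. Returns answers sorted by confidence."""
    names = list(candidates)
    probs = model.predict_proba(np.array([candidates[n] for n in names]))[:, 1]
    return sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    print(rank_with_confidence({"Argentina": [0.85, 0.9, 0.8],
                                "Bolivia":   [0.9, 0.3, 0.5]}))
```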

Finally, an important consideration in dealing with NLP-based scorers is that the features they produce may be quite sparse, and so accurate confidence estimation requires the application of confidence-weighted learning techniques (Dredze, Crammer, and Pereira).

UIMA was designed to support interoperability and scaleout of text and multimodal analysis applications. DeepQA's components are implemented as UIMA annotators: software components that analyze text and produce annotations or assertions about the text.

Watson has evolved over time and the number of components in the system has reached into the hundreds. UIMA facilitated rapid component integration, testing, and evaluation.

Early implementations of Watson ran on a single processor, where it took 2 hours to answer a single question. The DeepQA computation is embarrassingly parallel, however.
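
Because each candidate's evidence scoring is independent, even a toy version parallelizes trivially; the sketch below spreads a stub scoring function over local cores with the standard library, which is only a stand-in for the cluster-level scale-out Watson actually required.

```python
# Toy illustration of the embarrassingly parallel structure: score candidates
# independently across local processes. The deep_score function is a stub; the real
# system distributes far heavier analytics across a cluster rather than one machine.

from multiprocessing import Pool

def deep_score(candidate):
    # Stand-in for an expensive per-candidate evidence-scoring computation.
    return candidate, (sum(ord(ch) for ch in candidate) % 100) / 100.0

if __name__ == "__main__":
    candidates = ["Argentina", "Bolivia", "Peru", "Chile"]
    with Pool(processes=4) as pool:
        print(dict(pool.map(deep_score, candidates)))
```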

To preprocess the corpus and create fast run-time indices we used Hadoop. Hadoop distributes the content over the cluster to afford high CPU utilization and provides convenient tools for deploying, managing, and monitoring the corpus analysis process.

Jeopardy demands strategic game play to match wits against the best human players. In a typical Jeopardy game, Watson faces the following strategic decisions: deciding whether to buzz in and attempt to answer a question, selecting squares from the board, and wagering on Daily Doubles and Final Jeopardy.

The workhorse of strategic decisions is the buzz-in decision, which is required for every non-Daily Double clue on the board. Unlike Final Jeopardy, a Daily Double wager is made while the game is still in progress, so evaluating the wager requires forecasting the effect it will have on the distant, final outcome of the game. These challenges drove the construction of statistical models of players and games, game-theoretic analyses of particular game scenarios and strategies, and the development and application of reinforcement-learning techniques for Watson to learn its strategy for playing Jeopardy.
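
As a toy illustration of the simplest of these decisions, the rule below buzzes only when the top answer's estimated confidence clears a threshold; the threshold and expected-gain framing are arbitrary assumptions, not Watson's learned strategy.

```python
# A toy buzz-in rule, not Watson's learned strategy: buzz only when the top answer's
# estimated confidence clears a threshold, here framed via the expected change in
# score (win the clue value if right, lose it if wrong). Values are arbitrary.

def should_buzz(confidence, clue_value, threshold=0.6):
    expected_gain = clue_value * (2 * confidence - 1)   # expected score change from buzzing
    return confidence >= threshold and expected_gain > 0

if __name__ == "__main__":
    for confidence in (0.3, 0.55, 0.9):
        print(confidence, should_buzz(confidence, clue_value=800))
```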

Fortunately, moderate amounts of historical data are available to serve as training data for learning techniques. Even so, strategy development requires extremely careful modeling and game-theoretic evaluation, as the game of Jeopardy has incomplete information and uncertainty to model, critical score boundaries to recognize, and savvy, competitive players to account for.

It is a game where one faulty strategic choice can lose the entire match.

After many nonstarters, by the fourth quarter of 2007 we finally adopted the DeepQA architecture. We instituted a host of disciplined engineering and experimental methodologies, supported by metrics and tools, to ensure we were investing in techniques that promised significant impact on end-to-end metrics.

Since then, modulo some early jumps in performance, the progress has been incremental but steady. Progress has slowed in recent months as the remaining challenges prove either very difficult or highly specialized, covering only small phenomena in the data.

By the end of 2008 we were performing reasonably well, with about 70 percent precision at 70 percent attempted over the 12,000-question blind data, but it was taking 2 hours to answer a single question on a single CPU. We are currently answering more than 85 percent of the questions in 5 seconds or less, fast enough to provide competitive performance, and with continued algorithmic development we are performing with about 85 percent precision at 70 percent attempted.
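
The "precision at 70 percent attempted" figures above can be computed as in the sketch below: sort questions by the system's confidence, attempt only the top fraction, and measure precision over those attempts (the data here is toy data).

```python
# Sketch of the "precision at N percent attempted" metric: attempt only the questions
# the system is most confident about and measure precision over those attempts.

def precision_at_attempted(results, attempted_fraction):
    """results: list of (confidence, is_correct). Precision over the most-confident
    'attempted_fraction' of questions."""
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    n = max(1, int(round(attempted_fraction * len(ranked))))
    attempted = ranked[:n]
    return sum(correct for _, correct in attempted) / n

if __name__ == "__main__":
    toy = [(0.9, True), (0.8, True), (0.75, False), (0.6, True), (0.2, False)]
    print(precision_at_attempted(toy, 0.7))   # precision over the top 70 percent attempted
```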

We have more to do in order to improve precision, confidence, and speed enough to compete with grand champions. We are seeing great results from leveraging the DeepQA architecture's capability to quickly admit and evaluate the impact of new algorithms as we engage more university partnerships to help meet the challenge.

In particular, we added question-analysis components from PIQUANT and OpenEphyra that identify answer types for a question, and candidate answer-generation components that identify instances of those answer types in the text.

The DeepQA framework utilized both sets of components despite their different type systems — no ontology integration was performed. The identification and integration of these domain specific components into DeepQA took just a few weeks.

Figure 10 shows the results of the adaptation experiment. The DeepQA system at the time had accuracy above 50 percent on Jeopardy. We later repeated the adaptation experiment, and in addition to the intervening improvements to DeepQA, the adaptation included a transfer learning step for TREC questions from a model trained on Jeopardy questions.

The result performed significantly better than the original complete systems on the task for which they were designed. While this is just one adaptation experiment, it is exactly the sort of behavior we think an extensible QA system should exhibit: it should quickly absorb domain- or task-specific components and get better on that target task without degrading performance in the general case or on prior tasks.

After 3 years of intense research and development by a core team of about 20 researchers, Watson is performing at human expert levels in terms of precision, confidence, and speed at the Jeopardy quiz show.

Our results strongly suggest that DeepQA is an effective and extensible architecture that may be used as a foundation for combining, deploying, evaluating, and advancing a wide range of algorithmic techniques to rapidly advance the field of QA.

The architecture and methodology developed as part of this project have highlighted the need to take a systems-level approach to research in QA, and we believe this applies to research in the broader field of AI. We have developed many different algorithms for addressing different kinds of problems in QA and plan to publish many of them in more detail in the future.

However, no one algorithm solves challenge problems like this. End-to-end systems tend to involve many complex and often overlapping interactions. A system design and methodology that facilitated the efficient integration and ablation studies of many probabilistic components was essential for our success to date.

The impact of any one algorithm on end-to-end performance changed over time as other techniques were added and had overlapping effects. Our commitment to regularly evaluate the effects of specific techniques on end-to-end performance, and to let that shape our research investment, was necessary for our rapid progress. Rapid experimentation was another critical ingredient in our success. The team conducted thousands of independent experiments in 3 years, each consuming on the order of thousands of CPU hours and generating more than 10 GB of error-analysis data.

Watson is holding its own, winning 64 percent of the games it plays, but it has to be improved and sped up to compete favorably against the very best. We have leveraged our collaboration with CMU and with our other university partnerships in getting this far and hope to continue our collaborative work to drive Watson to its final goal, and help openly advance QA research.

We would like to acknowledge the talented team of research scientists and engineers at IBM and at partner universities for the incredible work they are doing to influence and develop all aspects of Watson and the DeepQA architecture. It is this team that is responsible for the work described in this paper.

The good news for small and midsize companies is that they can use Watson, too; IBM offers a cloud-based version of Watson that companies can pay for by subscription or on demand. Companies that can afford an investment of multiple millions of dollars can purchase an in-house IBM Watson system, which consists of multiple servers tethered together into a processing cluster.

For companies without these resources, Watson can be accessed through the IBM cloud. For example, IBM offers a software developer's cloud powered by Watson, as well as a cloud-based global healthcare analytics service.

The NAO Robot is also powered by Watson and uses its powerful natural language processing capabilities.


