{"id":12939,"date":"2023-08-19T01:54:22","date_gmt":"2023-08-19T01:54:22","guid":{"rendered":"https:\/\/isafespend.com\/news\/meta-openai-anthropic-and-cohere-a-i-models-all-make-stuff-up-heres-which-is-worst\/"},"modified":"2023-08-19T01:54:23","modified_gmt":"2023-08-19T01:54:23","slug":"meta-openai-anthropic-and-cohere-a-i-models-all-make-stuff-up-heres-which-is-worst","status":"publish","type":"post","link":"https:\/\/isafespend.com\/?p=12939","title":{"rendered":"Meta, OpenAI, Anthropic and Cohere A.I. models all make stuff up \u2014 here&#8217;s which is worst"},"content":{"rendered":"<div>\n<p>If the tech industry&#8217;s top AI models had superlatives, Microsoft-backed OpenAI&#8217;s GPT-4 would be best at math, Meta&#8217;s Llama 2 would be most middle of the road, Anthropic&#8217;s Claude 2 would be best at knowing its limits and Cohere AI would receive the title of most hallucinations \u2014 and most confident wrong answers.<\/p>\n<p>That&#8217;s all according to a Thursday report from researchers at Arthur AI, a machine learning monitoring 
platform.<\/p>\n<p>The research comes at a time when misinformation stemming from artificial intelligence systems is more hotly debated than ever, amid a boom in generative AI ahead of the 2024 U.S. presidential election.<\/p>\n<p>It&#8217;s the first report &#8220;to take a comprehensive look at rates of hallucination, rather than just sort of &#8230; provide a single number that talks about where they are on an LLM leaderboard,&#8221; Adam Wenchel, co-founder and CEO of Arthur, told CNBC.<\/p>\n<p>AI hallucinations occur when large language models, or LLMs, fabricate information entirely, behaving as if they are spouting facts. One example: In June, news broke that ChatGPT cited &#8220;bogus&#8221; cases in a New York federal court filing, and the New York attorneys involved may face sanctions.<\/p>\n<p>In one experiment, the Arthur AI researchers tested the AI models in categories such as combinatorial mathematics, U.S. presidents and Moroccan political leaders, asking questions &#8220;designed to contain a key ingredient that gets LLMs to blunder: they demand multiple steps of reasoning about information,&#8221; the researchers wrote.<\/p>\n<p>Overall, OpenAI&#8217;s GPT-4 performed the best of all models tested, and researchers found it hallucinated less than its prior version, GPT-3.5 \u2014 for example, on math questions, it hallucinated between 33% and 50% less, depending on the category.<\/p>\n<p>Meta&#8217;s Llama 2, on the other hand, hallucinates more overall than GPT-4 and Anthropic&#8217;s Claude 2, researchers found.<\/p>\n<p>In the math category, GPT-4 came in first place, followed closely by Claude 2, but in U.S. presidents, Claude 2 took the first place spot for accuracy, bumping GPT-4 to second place. 
When asked about Moroccan politics, GPT-4 came in first again, and Claude 2 and Llama 2 almost entirely chose not to answer.<\/p>\n<p>In a second experiment, the researchers tested how much the AI models would hedge their answers with warning phrases to avoid risk (think: &#8220;As an AI model, I cannot provide opinions&#8221;).<\/p>\n<p>When it comes to hedging, GPT-4 had a 50% relative increase compared to GPT-3.5, which &#8220;quantifies anecdotal evidence from users that GPT-4 is more frustrating to use,&#8221; the researchers wrote. Cohere&#8217;s AI model, on the other hand, did not hedge at all in any of its responses, according to the report. Claude 2 was most reliable in terms of &#8220;self-awareness,&#8221; the research showed, meaning accurately gauging what it does and doesn&#8217;t know, and answering only questions it had training data to support.<\/p>\n<p>A spokesperson for Cohere pushed back on the results, saying, &#8220;Cohere&#8217;s retrieval augmented generation technology, which was not in the model tested, is highly effective at giving enterprises verifiable citations to confirm sources of information.&#8221;<\/p>\n<p>The most important takeaway for users and businesses, Wenchel said, was to &#8220;test on your exact workload,&#8221; later adding, &#8220;It&#8217;s important to understand how it performs for what you&#8217;re trying to accomplish.&#8221;<\/p>\n<p>&#8220;A lot of the benchmarks are just looking at some measure of the LLM by itself, but that&#8217;s not actually the way it&#8217;s getting used in the real world,&#8221; Wenchel said. 
&#8220;Making sure you really understand the way the LLM performs for the way it&#8217;s actually getting used is the key.&#8221;<\/p>\n<\/div>\n<p>Read the full article <a href=\"https:\/\/www.cnbc.com\/2023\/08\/17\/which-ai-is-most-reliable-meta-openai-anthropic-or-cohere.html\" target=\"_blank\" rel=\"noopener\">here<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>If the tech industry&#8217;s top AI models had superlatives, Microsoft-backed OpenAI&#8217;s GPT-4 would be best at math, Meta&#8216;s Llama 2 would be most middle of the road, Anthropic&#8217;s Claude 2 would be best at knowing its limits and Cohere AI would receive the title of most hallucinations \u2014 and most confident wrong answers. That&#8217;s all<\/p>\n","protected":false},"author":1,"featured_media":12940,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[43],"tags":[],"class_list":{"0":"post-12939","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-news"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v20.12 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Meta, OpenAI, Anthropic and Cohere A.I. models all make stuff up \u2014 here&#039;s which is worst | iSafeSpend<\/title>\n<meta name=\"description\" content=\"If the tech industry&#039;s top AI models had superlatives, Microsoft-backed OpenAI&#039;s GPT-4 would be best at math, Meta&#039;s Llama 2 would be most middle of the\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/isafespend.com\/?p=12939\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Meta, OpenAI, Anthropic and Cohere A.I. 
models all make stuff up \u2014 here&#039;s which is worst | iSafeSpend\" \/>\n<meta property=\"og:description\" content=\"If the tech industry&#039;s top AI models had superlatives, Microsoft-backed OpenAI&#039;s GPT-4 would be best at math, Meta&#039;s Llama 2 would be most middle of the\" \/>\n<meta property=\"og:url\" content=\"https:\/\/isafespend.com\/?p=12939\" \/>\n<meta property=\"og:site_name\" content=\"iSafeSpend\" \/>\n<meta property=\"article:published_time\" content=\"2023-08-19T01:54:22+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-08-19T01:54:23+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/107269941-1689109722155-gettyimages-1250799402-RAFAPRESS_05042023-05789.jpeg\" \/>\n\t<meta property=\"og:image:width\" content=\"1920\" \/>\n\t<meta property=\"og:image:height\" content=\"1080\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/jpeg\" \/>\n<meta name=\"author\" content=\"News Room\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"News Room\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"3 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/isafespend.com\/?p=12939#article\",\"isPartOf\":{\"@id\":\"https:\/\/isafespend.com\/?p=12939\"},\"author\":{\"name\":\"News Room\",\"@id\":\"https:\/\/isafespend.com\/#\/schema\/person\/5b8c1c75336efaf09b163cd1eab0c9bf\"},\"headline\":\"Meta, OpenAI, Anthropic and Cohere A.I. 
models all make stuff up \u2014 here&#8217;s which is worst\",\"datePublished\":\"2023-08-19T01:54:22+00:00\",\"dateModified\":\"2023-08-19T01:54:23+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/isafespend.com\/?p=12939\"},\"wordCount\":605,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/isafespend.com\/#organization\"},\"articleSection\":[\"News\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/isafespend.com\/?p=12939#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/isafespend.com\/?p=12939\",\"url\":\"https:\/\/isafespend.com\/?p=12939\",\"name\":\"Meta, OpenAI, Anthropic and Cohere A.I. models all make stuff up \u2014 here's which is worst | iSafeSpend\",\"isPartOf\":{\"@id\":\"https:\/\/isafespend.com\/#website\"},\"datePublished\":\"2023-08-19T01:54:22+00:00\",\"dateModified\":\"2023-08-19T01:54:23+00:00\",\"description\":\"If the tech industry's top AI models had superlatives, Microsoft-backed OpenAI's GPT-4 would be best at math, Meta's Llama 2 would be most middle of the\",\"breadcrumb\":{\"@id\":\"https:\/\/isafespend.com\/?p=12939#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/isafespend.com\/?p=12939\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/isafespend.com\/?p=12939#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/isafespend.com\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Meta, OpenAI, Anthropic and Cohere A.I. 
models all make stuff up \u2014 here&#8217;s which is worst\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/isafespend.com\/#website\",\"url\":\"https:\/\/isafespend.com\/\",\"name\":\"Solutions For Real\",\"description\":\"Latest Finance News and Updates\",\"publisher\":{\"@id\":\"https:\/\/isafespend.com\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/isafespend.com\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/isafespend.com\/#organization\",\"name\":\"Solutions For Real\",\"url\":\"https:\/\/isafespend.com\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/isafespend.com\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/sf-logo-1.png\",\"contentUrl\":\"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/sf-logo-1.png\",\"width\":690,\"height\":64,\"caption\":\"Solutions For Real\"},\"image\":{\"@id\":\"https:\/\/isafespend.com\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/isafespend.com\/#\/schema\/person\/5b8c1c75336efaf09b163cd1eab0c9bf\",\"name\":\"News Room\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/isafespend.com\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/avatar_user_1_1691264579-96x96.png\",\"contentUrl\":\"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/avatar_user_1_1691264579-96x96.png\",\"caption\":\"News Room\"},\"sameAs\":[\"https:\/\/isafespend.com\"],\"url\":\"https:\/\/isafespend.com\/?author=1\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Meta, OpenAI, Anthropic and Cohere A.I. 
models all make stuff up \u2014 here's which is worst | iSafeSpend","description":"If the tech industry's top AI models had superlatives, Microsoft-backed OpenAI's GPT-4 would be best at math, Meta's Llama 2 would be most middle of the","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/isafespend.com\/?p=12939","og_locale":"en_US","og_type":"article","og_title":"Meta, OpenAI, Anthropic and Cohere A.I. models all make stuff up \u2014 here's which is worst | iSafeSpend","og_description":"If the tech industry's top AI models had superlatives, Microsoft-backed OpenAI's GPT-4 would be best at math, Meta's Llama 2 would be most middle of the","og_url":"https:\/\/isafespend.com\/?p=12939","og_site_name":"iSafeSpend","article_published_time":"2023-08-19T01:54:22+00:00","article_modified_time":"2023-08-19T01:54:23+00:00","og_image":[{"width":1920,"height":1080,"url":"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/107269941-1689109722155-gettyimages-1250799402-RAFAPRESS_05042023-05789.jpeg","type":"image\/jpeg"}],"author":"News Room","twitter_card":"summary_large_image","twitter_misc":{"Written by":"News Room","Est. reading time":"3 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/isafespend.com\/?p=12939#article","isPartOf":{"@id":"https:\/\/isafespend.com\/?p=12939"},"author":{"name":"News Room","@id":"https:\/\/isafespend.com\/#\/schema\/person\/5b8c1c75336efaf09b163cd1eab0c9bf"},"headline":"Meta, OpenAI, Anthropic and Cohere A.I. 
models all make stuff up \u2014 here&#8217;s which is worst","datePublished":"2023-08-19T01:54:22+00:00","dateModified":"2023-08-19T01:54:23+00:00","mainEntityOfPage":{"@id":"https:\/\/isafespend.com\/?p=12939"},"wordCount":605,"commentCount":0,"publisher":{"@id":"https:\/\/isafespend.com\/#organization"},"articleSection":["News"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/isafespend.com\/?p=12939#respond"]}]},{"@type":"WebPage","@id":"https:\/\/isafespend.com\/?p=12939","url":"https:\/\/isafespend.com\/?p=12939","name":"Meta, OpenAI, Anthropic and Cohere A.I. models all make stuff up \u2014 here's which is worst | iSafeSpend","isPartOf":{"@id":"https:\/\/isafespend.com\/#website"},"datePublished":"2023-08-19T01:54:22+00:00","dateModified":"2023-08-19T01:54:23+00:00","description":"If the tech industry's top AI models had superlatives, Microsoft-backed OpenAI's GPT-4 would be best at math, Meta's Llama 2 would be most middle of the","breadcrumb":{"@id":"https:\/\/isafespend.com\/?p=12939#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/isafespend.com\/?p=12939"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/isafespend.com\/?p=12939#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/isafespend.com\/"},{"@type":"ListItem","position":2,"name":"Meta, OpenAI, Anthropic and Cohere A.I. 
models all make stuff up \u2014 here&#8217;s which is worst"}]},{"@type":"WebSite","@id":"https:\/\/isafespend.com\/#website","url":"https:\/\/isafespend.com\/","name":"Solutions For Real","description":"Latest Finance News and Updates","publisher":{"@id":"https:\/\/isafespend.com\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/isafespend.com\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/isafespend.com\/#organization","name":"Solutions For Real","url":"https:\/\/isafespend.com\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/isafespend.com\/#\/schema\/logo\/image\/","url":"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/sf-logo-1.png","contentUrl":"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/sf-logo-1.png","width":690,"height":64,"caption":"Solutions For Real"},"image":{"@id":"https:\/\/isafespend.com\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/isafespend.com\/#\/schema\/person\/5b8c1c75336efaf09b163cd1eab0c9bf","name":"News Room","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/isafespend.com\/#\/schema\/person\/image\/","url":"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/avatar_user_1_1691264579-96x96.png","contentUrl":"https:\/\/isafespend.com\/wp-content\/uploads\/2023\/08\/avatar_user_1_1691264579-96x96.png","caption":"News 
Room"},"sameAs":["https:\/\/isafespend.com"],"url":"https:\/\/isafespend.com\/?author=1"}]}},"_links":{"self":[{"href":"https:\/\/isafespend.com\/index.php?rest_route=\/wp\/v2\/posts\/12939","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/isafespend.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/isafespend.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/isafespend.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/isafespend.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=12939"}],"version-history":[{"count":1,"href":"https:\/\/isafespend.com\/index.php?rest_route=\/wp\/v2\/posts\/12939\/revisions"}],"predecessor-version":[{"id":12941,"href":"https:\/\/isafespend.com\/index.php?rest_route=\/wp\/v2\/posts\/12939\/revisions\/12941"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/isafespend.com\/index.php?rest_route=\/wp\/v2\/media\/12940"}],"wp:attachment":[{"href":"https:\/\/isafespend.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=12939"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/isafespend.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=12939"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/isafespend.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=12939"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}