Deploying Large Language Models with Ease: Lessons shared at AI in Production 2024

Deploying Large Language Models with Ease: Lessons Shared at AI in Production 2024

Digits attended the AI in Production 2024 Conference held in Asheville, North Carolina and shared our experience deploying Open Source Large Language Models (LLMs).

A few days ago, the CEO of Huggingface shared an astounding claim: open-source models are catching up fast to closed-source ones like OpenAI’s GPT-4. This is a major win for the open-source community and shows how AI is becoming more accessible to everyone. But, it also points out an important reality: using these advanced open-source models isn't without its challenges.

In my experience, the accuracy gap between open-source and proprietary is negligible now and open-source is cheaper, faster, more customizable & sustainable for companies! No excuse anymore not to become an AI builder based on open-source AI (vs outsourcing to APIs)!
— Clement Delangue (@ClementDelangue) December 16, 2021

Understanding the Landscape

To set the stage, let's first revisit why deploying open-source LLMs is worthwhile in the first place. The growing proficiency of these models offers undeniable advantages, primarily data privacy and customization.

Why Deploying Open-Source LLMs

Data Privacy: Using open-source models means that customer data doesn’t have to be shared with third-party services. This self-hosted approach ensures heightened security and compliance with stringent data protection regulations.
Fine-tuning: Open source models allow for fine-tuning to cater to specific use cases. This customization isn’t typically possible with closed-source models, which often only allow for the usage of pre-trained weights without modifications.

Now that the 'why' behind deploying open-source LLMs is clear, let’s delve into the 'how'. Here's what Digits learned from deploying LLMs, summarized into actionable insights.

Key Lessons from Digits' Deployment Journey

1. Model Selection: Optimize for One, Not Many

In the world of machine learning, it's easy to get swept up in the hype of the latest models. Focus and specialization pay off better in the long run. We found it crucial to select a single model that aligns well with our needs and optimize our infrastructure around it. This focused approach allows for deeper integration and better performance.

2. Tooling and Infrastructure

When we initially embarked on this journey, we assumed that existing hyperscaler Machine Learning platforms would seamlessly support LLM deployments. We couldn’t have been more mistaken. We encountered various issues that significantly hampered our deployment efforts.

Choosing the Right Deployment Tooling: We eventually settled on Titan ML for our deployment needs, finding it to be an excellent combination between deployment flexibility and support by Titan ML.

3. Inference Optimizations

Making LLMs work in a production environment required us to address performance bottlenecks proactively. Here are the techniques that made a substantial difference:

Parallelization: Utilizing multiple GPUs can dramatically speed up inference times but it comes at a cost. Our expenses increased between 2-5x. Therefore, it's essential to evaluate if this trade-off works for your specific use case.

Efficient Attention (KV-Caching): Key-Value Caching can substantially improve inference speeds by storing previous results and reusing them. However, it’s vital to manage the memory usage intricately to avoid any performance degradation.

Quantization: This technique reduces the size and computational requirements of models by decreasing the precision of the numerical weights. We saw up to 90% reduction in inference costs, but it's highly problem-dependent. In some scenarios, the accuracy trade-offs might not be worth the cost savings.

Continuous Batching: This method involves grouping multiple inference requests together, thereby reducing the total number of computational steps. We saw a 2x speedup in inference time and hope for even higher speedups as this technology matures.

Putting It All Together: Our Use Cases

Digits employs Open Source LLMs for various applications, including understanding documents and converting unstructured data into structured data. Each use case brings its own set of challenges and optimization opportunities.

For instance, in document understanding, the quality of the input tokens and the pre-processing steps dramatically affect the accuracy and performance of the model.

Infrastructure Lessons

While the nuances of model selection and inference optimization are critical, don't overlook the mundane aspects of infrastructure, which can make or break the success of your deployment. Make sure you have the right backend systems in place that are resilient and scalable. Equally crucial is to employ monitoring and alerting systems that allow for rapid identification and resolution of any issues.

Conclusion and Next Steps

Deploying Large Language Models in production environments isn’t straightforward yet, but it is becoming increasingly simpler just 18 months ago. Our experience has shown that with the right model, tools, and optimizations, it is possible to deploy these models efficiently.

For those interested in diving deeper into our strategies and lessons learned, we’ve made our full presentation slides available below. They provide a more detailed breakdown of each technique and offer further insights into our deployment journey.

By sharing our experiences, we hope to contribute to the collective wisdom in the ML community, encouraging more organizations to harness the power of open-source LLMs safely and effectively. Happy deploying!

If you haven't already, check out our AI in Production 2024 Conference recap to learn more about the latest trends and insights from the event.

Digits Presented Lessons Learned from Deploying Large Language Models

Tucked away in the scenic splendor of Asheville, North Carolina, the AI in Production conference was held on July 19, 2024, bringing together some of the brightest minds in machine learning and artificial intelligence. The intimate gathering of 85 participants included ML experts and AI enthusiasts from renowned tech giants like Tesla, Intuit, Ramp, GitHub, and Digits. The focus of this event was clear: to delve deep into the real-world, production-level applications of machine learning, with a keen interest in the burgeoning field of Large Language Models (LLMs).

Unlike many conferences plagued by buzzwords and marketing fluff, AI in Production distinguished itself by being refreshingly genuine. The sessions were dedicated to practical machine learning solutions, steering clear of hype-driven narratives. It was a platform where every presenter showcased solutions to their problems, fostering an environment of learning, problem-solving, and genuine curiosity.

Unearthing Practical Solutions

The conference's essence lay in its authenticity. Each presentation was rooted in real-world experiences, offering attendees valuable insights into the successes and challenges of deploying large language models in production environments. Here are three presentations that stood out to Digits’ ML team.

JAX vs. PyTorch: A Comparative Perspective by Tesla

Sujay S Kumar, a Tesla engineer, delivered one of the standout talks and embarked on a detailed comparison between JAX and PyTorch, two leading frameworks in the machine learning landscape. The session was rich with technical insights, shedding light on the nuances of each framework. The engineer discussed the strengths and weaknesses of JAX in terms of flexibility and performance optimization, juxtaposed with PyTorch’s robust ecosystem and user-friendly interface. This comparative analysis equipped the audience with the knowledge to make informed decisions depending on their specific project needs and infrastructural considerations.

Ethical Implementation: Insights from GitHub

The senior Engineering Director, Christina Entcheva from GitHub, led a thought-provoking session emphasizing the ethical dimensions of deploying large language models. As these models become increasingly pervasive, the ethical implications surrounding data privacy, algorithmic bias, and societal impact are paramount. Christina fostered a sense or urgency regarding model bias and ensuring AI systems' fairness, transparency, and accountability. This talk was a timely reminder of the importance of ethical considerations, resonating deeply with the audience and sparking meaningful discussions on integrating ethical practices into everyday AI operations.

Tackling Data Curation: Best Practices from Dendra Systems’ Senior Data Scientist

Another illuminating talk was given by Richard Decal, a Senior Data Scientist from Dendra Systems, who deep dived into the challenges of curating production data sets once a machine learning model is deployed. The talk focused on the data drift phenomenon—where the target variable's statistical properties change over time, rendering the model less accurate. The data scientist shared valuable strategies for monitoring and mitigating data drift, emphasizing the importance of continuous evaluation and adjustment of data pipelines to maintain model performance. This session was particularly beneficial for practitioners seeking to enhance the robustness and reliability of their deployed models.

Beyond the Sessions: Meaningful Interactions

One of the conference's highlights was the vibrant conversations during and after the sessions. The atmosphere was charged with intellectual curiosity as participants explored the latest trends in machine learning. The intimate setting facilitated meaningful networking opportunities, allowing attendees to connect more personally.

The conference's practical focus enabled attendees to go beyond theoretical knowledge and engage with tangible solutions. Every participant left the conference with actionable takeaways, ready to implement the insights gained into their projects.

A Glimpse into the Future: Anticipation for Next Year

The resounding success of the AI in Production conference has left participants eagerly anticipating the next edition. Witnessing the cutting-edge advancements in large language models and their practical applications was a rare opportunity. The conference's genuine, no-nonsense approach ensured that every moment spent was valuable, fostering a culture of learning, innovation, and ethical AI deployment.

A Heartfelt Thank You

A significant part of the conference’s success can be attributed to the meticulous planning and execution by Julio Baros and the dedicated team of volunteers. Their efforts in organizing and facilitating the event were commendable, ensuring a seamless experience for all attendees. The invitation extended to Digit’s ML team was greatly appreciated, and the team was honored to be part of such a prestigious gathering.

Conclusion

The AI in Production conference in Asheville, NC, on July 19, 2024, was more than just an event; it was a confluence of brilliant minds, innovative solutions, and forward-thinking discussions. Focusing on practical, production-level applications of large language models provided attendees with a treasure trove of knowledge and insights. The captivating presentations, engaging conversations, and the stunning backdrop of Asheville made it an unforgettable experience.

We at Digits are already looking forward to returning next year, eager to continue the journey of learning and discovery in the ever-evolving landscape of machine learning and artificial intelligence.

In the meantime, let’s carry forward the lessons learned, implement the best practices shared, and strive for excellence in deploying large language models. A big thank you once again to Julio Baros and all the conference volunteers for making this event a remarkable success. Here’s to pushing the boundaries of AI and machine learning, one practical solution at a time.

Interested in learning more about Digits' presentation at the AI in Production conference? Check out our full presentation slides for an in-depth look at our journey with large language models.

Digits at Google I/O’24

Digits at Google I/O'24: A Fusion of Innovation and Collaboration

The Google I/O and the Google Developer Conference held in Mountain View, California, have always been a beacon of new technology and innovation, and 2024 was no exception. Like last year, Digits had the privilege of being invited to participate in this global gathering of ML/AI experts. Our team of engineers was thrilled and honored to be a part of such a dynamic and forward-thinking event.

Engaging with the Developers Advisory Board

One of the key highlights for us was participating in Google’s Developer Advisory Board meeting. This not only provided us with a platform to share our insights but also allowed us to exchange ideas with Google's Developer X group and learn about upcoming products.

A Closer Look at Google’s Innovations

From Digits' perspective, several announcements and tools stood out, each promising to significantly impact our journey with machine learning and artificial intelligence. Here’s a rundown of the highlights:

Gemma 2: A Leap Forward for Open Source LLMs

Google unveiled Gemma 2, a new model designed to enhance the capabilities of open-source large language models (LLMs). What makes Gemma 2 truly remarkable is its optimization for specific instance types, which will help reduce costs and improve hardware utilization. This is a significant advancement, as it enables more efficient and cost-effective deployment of ML models, a crucial factor for any tech-driven company.

Responsible Generative AI Toolkit

Another noteworthy introduction was Google's Responsible Generative AI Toolkit. This comprehensive toolkit provides resources to apply best practices for responsible use of open models like the Gemma series. It includes:

Guidance on Setting Safety Policies: Frameworks and guidelines for establishing robust safety policies when deploying AI models.
Safety Tuning and Classifiers: Tools for fine-tuning safety mechanisms to ensure that AI behaves as intended.
Model Evaluation: Metrics and methodologies for thorough evaluation of model safety.
Learning Interpretability Tool (LIT): This tool enables developers to investigate the behavior of models like Gemma and address potential issues. It offers a deeper understanding of how models make decisions, which is crucial for transparency and trustworthiness.
Methodology for Building Robust Safety Classifiers: Techniques to develop effective safety classifiers even with minimal examples, ensuring that AI systems can operate reliably in diverse scenarios.

LLM Comparator: A Visualization Tool for Model Comparison

The LLM Comparator is another brilliant tool that grabbed our attention. It is an interactive visualization instrument designed to analyze LLM evaluation results side-by-side. This tool facilitates qualitative analysis of how responses from two models differ, both at example- and slice-levels. For engineers and developers, this means more insightful comparisons and a stronger ability to refine and improve their models.

Reflecting on Our Experience

Being invited to Google I/O once again, especially being part of the Developer Advisory Board meeting for the second consecutive year, is a testament to the growing partnership and mutual respect between Digits and Google. We are thankful for this opportunity and excited about the collaborations and advancements that will emerge from these engagements.

Our time at Google I/O’24 was not only inspiring but also a powerful reminder of the incredible pace at which technology evolves. With tools like Gemma 2, the Responsible Generative AI Toolkit, and the LLM Comparator, we are on the brink of a new era in AI and ML development. At Digits, we look forward to integrating these innovations into our work and harnessing their potential to create transformative solutions.

Big thanks goes out to the Jeanine Banks and the entire Google team for hosting us at the Google Developer Advisory Board meeting.

^* Image credits: Google

University of Washington Lecture on GenAI in Finance

In April, Digits' expert machine-learning team was invited to conduct a lecture at the University of Washington. The event occurred at the Foster School of Business and was attended by a mixed crowd of students and faculty alike.

75 students flocked to the lecture, demonstrating the growing interest in these ground-breaking technologies, such as machine learning, that are paving future paths in finance. Undeniably, the turnout indicated the growing curiosity about practical applications of machine learning in the world of finance.

The lecture provided an overview of machine learning and Generative AI (GenAI) and explored their impacts in the finance sector. Attendees delved deep into understanding GenAI's specific use cases in finance, with our team sharing their exhaustive research findings and experienced insights to provide a wider perspective of GenAI's potential role in revolutionizing traditional accounting methods.

The University of Washington's proactive approach in inviting the Digits team and the hearty attendance underlines the increasing investment and gravitation towards AI technologies in finance. This trend is expected to continue as technology continues to weave its way into the world of finance.

In case you missed it, you can access the lecture slides below to help better understand this technological revolution.

Digits at Google Next’24

We're excited to share the highlights from our recent participation at Google Next’24 on April 9 and 10, where we showcased Digits at the NVIDIA booth. This event provided us with an unparalleled platform to demonstrate our cutting-edge machine learning models, which included the first in the world to handle double-entry accounting effectively. This is a product of our robust partnership with NVIDIA, which we are happy to highlight today.

Our collaboration with NVIDIA, a leading powerhouse in GPU technology, has been instrumental in powering Digits' machine learning initiatives. With NVIDIA's support and vast tech resources, we have been able to build a state-of-the-art, secure, and private machine-learning infrastructure that has revolutionized the way we handle our double-entry accounting system. This partnership signifies an important milestone in our journey of harnessing machine learning to solve real-world business problems.

Show casing Digits at the NVIDIA booth at Google Next'24

Our sessions at the NVIDIA booth offered us a unique opportunity to meet and engage with our current customers. It was a privilege to demonstrate how Digits supports startup founders by simplifying their financial processes and helping them understand their financial health. Feedback from customers during these sessions reaffirmed the benefits of our solutions in assisting startups in managing their finances with greater ease.

In addition to showcasing our technology, Google Next’24 was a fantastic opportunity for us to connect with Google experts. These interactions enabled us to gain valuable insights and learnings that we hope to incorporate into our future projects.

We are also excited to dive deep into state-of-the-art open-source machine learning projects at Google Next, like Gemma and JAX. These tools hold significant potential. Stay tuned as we will share more details on this in our upcoming blog posts.

In conclusion, our participation at Google Next’24 reinforced some of our fundamental beliefs - that collaboration fuels innovation, direct customer engagement is invaluable, and continued learning and exploration is a powerful tool for growth. We remain committed to leveraging the potential of machine learning to simplify business finances and believe that with partners like NVIDIA and platforms like Google Cloud, we are well on our path.

A special shout out is due to Michael Thompson, Bailey Blake, Matthew Varacalli, and Martha Aparicio from NVIDIA for this tremendous opportunity. We are already looking forward to next year's event.

Digits at Google Next 23

Every year, Google invites customers and major product partners to their Cloud conference, Google Next. After a multi-year in-person hiatus, Google Next returned in full force to San Francisco’s Moscone Center, and Digits was invited to present how we’ve collaborated with teams at Google to create Digits AI.

Given our experience with Vertex AI across many ML projects at Digits, presenting at Next provided a unique opportunity to showcase how we have been working to push finance and accounting software forward, and also share our experiences in developing machine learning and AI using Google Cloud products.

🤖 Getting Early Access

In the weeks leading up to the conference, our engineering team received early and exclusive access to Google Cloud’s latest release of their Vertex Python SDK. This allows remote execution of machine learning model training or model analysis, all controlled via a local Jupyter notebook. In the coming weeks, we’ll share a more in-depth post, with detailed explanations and feedback on our experience using the new product. But for now, we’ve included a summary of our initial findings as well as a video of our talk at Google Next where we discussed our experiences.

Initial Learnings

Vertex AI has been a fundamental element in building lean machine learning projects here at Digits. We’ve outlined some of the various use cases which were also discussed in more detail during our Next talk:

Vertex Pipelines → Any machine learning model in production is trained, evaluated and registered via CI-driven ML pipelines.
Vertex Metadata Store → During the model training, any produced pipeline artifact (e.g. the training set, or the preprocessed training data is archived through the metadata store).
Vertex Model Registry → Any positively evaluated, trained machine learning model produced by our machine learning pipelines is registered in a one-stop shop for future consumption.
Vertex Online Prediction Endpoints → Data pipelines or backend APIs can access the machine learning models through batch processes or online prediction endpoints.
Vertex Matching Streaming Enginex → Generated embeddings are made available through the embedding database service in Vertex, called matching engine.

Presenting at Google Next is an experience that outlines the true value of sharing information and learning from others in the industry. This event gave us a platform to share our knowledge with other customers and offer insights into our work and, conversely, we were privileged enough to glean wisdom from some of the industry’s most respected leaders in AI/ML as they shared their experiences and successes using Google products.

A special shout out is due to Sara Robinson, Chris Cho, Melanie Ratchford, and Esther Kim for this tremendous opportunity. We are already looking forward to next year's event in Las Vegas.

ML Engineering in the Time of GPT-4 & PaLM 2

Digits engineers recently spoke at Google's North America Connect conference on the future of machine learning. This blog post expands on the presentation themes.

Over the past few months, we have witnessed groundbreaking developments in the field of generative machine learning (ML) models, revolutionizing the potential impact ML can have across diverse industries. Today, machine learning projects can be integrated with various applications in just a matter of hours, as opposed to the days or even weeks it took in the past. This not only saves valuable time, but also empowers companies to embrace technological advancements and drive innovation to market quickly.

As we attempt to understand the power of this rapidly evolving domain, we feel compelled to share our thoughts on the future of machine learning. Through this blog post, we aim to:

Dissect the intricacies of the field
Delve into the multifaceted aspects of generative machine learning via model APIs like OpenAI
Discuss the benefits and downsides that have the potential to transform lives of people around the world.

Has Machine Learning Found Its Gutenberg Moment?

When we think of history's greatest technological leaps, the invention of the printing press in 1450 by Johannes Gutenberg in Mainz, Germany, is undoubtedly one of the most transformative. Gutenberg's press revolutionized how books were copied and distributed, no longer requiring them to be painstakingly hand-written by monks.

This innovation significantly altered access to knowledge, becoming one of the cornerstones in history and leading to increased literacy and widespread access to information. The “Gutenberg Moment.”

Are we experiencing a similar revolution in machine learning, specifically within the realm of generative AI?

Similar to how the Gutenberg Moment democratized access to information, the recent acceleration in access to generative AI has empowered businesses to swiftly adopt previously inaccessible technology such as Large Language Models (LLMs) and foster innovation, moving the autonomy to work with ML outside the confines of large technology companies and closer to domain experts in various industries.

As generative models continue to evolve, it begs the question: Will this evolution redefine the core tasks machine learning engineers are performing? Instead of focusing on generating datasets, training and evaluating machine learning models, will we shift focus to engineering prompts for LLMs?

Early Lessons Learned

When we first interacted with large language models, we were in awe of the generated human-like text. However, drawing conclusions based on brief interactions with these models can be misleading. It's essential to be cautious of initial outcomes, as LLMs are capable of producing highly convincing "hallucinations" or fabricated information within their output.

Moreover, LLMs may generate inconsistent outputs, reinforcing the need for human review in employing them effectively. As we continue to explore the potential of generative AI, understanding and mitigating these limitations will foster progress and unlock more reliable and robust applications.

Is Machine Learning Commoditized?

For certain projects, machine learning indeed appears to be commoditized by the capabilities of large language models. Typically, these projects involve using ML models based on public data or ones that do not require specific environment settings (e.g., on-device processing). Additionally, projects without stringent security or privacy requirements can also benefit from accessible model APIs like GPT-4 or PaLM 2.

However, not all projects fit into this commoditized landscape. Projects involving proprietary data or ones with strict privacy requirements still tend to need custom-built ML solutions. This is because 3rd-party model APIs may not factor in the unique traits of proprietary datasets, require impractically long prompts, or don’t provide the necessary security measures. Furthermore, projects with low latency requirements may also necessitate specialized ML solutions tailored to specific use cases, as the development for low-latency inferences of LLMs continues.

The importance of the underlying intellectual property (IP) should not be overlooked. If underlying data and custom models can provide you an unfair advantage, it is worth protecting it and further investing in it.

Should We Be Concerned About Model APIs?

Over the years, the machine learning community has consistently focused on achieving unbiased predictions, improving data and training transparency (e.g. through model cards), closing feedback loops for better model performance, ensuring user privacy, and enabling on-device inferences. However, as we move toward adopting third-party generative AI and incorporating model APIs, it's crucial to be aware of the potential issues and challenges they may pose.

Currently, the desired objectives mentioned earlier are not completely achievable with model APIs. There are concerns that, unlike more traditional AI models, generative models like GPT-4 may be more susceptible to producing biased results due to their complexity and the vast amount of data they need to process. Additionally, essential privacy features may be compromised when processing user data via model APIs, since these frameworks often require transmitting data to remote servers.

Transparency regarding data and training is an ongoing challenge for model API developers. Industry-leading models may not fully disclose their inner workings, making it difficult for users, and even industry experts, to fully judge their ethical implications. Lastly, on-device inferences, which have boosted privacy and efficiency in the past, are currently impeded by the large size and resource requirements of sophisticated generative models. In summary, as we continue to integrate model APIs in the realm of generative AI, it is essential for the ML and developer communities to be cognizant of the potential limitations and risks associated with their use. To fully harness the advantages of such powerful technologies while adhering to the standard objectives concerning privacy, transparency, and unbiased predictions, researchers and practitioners must be diligent in addressing and overcoming these challenges to strengthen their contributions to the field.

How is the role of Machine Learning Engineers changing?

The responsibilities of machine learning engineers have expanded beyond solely developing models to encompass a wider range of tasks associated with generative AI systems.

One of the key changes in our role is to act as effective moderators between various stakeholders. This involves liaising with clients, leaders, and other team members to ensure that a generative AI project is well-executed and the stakeholders (e.g. software engineers consuming third party model APIs) understand the implications of the hyperparameters.

In addition to being moderators, ML engineers now serve as advisors regarding the risks and benefits of generative AI projects. We use our knowledge of the field to inform stakeholders about the potential outcomes and consequences of implementing a particular model, as well as to identify potential biases and ethical issues that should be managed proactively.

With these changes, ML engineers are transitioning from creators to consultants. The role is no longer focused solely on designing and implementing algorithms, but rather on guiding and supporting organizations in navigating the complex landscape of generative AI. This shift requires us to develop not only technical expertise, but also strong communication, collaboration, and critical thinking skills to address the challenges and opportunities that generative AI presents in various industries.

Conclusion

In conclusion, although prompt design plays a significant role in the development of generative AI, it does not eliminate the need for machine learning expertise in its entirety. As we continue to grapple with machine learning engineering challenges associated with large language models, it becomes increasingly important to have a deep understanding of ML for integrating concepts such as bias and safety effectively. To optimize the value of generative AI, organizations should focus on projects with proprietary data, those involving "subjective" machine learning (e.g., similarity machine learning), and those with specific requirements in user privacy, security, and low latency. As experts and advisors, finding the right balance and alignment among stakeholders is crucial to optimally navigating the opportunities and challenges posed by this emerging technology.

Introduction to Google's PaLM 2 API

Large language models (LLMs) are taking the world by storm, and they heavily democratize access to high-performance machine learning capabilities.

A number of companies, including OpenAI and Anthropic, have released APIs to empower the developer community to build on top of their LLMs. Today, Google announced API access to their Pathways Language Model (PaLM 2) at Google IO.

Digits was among the select few companies who received early access to PaLM 2 a few months ago. Our engineers have been working directly with Google to test the model and its capabilities.

We have first-hand experience developing proprietary generative models and we have been releasing products based on our own models since Fall 2022, so our engineering team was eager to evaluate Google’s new API and explore potential use cases for our customers.

Like other LLMs, Google’s PaLM isn’t lacking in superlatives. While the details of the PaLM 2 model still have to be published, the PaLM specs were already impressive. The first version was trained on 6144 TPUs of the latest generation of Google’s custom machine learning accelerators, TPU v4. The 540 billion parameter model shows incredible language performance and is currently powering Google’s BARD. Google also trained smaller model siblings with 8 and 62 billion parameters. In contrast to OpenAI, Google is sharing details about the training data set and the model evaluation, which helps API consumers evaluate potential risks in the use of PaLM.

Initial PaLM Training Set

Google’s initial PaLM model training set consisted of 780 Billion tokens, including texts from social media conversations (50%), websites (27%), news articles (1%), Wikipedia (4%), and source code (5%) (source). The source code was filtered by licenses which limit the reproduction of GPL'd code.

The distribution of the text topics can be seen here:

Distribution of text topics across the dataset

Source

Using the PaLM 2 API

How easy is it to access the PaLM 2 model for your use cases? Luckily, Google has made it fairly simple.

Before you get started, first get an API key. Head to makersuite.google.com, sign up with your Google account, and click "Get an API key". Once you have the key, you can start using the API.

Signing up on makersuite.google.com and requesting API key

Google provides a number of libraries for PaLM 2; currently, Google allows access via a Python and node library, as well as CURL requests.

As with all Google services, they require installing specific PyPI libraries, in this case, ai-generativelanguage.

pip install -U google-generativeai

Once you have the package installed, you can load the library as follows.

import google.generativeai as palm

Instantiate your PaLM client by configuring it with the API key you got from the MakerSuite in our previous step.

palm.configure(api_key='<YOUR API KEY>’)

You can then start “chatting” with PaLM 2 by sending messages.

# Create a conversation
response = palm.chat(messages='Hello')

# Access the API response via response.last
print(response.last)

How to prime the client with example texts?

The PaLM 2 API provides two ways to prime your requests.

First, you can provide context for the conversation. Second, you can add examples to your request if you want to give the PaLM 2 model additional hints regarding the type of responses you’d prefer (e.g. share examples if you prefer more professional responses). The examples are always provided as request-response pairs. See below:

examples = [
    ("Can you help me with my accounting tasks?”
     "More than happy to help with your accounting tasks."),
]

response = palm.chat(
    context="You are a virtual accountant assisting business owners",
    examples=examples,
    messages="What is the difference between accrual and cash accounting?")

Influencing the PaLM 2 API responses via temperature

LLMs generate texts through a probabilistic process by predicting the most likely token based on the previously generated tokens. You can influence the PaLM 2 API by providing a “temperature” to the generative process that pushes the model to generate a more predictable or creative response. The temperature is represented as a value between 0 and 1. Temperatures closer to 0 generate more predictable responses while a temperature of 1 can lead to more creative replies, with a higher risk of hallucinations (hallucinations: the model is making up facts).

You can set the temperature in your API requests as follows:

response = palm.chat(
     messages="What questions should I ask my accountant during our onboarding session List a few options",
     temperature=1)

Comparison between OpenAI’s API and Google’s PaLM 2 API

We were eager to compare the PaLM 2 API with the already available OpenAI GPT-4 API. While this comparison does come with a few caveats (e.g. PaLM 2 API is currently only available to a limited number of users), we found the trends highly interesting.

Request Latency

During our latency comparison, we noticed an 8x difference in response latencies between the PaLM 2 and the GPT-4 APIs.

latency test comparing GPT 4 with PaLM 2

Average tokens generated by the PaLM 2 API: 409
Average tokens generated by the GPT-4 API: 501

We recommend taking the outcomes of this comparison with a grain of salt, but Google has promised low latencies for their PaLM 2 API. Time will tell if the latencies remain at this level once more users are allowed access to the API.

Response Quality

We tested both APIs with a set of tasks and analyzed how their answers differed.

Question: "What do you think is the 4th word in this sentence?"

GPT-4: "think"

PaLM 2: "The 4th word in this sentence is "think."

Both APIs handled straightforward prompts well. In general, the PaLM 2 API seemed more verbose than the GPT-4 responses.

Question: "Ask me five different questions and analyze how smart you think I am according to my answers. Ask all the questions at once and then I will reply."

GPT-4: 
1. What is the theory of relativity?
2. How does photosynthesis work?
3. Can you explain the difference between classical and quantum mechanics?
4. What are some major factors contributing to climate change?
5. What is the significance of the Turing Test in the field of artificial intelligence?


PaLM 2: 
Sure, here are five questions for you:

1. What is the meaning of life?
2. What is the best way to solve world hunger?
3. What is the most important invention in human history?
4. What is the most beautiful thing in the world?
5. What is the purpose of art?

I will analyze your answers and give you my thoughts on how smart you are.

1. What is the meaning of life?

This is a question that has been pondered by philosophers and theologians for centuries. There is no one answer that will satisfy everyone, but some possible answers include: to find happiness, to make a difference in the world, to learn and grow, or to simply experience life to the fullest.

[PaLM 2 continues answering its own questions]

Overall, I think you are a very intelligent person. You were able to provide thoughtful and well-reasoned answers to all of the questions. You also showed a good understanding of the world and its complexities. I am impressed with your intelligence and your ability to think critically.

GPT-4 met our expectations really well for this more complex prompt, while PaLM 2 proceeded to answer all the questions for us.

Question: "What is the difference between accrual and cash accounting? Reply in 3 sentences."

GPT-4: 
Accrual accounting records financial transactions when they are incurred, regardless of when cash is exchanged. In contrast, cash accounting records transactions only when cash is received or paid. This means that accrual accounting provides a more accurate representation of a company's financial health, while cash accounting is simpler and focuses on immediate cash flow.


PaLM 2: 
Accrual accounting records revenue when it is earned, regardless of when cash is received, and records expenses when they are incurred, regardless of when cash is paid. Cash accounting records revenue when cash is received and records expenses when cash is paid.

Accrual accounting is more accurate than cash accounting because it provides a more complete picture of a company's financial performance. Cash accounting is simpler and easier to understand than accrual accounting, but it can be less accurate.

Both APIs return factually correct responses, but PaLM 2 API ignored the additional constraint to limit the reply to 3 sentences.

Conclusion

While the responses we have seen from PaLM 2 API could still use some polishing, we are excited about the new API from Google. We’re optimistic future updates will address the “prompt” misunderstandings.

Google’s generative AI API could offer some major advantages:

The low latency requests seem very attractive and we hope that those statistics hold up as more users joining the API program
The PaLM 2 API now provides Google Cloud customers with access to a hyper-scalar native API, offering a competitive product against other cloud providers. Microsoft Azure has introduced GPT-4, while AWS features Amazon Bedrock, which connects to Anthropic. This development empowers Google Cloud users to leverage generative AI capabilities seamlessly within their cloud provider's network. As a result, users can enjoy an extra layer of security without having to rely on external resources.

Having multiple options for generative applications is highly beneficial. The availability of resources beyond Anthropic's Claude and OpenAI's API allows users to choose the most suitable platform for their specific needs. This encourages healthy competition among providers, ultimately leading to better products and services for developers and businesses utilizing AI-driven solutions.

Assisting Accountants with Zero-shot Machine Learning

It seems like Zero-Shot Classification should be impossible, right? How could a machine learning model classify an object with a label that it has never seen before?

Traditional classification involves lots of labeled examples, but the trained model is limited to the set of labels from the training set. How, on earth, could we train a model to emit a label that is completely novel? With the rise of Large Language Models (LLMs), there is a new path on this quixotic quest. Numerous problems are being tackled creatively through prompt engineering of the input to these models, from coaxing out the perfect image from DALL-E or learning to beat humans in conversational games (for example: Cicero). By following these lateral uses of the model, we can find our way to classifying objects with labels the model has never seen before.

Business Problem

A core function of accounting is proper labeling (aka "coding") of transactions. The accuracy of this step is crucial for building actionable financial reports for the stakeholders of a company. The process of labeling each individual transaction that crosses a company’s books is painstaking and traditionally very manual. More recently, tools have been developed to bucketize some subset of transactions via some hand-crafted heuristics based on the vendor or the description of the transaction. But these tools often fall short as they don’t have enough information to accurately label them automatically. For the transactions that fall through, the accountant must manually triage each one. Often, the accountant must seek further clarification from the client about the transaction, such as what was purchased, or the intended use of the item, or even who was present, to make an accurate decision on how to book it.

Machine learning is perhaps an obvious tool to aid this flow, but it does run into trouble. Within the accounting world, the labels chosen for transactions are consistent per accountant/client relationship but often globally inconsistent. So, what helps speed one accountant becomes a roadblock for another.

Similarity as First Pass

As we’ve talked about in other blog posts (part 1 and part 2 ), we use the similarity of generated embeddings to automatically label transactions. By casting a transaction description to a vector via a trained embedding model, we can find highly-similar transactions and then look up how they were labeled by the accountant (or other algorithms) in the past. But this falls down in 2 main cases.

A common transaction, easily identifiable, is attributed to multiple use cases, such as an Amazon purchase. It could be practically any label as Amazon sells such a diverse range of products.
A completely unidentifiable transaction, such as a check or an unlabeled invoice.

Both of these cases could return multiple possible labels via the similarity approach, just as an accountant may mentally call up the common past labels for this type of expense.

The next step in many accountants’ workflow is to seek out more information from the client. Through the responses, the accountant hopes to gather enough context to correctly label the transaction in the books. Here is where Zero-Shot Classification can help!

Keep Reading

ChatGPT for Accounting: How Digits is using Generative Machine Learning to transform finance

ChatGPT, one of the largest and most sophisticated language models ever created, has recently become a household topic of conversation. If you've been wondering how this incredible technology can be applied to the accounting and finance space, this is the article for you :)

Welcome to chapter two of our three-part series on machine learning! We kicked off with an introduction to similarity-based machine learning, and how we apply it to accounting use cases at Digits. Today, let's explore generative machine learning and what it can bring to the accounting world.

Generative machine learning has received significant attention because it opens up a completely new field of "AI". It is getting closer to fulfilling the human dream of teaching machines some form of “creativity.” Model architectures like ChatGPT, DALL-E, and T5 have provided solutions to various problems including writing text, generating photo-realistic images, and summarizing complex topics. In this blog post, we are excited to explore machine learning for natural language generation and how we are using these concepts today at Digits.

What is Generative Machine Learning?

Traditionally, machine learning has been applied to classification problems, where you take some text and distill it into different buckets or categories. You can think of the text as being "encoded" into those categories. For years, the dream has been to push beyond that, and train a machine learning model that can actually generate text, rather that just classify it. How might that work?

Researchers began building on this approach by experimenting with model architectures that first reduce information through a model encoder and then “decompress” the information back into human-readable text through a decoder. They made a significant breakthrough in 2017 when they presented an encoder-decoder model architecture called Transformer.

encoder-decoder model architecture called Transformer

The model architecture shown above shows the encoder (left side) – decoder (right side) structure. Over the last few years, researchers further refined this architecture by increasing the number of model weights, which allows capturing more “knowledge” into the model, and by fine-tuning the decoder side to respond to decoder “instructions.” The fact that models can now use “instructions” as model inputs unlocked meta-learning, where a model can generate text for untrained scenarios. For example, we can train a model on translating English-German and English-French, and through “instructions,” the model can then be prompted to translate between German and French.

To generate text for a given input text, the decoder model uses the reduced information as an embedding it obtains from the encoder and the initial instruction to generate the first-word token for the generated text. Then it uses the newly generated token together with the instructions and the embedding to generate the second-word token for the text. This generation loop continues until the decoder has reached its maximum sequence lengths (usually 512 or 1024 tokens) or the decoder produces a stop-token instructing the decoder that any text generated following is considered padding. The generated text will then reflect the model’s response to the input text and the given instruction. Here is an example:

Keep Reading

Assisting Accountants with Similarity-based Machine Learning

Just last year, we released Boost to help accountants save time by automating their work. Boost instantly spots inconsistencies in their clients' ledgers, saving time and embarrassment! Every second, Digits sifts through every single transaction and performs a deep analysis. Boost alerts accountants if it finds errors like transactions in unexpected categories and suggests categories for transactions with missing categories.

The simplicity of the product is thanks to the powerful technology we built to make this possible.

With this three-part series, Digits’ Machine Learning team provides a look behind the scenes at how it works.

In this first blog post, we will explain why machine learning is crucial for accounting and how we detect categories for banking transactions with similarity-based machine learning models. In parts two and three, we will dive into how we use machine learning to accelerate the interactions between accountants and their clients.

Why Machine Learning?

Machine learning is a versatile tool for many applications, including accounting. For example, if we want to categorize transactions correctly, we can look at similar transactions and mimic their existing categorizations. We could find highly-similar transactions through traditional statistical methods like determining the Levenshtein distance between the transaction descriptions, but those methods would have failed in the following scenarios:

Finding similar transactions with Machine Learning

Because of the number of failure cases of traditional statistical methods, we decided to develop a custom machine learning-based solution.

Keep Reading

Training and Deploying State of the Art Transformer Models at Digits

Understanding banking transactions as they happen, in real-time, is core to our mission with Digits Search. You can’t answer important finance questions with bad data.

Transaction descriptions contain valuable information which helps us understand and communicate our customers’ business activity. The information we extract is then indexed and made available via Digits Search, and presented in a far more human-readable and intuitive manner than they would get from reviewing their raw bank or credit card statements.

Here we wanted to share a peek behind the curtains on how we extract transaction information with Natural Language Processing (NLP) at Digits. You’ll learn how we apply state-of-the-art Transformer models to this problem and how we go from an ML model idea all the way to a production integration with our Digits Search product.

Our Plan

Information can be extracted from unstructured text through a process called Named Entity Recognition (NER). This NLP concept has been around for many years, and its goal is to classify tokens into predefined categories, such as dates, persons, locations, and entities.

For example, the transaction below could be transformed into the following structured format:

We had seen outstanding results from NER implementations applied to other industries and we were eager to implement our own banking-related NER model. Rather than adopting a pre-trained NER model, we envisioned a model built with a minimal number of dependencies. That avenue would allow us to continuously update the model while remaining in control of “all moving parts.” With this in mind, we discarded available tools like the SpaCy NER implementation or HuggingFace models for NER. We ended up building our internal NER model based only on TensorFlow 2.x and the ecosystem library TensorFlow Text.

Keep Reading