
The Bug Bounty Gold Mine: AI/ML third-party packages

The AI race has created a cesspool of third-party packages

Hey all, welcome to CramHacks! If you’re reading this, it’s likely no surprise that cybersecurity is commonly an afterthought. I’ll stay neutral on this for the time being, but the latest in Artificial Intelligence and Machine Learning (AI/ML) certainly hasn’t proved to be any different.

This post intends to document and highlight concerning trends that I’ve been seeing in regard to AI/ML development.


Software Supply Chain Security

What is software supply chain security? For those of you who are subscribed, you’ve likely seen my rants. For simplicity’s sake, we’ll go with a standard definition provided by FOSSA:

“Software supply chain security refers to the practice of identifying and addressing risks in the technologies and processes that are part of software development. The links in the software supply chain extend from development to deployment and include open source dependencies, build tools, package managers, testing tools, and plenty in between.”

Supply chain security isn’t new, but it’s seen tremendous growth in the last few years. To put the risk in perspective, Sonatype recorded twice as many supply chain incidents in 2023 as in 2019-2022 combined.

Artificial Intelligence

Similarly, there are numerous definitions for artificial intelligence. You know… Humans are capable of extraordinary things, but agreeing on a definition surely isn’t one of them. Anywho, Stanford Professor John McCarthy coined the term Artificial Intelligence (AI) in 1955 and defined it as:

“It is the science and engineering of making intelligent machines, especially intelligent computer programs. It is related to the similar task of using computers to understand human intelligence, but AI does not have to confine itself to methods that are biologically observable.”

Odds are that you’ve only truly cared about AI sometime in the last few years. So, if you missed it, I’ll highlight that this term was coined in 1955!

AI is more than just natural language processing (NLP) and Large Language Models (LLMs), such as OpenAI’s ChatGPT and Google’s Bard. But given their adoption, I suspect the vast majority of security research and tooling will revolve around these.

Model Cards

Model cards in AI are an approach to improving transparency and trustworthiness in artificial intelligence systems. These cards serve as standardized documents that provide critical information about a machine learning model, including its intended purpose, performance, and potential biases.

For example, look at the (now archived) GPT-3 Model Card, last updated in September 2020. The card includes:

  • Model Details (Date, Type, Version, Papers & Samples)

  • Intended model use

  • Data, Performance, and Limitations

Others, like Facebook’s Llama model card, also document the training hardware and software along with evaluation results.

Hugging Face, which has built a platform for collaborating on models, datasets, and applications, has invested quite a lot of time and resources into standardizing model cards. In December 2022, Hugging Face released a model card guidebook and resources, such as an annotated model card template detailing how to fill the card out.
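Model cards have also become machine-readable. As a quick illustration, here’s a minimal sketch using the ModelCard helper from a recent version of the huggingface_hub package to pull a card and inspect its structured metadata; the repo ID is just an example, and which fields are populated varies by model:

from huggingface_hub import ModelCard

# Load a model card (the repo's README.md plus its YAML front matter).
card = ModelCard.load("openai-community/gpt2")

# Structured metadata parsed from the YAML front matter.
print(card.data.license)   # e.g. a license identifier such as "mit"
print(card.data.tags)      # e.g. task/pipeline tags, if present

# The free-form body covers intended use, limitations, bias notes, etc.
print(card.text[:500])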

Model cards are still, for the most part, written by hand, and they don’t directly bolster software supply chain security. Even so, the detail they capture (model specifics such as date, type, and version, intended use, data sources, performance metrics, and limitations) is invaluable. It clarifies how a model might affect the security and integrity of the systems it’s integrated into, and it helps stakeholders make well-informed decisions about adopting and managing AI within those systems.

None of this makes a model “open-source,” by the way. I keep seeing models labeled as open-source even though the training data isn’t publicly accessible 🤔.

AIBOM / MLBOM

Those familiar with supply chain security are likely well-acquainted with the term “software bill of materials” or “SBOM.”

Let me introduce you to the “artificial intelligence bill of materials” or “AIBOM” for short. Seemingly the industry can’t agree on AIBOM vs MLBOM, but yes, MLBOM stands for “machine learning bill of materials.”

Do I like it? No, not really. SBOMs are a mess, and AIBOMs are even more so. But that doesn’t mean we shouldn’t work towards making them both more viable.

Well, the U.S. Air Force program AFWERX recently granted an innovation contract to Manifest, in partnership with Tufts University, to tackle the problem of AI supply chain security, and more specifically, the management of artificial intelligence bills of materials (AIBOMs).

At first, I looked at the GH repository for Manifest’s MLBOM, which I hated because the naming is super confusing. I know AI and ML aren’t the same, and I know that they know, but the documentation refers to it as both AIBOM and MLBOM, which makes me 😢.

The GH repo details the proposed MLBOM model, which, in my opinion, looks very similar to an ideal model card. So now I have a dilemma: is it worth pursuing AIBOMs, or should we develop model cards and SBOMs separately and then aggregate their details? This seems to resemble CycloneDX’s work. At first glance, the most significant additions are library/dependency details and attestations for integrity validation.
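To make that comparison concrete, here’s a rough, purely illustrative sketch of the kind of record an AIBOM entry might carry: model metadata plus pinned dependencies and an integrity hash. The field names are my own shorthand, not Manifest’s or CycloneDX’s schema, and the file paths and identifiers are hypothetical:

import hashlib
import json

def aibom_model_entry(name, version, weights_path, training_data, dependencies):
    # Hash the weights artifact so downstream consumers can verify its integrity.
    with open(weights_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return {
        "name": name,
        "version": version,
        "trainingData": training_data,   # provenance of the dataset(s)
        "dependencies": dependencies,    # pinned libraries used to train/serve
        "hashes": {"sha256": digest},
    }

entry = aibom_model_entry(
    name="example-classifier",
    version="1.2.0",
    weights_path="model.safetensors",              # hypothetical artifact
    training_data=["internal-dataset-2023-10"],    # hypothetical dataset ID
    dependencies={"torch": "2.1.0", "transformers": "4.35.0"},
)
print(json.dumps(entry, indent=2))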

Manifest has released a whitepaper, “MLBOMs: Transparency into AI” (ungated link here), which I can appreciate. It felt honest about how many unknowns remain and how far we are from a viable solution; there is a need for one, though, and the community will have to contribute. Two things I’d note from the whitepaper are the discussions around automating AIBOM generation by scraping Hugging Face and by tracing model lineage.

Bug Bounty Gold Mine

“The world’s first bug bounty platform for AI/ML”

Until late 2023, huntr.com was huntr.dev, a bug bounty platform for all open-source projects. After being acquired by Protect AI in August, the domain was updated, and it now only accepts bug bounty submissions impacting AI/ML projects and dependencies.

Corporations are going nuts about AI/ML bug bounties.

Having reviewed thousands of open-source software vulnerabilities this past year, I can confidently say AI/ML dependencies are a work in progress. The amount of trivial, arbitrary command execution vulnerabilities is… concerning. But hey! It’s a race.

Here’s just one of the many examples that I’ve reviewed recently:

Command injection in Paddle
This project has well over 20K stars on GH but is passing unsanitized values to a shell command. There are several other critical security advisories very similar to this: GitHub Advisories.

import subprocess

DOWNLOAD_RETRY_LIMIT = 3  # defined at module level in the original source

def _wget_download(url, fullname):
    # using wget to download url
    tmp_fullname = fullname + "_tmp"
    # --user-agent
    # The caller-controlled `url` is interpolated, unsanitized, into a shell
    # command, so shell metacharacters in the URL can run arbitrary commands.
    command = f'wget -O {tmp_fullname} -t {DOWNLOAD_RETRY_LIMIT} {url}'
    subprc = subprocess.Popen(
        command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
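For contrast, here’s a minimal sketch of how such a helper could avoid shell interpolation altogether by passing the arguments as a list with no shell involved. The function name is mine, and this is not Paddle’s actual fix:

import subprocess

DOWNLOAD_RETRY_LIMIT = 3  # illustrative retry limit

def _safe_wget_download(url, fullname):
    # Build the argument vector explicitly; no shell ever parses the URL,
    # so shell metacharacters in `url` reach wget as literal text.
    tmp_fullname = fullname + "_tmp"
    args = ["wget", "-O", tmp_fullname, "-t", str(DOWNLOAD_RETRY_LIMIT), url]
    result = subprocess.run(args, capture_output=True)
    return result.returncode == 0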

You can read plenty more advisories and find other open-source AI/ML projects with a bug bounty program on huntr.com.

In my opinion, you do not need to be an AI/ML expert or a security expert to find qualifying bugs. Heck, I think ChatGPT could likely find many of these. That said, you can also try a static code analysis tool like Semgrep open-source, or even the full paid platform, since it’s free to use for OSS projects.
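To illustrate how low the bar can be, here’s a toy sketch using Python’s built-in ast module to flag calls that pass shell=True, which is roughly the pattern behind the Paddle advisory above. A real Semgrep rule would track taint and cut down on false positives; this just surfaces candidates for review:

import ast
import sys

def find_shell_true(path):
    # Walk the syntax tree and report any call with a shell=True keyword,
    # e.g. subprocess.Popen(command, shell=True, ...).
    with open(path) as f:
        tree = ast.parse(f.read(), filename=path)
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            for kw in node.keywords:
                if (
                    kw.arg == "shell"
                    and isinstance(kw.value, ast.Constant)
                    and kw.value.value is True
                ):
                    print(f"{path}:{node.lineno}: call with shell=True")

for path in sys.argv[1:]:
    find_shell_true(path)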

PyPI Public Package Repository

PyPI plays a huge part in AI/ML development. Most projects are written in Python and are, of course, reliant on PyPI for third-party packages.

With ~500,000 packages, PyPI is far from the largest ecosystem.

However, in terms of security advisories (as per osv.dev), PyPI is keeping pace.
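If you want to check a package yourself, osv.dev exposes a simple query API; here’s a minimal sketch using only the standard library. The package name and version below are just examples:

import json
import urllib.request

# Ask osv.dev for advisories affecting a specific PyPI package version.
query = {
    "package": {"name": "paddlepaddle", "ecosystem": "PyPI"},
    "version": "2.5.0",
}
req = urllib.request.Request(
    "https://api.osv.dev/v1/query",
    data=json.dumps(query).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    vulns = json.load(resp).get("vulns", [])

for vuln in vulns:
    print(vuln["id"], "-", vuln.get("summary", ""))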

ReversingLabs recently released their report, The State of Software Supply Chain Security 2024: Key Takeaways, which highlights some concerning trends, such as:

  • In comparison to 2022, ReversingLabs detected 400% more malicious packages in 2023

  • OpenAI API keys accounted for 19% of all secrets leaked on the PyPI platform in 2023

Let’s not forget exposed secrets. Late in 2023, news broke of more than 1,500 exposed API tokens on the Hugging Face platform. In the list of exposed secrets was a token that offered full access to Meta’s Llama 2.

This isn’t all that surprising to me, but it’ll be interesting to see how PyPI adapts to its growing threat landscape. In late 2023, PyPI began requiring 2FA for maintainers; that’s definitely a start!

Conclusion

The evolving landscape of software supply chain security, particularly in AI, presents challenges and opportunities. The development and utilization of model cards, while not directly securing the software supply chain, provide essential transparency and understanding of AI models. These manually created documents illuminate the intricate details of machine learning models, assisting stakeholders in making informed decisions.

The concept of an Artificial Intelligence Bill of Materials (AIBOM), albeit in its nascent stage and facing implementation challenges, represents a forward-thinking approach to managing AI in the software supply chain. The efforts by organizations like Manifest and initiatives such as huntr.com’s bug bounty program underscore the increasing attention towards securing AI/ML components.

Overall, the field is in a dynamic state, with ongoing efforts to standardize, secure, and understand AI components in software systems. As AI becomes more intertwined with everyday technology, robust security measures and transparent documentation like model cards and AIBOMs become increasingly vital. The journey towards a secure and transparent AI-integrated software supply chain is challenging but essential for the future of reliable and safe technology.