Apple's DCLM models are currently among the top-performing truly open-source models. But what does Apple mean by truly open source, and how is that different from other prominent open-source models? By truly open source, Apple means that the weights, training code and dataset are all freely accessible alongside the model.
It's unexpected to see Apple leading the AI race, particularly with fully open-source models, while even prominent AI leaders like OpenAI don't release open-source models. Yet Apple has introduced its latest 7B-parameter model for anyone to use or adapt. DCLM-7B has already outperformed Mistral-7B on certain benchmarks and is approaching performance levels comparable to similar models from Meta and Google.
Developed by the DataComp for Language Models (DCLM) team, DCLM-Baseline-7B is a 7-billion-parameter language model that demonstrates the impact of systematic data curation. This decoder-only Transformer, trained primarily on English text, is available under the Apple Sample Code License. The model was released in June 2024, and all the relevant code, instructions and models can be accessed through its GitHub repository.
DCLM-7B is considered one of the best-performing open-source language models, and here’s why:
Availability: The model is available on Hugging Face and integrated with the Transformers library, so you get easy access and seamless integration into various NLP workflows.
The DCLM-7B model shows strong performance in various benchmarks, particularly when compared to other models in the 7B parameter range. Here is a detailed breakdown of its performance metrics:
The key performance metrics of DCLM-7B are as follows:
When compared to other 7B models, DCLM-7B stands out in several key areas:
Open Weights, Closed Datasets:
Open Weights, Open Datasets:
Before getting started with DCLM-7B, make sure you have set up your account (see instructions here).
Once you have an account, you can launch a new virtual machine on Hyperstack:
For inference at full precision (float32), we recommend the NVIDIA L40, as this model requires around 30 GB of GPU VRAM.
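That figure follows from a quick back-of-the-envelope calculation: at float32, each of the model's roughly 7 billion parameters takes 4 bytes, so the weights alone need about 26 GiB, and activations plus the KV cache push the total towards 30 GB. A minimal sketch of that arithmetic:

# Back-of-the-envelope VRAM estimate for DCLM-7B (parameters only;
# activations and the KV cache add several more GiB on top)
PARAM_COUNT = 7e9  # approximate number of parameters

for precision, bytes_per_param in [("float32", 4), ("float16/bfloat16", 2)]:
    gib = PARAM_COUNT * bytes_per_param / 1024**3
    print(f"{precision}: ~{gib:.0f} GiB for the weights alone")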
Also Read: How to Run Jupyter Notebooks Like a Pro on Hyperstack
SSH into your machine (see instructions here), then install the required packages as shown below:
# install python3-pip and python3-venv
sudo apt update -y
sudo apt install python3.10-venv python3-pip -y
# optionally, create a new virtual environment and activate it
python3 -m venv .env
source .env/bin/activate
# install transformers
pip3 install transformers
# install open_lm
pip3 install git+https://github.com/mlfoundations/open_lm.git
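Optionally, you can confirm that PyTorch can see the GPU before downloading the model. PyTorch should be pulled in as a dependency of open_lm; if it isn't, install it explicitly with pip3 install torch. A minimal sanity check:

# Sanity check that the GPU is visible to PyTorch before downloading ~26 GiB of weights
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"Device: {props.name}, VRAM: {props.total_memory / 1024**3:.0f} GiB")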
The Python code below downloads the model and runs inference. Please note that we explicitly move both the model and the input tensors to the CUDA device.
from open_lm.hf import *  # registers the open_lm architecture with Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

# download the tokenizer and model, and move the model to the GPU
tokenizer = AutoTokenizer.from_pretrained("apple/DCLM-Baseline-7B")
model = AutoModelForCausalLM.from_pretrained("apple/DCLM-Baseline-7B").to('cuda')

# tokenise the prompt and move the input tensors to the GPU as well
inputs = tokenizer(["Machine learning is"], return_tensors="pt").to('cuda')

# sampling parameters for generation
gen_kwargs = {"max_new_tokens": 50, "top_p": 0.8, "temperature": 0.8, "do_sample": True, "repetition_penalty": 1.1}

# generate a completion and decode it back to text
output = model.generate(inputs['input_ids'], **gen_kwargs)
output = tokenizer.decode(output[0].tolist(), skip_special_tokens=True)
print(output)
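If your GPU has less than the recommended ~30 GB of VRAM, one common option is to load the weights in half precision via the torch_dtype argument of from_pretrained. The snippet below is a sketch of that variant rather than part of the official instructions; it roughly halves the memory needed for the weights, usually with only a minor impact on output quality:

# Load DCLM-7B in bfloat16 to roughly halve the VRAM needed for the weights
import torch
from open_lm.hf import *
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("apple/DCLM-Baseline-7B")
model = AutoModelForCausalLM.from_pretrained(
    "apple/DCLM-Baseline-7B",
    torch_dtype=torch.bfloat16,  # ~13 GiB for the weights instead of ~26 GiB
).to('cuda')

inputs = tokenizer(["Machine learning is"], return_tensors="pt").to('cuda')
output = model.generate(inputs['input_ids'], max_new_tokens=50, do_sample=True, top_p=0.8, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))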
While DCLM-Baseline-7B is a strong and capable model, it does have some limitations that you should be aware of:
Sign up now to get started with Hyperstack. To learn more, you can watch our platform demo video below:
What framework is used to train DCLM-7B?
DCLM-7B is trained using PyTorch with OpenLM.
Where can I find the DCLM-7B model and dataset?
The model is available on Hugging Face (apple/DCLM-Baseline-7B) and via the DCLM GitHub repository, and the dataset is also hosted on Hugging Face.
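If you want to inspect the training data programmatically, it can be streamed with the Hugging Face datasets library (pip3 install datasets). The dataset id below is our assumption for the baseline data, so verify the exact name on the Hub before use:

# Stream a few records from the DCLM baseline dataset without downloading it in full.
# The dataset id is an assumption -- check the DCLM docs / Hugging Face Hub for the exact name.
from datasets import load_dataset

dataset = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
for i, record in enumerate(dataset):
    print(record.get("text", "")[:200])  # print the first 200 characters of each document
    if i >= 2:
        break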
What is the total number of training tokens used for DCLM-7B?
DCLM-7B was trained on a total of 2.5 trillion tokens.
Are there any specific ethical considerations when using DCLM-7B?
Yes, be aware of potential biases in the training data and ensure responsible usage with appropriate safeguards.