
DeepSeek-R1: Performance and Quality Testing on Seeweb CLOUD GPU MI300X

This article explores the performance and inference quality of the open-source LLM DeepSeek-R1 671B on Seeweb's CLOUD GPU MI300X. Using the Ollama framework, the model was tested for token throughput, GPU RAM usage, and translation quality in Arabic–English machine translation. Results show that DeepSeek-R1 rivals top-tier models like GPT-4o, offering strong zero-shot performance and high efficiency even on massive hardware setups.

Introduction

Large Language Models (LLMs), the linguistic Deep Learning models at the base of the Generative AI boom that started with the appearance of ChatGPT in November 2022, are enormous neural networks (whose unit of measurement is billions of parameters) and are constantly growing. In the evolution of these models, with their pursuit of ever more impressive performance and coverage of increasingly complex use cases – which in turn enable ever-smarter applications – the dominant paradigm remains “bigger is better”. Creating neural networks of ever-larger dimensions makes it possible to capture increasingly sophisticated representations of human languages (the plural is important – as will also become evident a little further down – because LLMs are now a globalized industry) when it comes to both comprehension and generation.

So, even though all LLMs are large, some are MUCH larger than others, and those are precisely the ones that capture users’ imagination, newspaper headlines, and, inevitably, the biggest slice of revenue for the companies and laboratories that develop and train them. In fact, even when these “top-tier” models are placed in the public domain, or are released under an open source license, for most organizations and businesses it is impractical – if not impossible – to host and run inference on LLMs of the size and power of, e.g., GPT-4o, Gemini 2.5, or Grok-1 on their own premises. That is the most important reason for the popularity and growth of LLM-based services, in which the functionality of top-tier models is accessed – and paid for – by customers through API calls to a specialized service provider.

The status quo on these top-tier models, however, has recently begun to be challenged. In January 2025, for example, DeepSeek-R1 was disclosed to the general public: it is an LLM that rivals, and in several use cases surpasses, the linguistic capabilities of models like GPT-4o; it is indeed VERY large, since its most capable, flagship version amounts to 671 billion parameters; and it has been released under the liberal MIT open source license. Although not the first open-source LLM in the hundred-billion-parameter range, DeepSeek-R1 also caught the attention of so many because of some clever architectural choices, which hold some promise when it comes to hosting a 671-billion-parameter LLM on hardware that is commercially available, although – obviously – top-of-the-line.

The most important metrics for evaluating whether such hosting is even feasible boil down to GPU RAM occupation (that is, how many GB of RAM the GPUs must have to hold the necessary software components and the state of the model in memory at inference time) and token throughput (that is, how many textual sub-words the model can generate per second).

In the rest of this post, we will describe an experimental setup for the largest DeepSeek-R1 model, which we deployed on a Cloud instance available from the Seeweb Cloud provider: the mighty CLOUD GPU MI300X, which bundles together 8 MI300X GPUs by AMD, each with 192 GB of RAM on board (for a total of 1536 GB of GPU RAM) and 16 TB of disk space. Moreover, we will share the results of our experiments running the model, together with some significant insights on DeepSeek that we could elicit from those experiments.

No-hassle installation

Although one might justifiably think that installing a behemoth of an LLM like DeepSeek-R1 in its 671 billion parameters incarnation (DeepSeek-R1 671B in the remainder) on a sophisticated machine like the CLOUD GPU MI300X would be a challenging task, requiring specialized Cloud and MLOps skills, it turned out to be rather simple. We directly leveraged Ollama, which is one of the most popular choices for downloading, hosting and serving LLMs for private use on one’s own machine. The Ollama framework has been designed and built primarily for personal use; as such, it prioritizes simple installation, quick deployment and ease of use over configurability, flexibility, enhanced performance or other quality factors that may become quite relevant in an Enterprise context. Other frameworks such as vLLM and llama.cpp hold promise to be more efficient at running resident LLMs. However, Ollama is unparalleled when it comes to enabling quick experimentation with a given LLM, and is rather fast to integrate the newest models and make them available to its user base.

In fact, Ollama supports a variety of DeepSeek-R1 models (see https://ollama.com/library/deepseek-r1), starting with a number of smaller-footprint distilled versions of the flagship model[1], which range from as little as 1.5 billion parameters up to 70 billion. Of course, Ollama also supports the full DeepSeek-R1 671B, providing binaries with three levels of quantization for the model weights: 4-bit, 8-bit, and 16-bit floating point precision. The latter is the closest to the precision with which DeepSeek-R1 was originally trained, but is also the one that requires the most RAM to run.
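As a rough back-of-envelope check (our own estimate, not an official figure), the memory footprint of each quantized variant can be approximated by multiplying the parameter count by the bytes per weight; the ~4.8 bits/weight figure for the mixed `q4_K_M` scheme is an assumption on our part:

```python
# Back-of-envelope GPU RAM estimates for the DeepSeek-R1 671B weights.
# These are rough approximations: they ignore activation memory, the KV
# cache and framework overhead, and the ~4.8 bits/weight average for
# q4_K_M is an assumption about the mixed 4-bit quantization scheme.
PARAMS = 671e9
TOTAL_VRAM_GB = 1536  # 8 x AMD MI300X with 192 GB each

bytes_per_weight = {
    "q4_K_M": 4.8 / 8,  # assumed average for mixed 4-bit quantization
    "q8_0": 1.0,
    "fp16": 2.0,
}

for variant, bpw in bytes_per_weight.items():
    size_gb = PARAMS * bpw / 1e9
    share = 100 * size_gb / TOTAL_VRAM_GB
    print(f"{variant}: ~{size_gb:.0f} GB of weights, ~{share:.0f}% of total VRAM")
```

Adding runtime overhead on top of these weight-only estimates brings them reasonably close to the peak VRAM percentages we measured in our tests.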

Getting DeepSeek-R1 671B installed was as simple as installing the Ollama framework for Linux, and then issuing the command:

ollama run deepseek-r1:671b-[quantization id]

where the quantization id can be any of the following: `q4_K_M`, `q8_0` or `fp16`.

After that, Ollama deals seamlessly with downloading all the binaries, initializing and loading the LLM in memory, and managing any swapping of locally resident models by loading and unloading them from GPU memory as requested by the run command above.

Ollama also provides an endpoint to query the model, either through REST, typically at

http://localhost:<port-num>/api/

or directly using the `ollama` client library, which is available as a Python package.
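As an illustrative sketch (the model name and prompt are placeholder values; `/api/generate` and port 11434 are Ollama's standard endpoint and default port), a query through the REST interface from Python might look like this:

```python
import json
from urllib import request

# Ollama's default port is 11434; /api/generate is its standard
# single-prompt completion endpoint.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model: str, prompt: str) -> dict:
    """Assemble the JSON body expected by Ollama's /api/generate endpoint."""
    # stream=False asks Ollama to return one complete JSON response
    # instead of a stream of partial tokens.
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model: str, prompt: str) -> str:
    """Send the prompt to a locally running Ollama server, return the completion."""
    body = json.dumps(build_payload(model, prompt)).encode("utf-8")
    req = request.Request(OLLAMA_URL, data=body,
                         headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server with the model already pulled):
# print(generate("deepseek-r1:671b-q4_K_M", "Why is the sky blue?"))
```

The `ollama` Python package wraps the same HTTP API with a more convenient interface, but the raw endpoint is useful when integrating from other languages or tools.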

Model performance

Leveraging Ollama, we carried out some tests on all three model versions discussed above: `q4_K_M`, `q8_0` and `fp16`. Below you can find the results of our performance testing. Notice that we limited our testing to inference, and did not attempt any learning task, such as fine-tuning, since we believe that the majority of use cases for DeepSeek-R1 671B on the part of an end user (individual or organization) will involve zero-shot inference and prompt engineering.

model                    | framework | GPUs used | VRAM % | GPU % | token/s (MT)
deepseek-r1:671b-q4_K_M  | ollama    | 8         | 31     | 13    | 17.5
deepseek-r1:671b-q8_0    | ollama    | 8         | 52     | 14    | 17.7
deepseek-r1:671b-fp16    | ollama    | 8         | 93     | 14    | 16.1

From this table, we can observe several interesting things.

First of all, we were able to take advantage of the full set of AMD GPUs on board the CLOUD GPU MI300X. This happened out of the box and did not require any special setting or configuration, which is a testament to the quality of the machine setup provided by Seeweb on its high-end Cloud instances, as well as to the great user-friendliness of the Ollama serving framework.

Second, the metrics show how the task of serving a very large GenAI model like DeepSeek-R1 671B is essentially memory-bound. The VRAM occupation values in the table should be read as peak usage; also, the RAM occupation typically spreads quite evenly across all 8 GPUs. As expected, the fp16 version claims almost all of the available GPU RAM. But even in that case, the GPUs do not look computationally stressed: in all three cases the percentage of GPU computational power necessary to run LLM inference never surpassed 15% on any GPU (and again, the load was shared more or less equally between all the GPUs).

Third, the token throughput does not change very much when running a given inference task on each of the three versions of DeepSeek-R1 671B. We used a Machine Translation (MT) task for this test (more on that below), and only with the larger fp16 version did the throughput fall below 17 token/s. This is a moderate throughput, which is definitely acceptable for most batch inference tasks; it is likely adequate, although somewhat on the slow side, also for many interactive tasks.
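To put that throughput in perspective, here is a back-of-envelope estimate of the wall-clock time for a batch job; the average of 60 generated tokens per translation is purely our assumption, used for illustration:

```python
# Rough batch-time estimate at the slowest measured throughput.
tokens_per_second = 16.1           # fp16 variant, as measured
avg_tokens_per_translation = 60    # assumed average output length
n_sentences = 1000                 # size of a hypothetical batch

total_seconds = n_sentences * avg_tokens_per_translation / tokens_per_second
print(f"~{total_seconds / 60:.0f} minutes for {n_sentences} translations")
```

At roughly an hour per thousand sentences under these assumptions, the throughput is comfortable for batch work, while an interactive user would wait a few seconds per sentence.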

It is worth remarking that the serving framework may play a significant part particularly with this metric. Selecting a different serving framework and configuring it opportunely for the CLOUD GPU MI300X instance may unlock higher throughputs for DeepSeek-R1 671B, as reported for example here: https://blog.seeweb.it/complete-guide-to-deploying-deepseek-r1-on-amd-mi300x-gpus-open-webui-enterprise-ai-solution/#elementor-toc__heading-anchor-3.

The best way to think about our token throughput results is that they indicate a solid lower bound for what can be obtained with a powerful LLM server such as the CLOUD GPU MI300X.

Inference quality

As mentioned above, we carried out most of our experiments using a Machine Translation task for DeepSeek-R1 671B. Once it became clear that the basic performance metrics such as VRAM occupation, GPU % and token/s were reasonably stable across the 4-bit, 8-bit and fp16 versions of DeepSeek-R1 671B, the next question that came naturally was whether the quality of inference would instead vary significantly across those three models. To evaluate that, MT is a good task, because it is well defined, there are benchmarks and reference evaluation datasets that can be readily used, and there are several quality metrics that make it simple to measure the quality of the translations automatically and objectively.

For our tests, we selected MT between English and Arabic, as we were curious to experiment with the multilingual capabilities of DeepSeek-R1 on a language that is typically not prevalent in most LLM training corpora.

We leveraged a recent dataset that has been made publicly available on Huggingface at: https://huggingface.co/datasets/mohamed-khalil/ATHAR. That dataset comprises 66000 Arabic sentences, drawn from a variety of texts in classical Arabic literature, together with their expert-curated English translations, and was introduced in the following paper: https://arxiv.org/pdf/2407.19835.

That paper is especially relevant for our purposes since – among other things – it reports the results achieved by a variety of LLMs in translating the dataset from Arabic to English. Among the LLMs tested by the authors on this specific MT dataset, there is one in the same “top-tier” class as DeepSeek-R1 671B (hence directly comparable in terms of quality), i.e., GPT-4o, which was developed by OpenAI and released in May 2024. The quality metrics reported in the paper are METEOR, ROUGE and SacreBLEU, which are all well-known and well-accepted metrics for MT tasks.

For our experiments, we adopted the METEOR metric, which was introduced in 2005 by this paper: https://aclanthology.org/W05-0909.pdf. It is a metric whose scores are usually particularly well aligned with human expert assessment of translations. Also, the METEOR metric was originally developed and evaluated using Arabic-to-English, as well as Chinese-to-English, translation datasets. 

For our experiments we randomly selected one thousand sentence pairs from the aforementioned dataset, and engineered a simple system prompt to ask DeepSeek-R1 671B to translate the sentence samples in the Arabic-to-English direction:

“You are an expert Arabic speaker and translator. You are also a polyglot. I am going to give you a sentence in Arabic. You will translate it in English. Return a JSON object that contains two fields: “Arabic” and “English”, with, respectively, the original Arabic sentence and the corresponding English translation. Do not return anything else. You will be penalized if you return anything besides the JSON object described above.”
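A minimal sketch, with helper names of our own choosing, of how the JSON object requested by this prompt can be extracted from a raw model response; reasoning models like DeepSeek-R1 often prepend a thinking trace, so we locate the JSON object rather than assume the whole response parses:

```python
import json

def parse_translation(raw_response: str) -> str:
    """Extract the English translation from the JSON object the prompt requests.

    DeepSeek-R1 may wrap its answer in extra text (e.g. a <think>...</think>
    trace), so we slice from the first '{' to the last '}' before parsing.
    """
    start = raw_response.find("{")
    end = raw_response.rfind("}") + 1
    if start == -1 or end == 0:
        raise ValueError("no JSON object found in model response")
    obj = json.loads(raw_response[start:end])
    return obj["English"]

# Example with a well-formed (hypothetical) response:
sample = '<think>...</think>{"Arabic": "مرحبا", "English": "Hello"}'
print(parse_translation(sample))  # Hello
```

In practice a small fraction of responses may still violate the requested format despite the penalty clause in the prompt, so such a parser should be paired with error handling for the offending samples.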

We measured the quality of the Arabic-to-English translations that we obtained this way, using the implementation of the METEOR metric included in the well-known NLTK python package, which is a reference library for Natural Language Processing in python. This qualifies as a zero-shot MT experiment, as there was no opportunity for the LLM to learn or see any examples from the selected datasets.
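For reference, a thin wrapper around NLTK's METEOR implementation might look like the following sketch; it assumes NLTK is installed with its `wordnet` corpus downloaded, and uses simple whitespace tokenization:

```python
def sentence_meteor(reference: str, hypothesis: str) -> float:
    """Score one hypothesis translation against one reference with NLTK METEOR.

    NLTK is imported lazily; it requires the 'wordnet' corpus to be
    downloaded first via nltk.download('wordnet').
    """
    from nltk.translate.meteor_score import meteor_score
    # Recent NLTK versions expect pre-tokenized input: a list of reference
    # token lists, plus the hypothesis token list.
    return meteor_score([reference.split()], hypothesis.split())

# Example usage (requires NLTK and its wordnet data):
# score = sentence_meteor("the cat sat on the mat", "a cat sat on the mat")
# Averaging sentence_meteor over all sampled pairs yields a corpus-level score.
```

Whitespace tokenization is a simplification here; a language-aware tokenizer would be a more careful choice, especially on the Arabic side.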

We then flipped the system prompt and tested, in the same way, the translations in the English-to-Arabic direction. The table below presents our results; the MT-specific results are highlighted. Notice that the METEOR scores are averaged over all the translation samples.

model                    | framework | VRAM % | GPU % | token/s (MT) | METEOR score A-to-E | METEOR score E-to-A
deepseek-r1:671b-q4_K_M  | ollama    | 31     | 13    | 17.5         | 0.393               | 0.167
deepseek-r1:671b-q8_0    | ollama    | 52     | 14    | 17.7         | 0.371               | 0.163
deepseek-r1:671b-fp16    | ollama    | 93     | 14    | 16.1         | 0.392               | 0.164

There are a few significant takeaways from our MT experiment that are worth discussing.
First of all, DeepSeek-R1 671B proves to be at least as good as other top-tier LLMs in Arabic-to-English translation: in their paper, the dataset authors report that GPT-4o achieved an average 0.357 METEOR score in analogous zero-shot conditions. DeepSeek-R1 671B in fact did even a little better on a random subsample of the same dataset.

Second, all three versions of DeepSeek-R1 671B provide reasonably stable results; the small dip observed for the 8-bit quantized model in the Arabic-to-English score may be just an artifact. What seems remarkable is that the 4-bit quantized model is on par with the fp16 model.

The most important insight, though, is possibly the third one, that is, the significant gap in the METEOR score displayed by DeepSeek-R1 671B between the Arabic-to-English and the English-to-Arabic directions.

Our best hypothesis for this result is, on the one hand, related to the well-known imbalance that exists in most LLM training corpora. Their zero-shot performance on so-called low-resource languages (languages with comparatively less digital textual data that can be amassed in a training corpus, with respect to languages such as English, Spanish or Chinese) is often lower. On the other hand, that alone cannot explain the large gap between the two directions, both of which involve a low-resource language (Arabic). We postulate that – since the main functionality of a GenAI model like an LLM is precisely the generation of text – when generating translations into a high-resource language like English, the LLM can better adjust the text it produces to what can reasonably be expected of an English sentence carrying the desired semantics; it has less such leeway when generating an equivalent sentence in a low-resource language like Arabic.

Unfortunately, since no zero-shot English-to-Arabic translation scores were reported by the dataset authors in their LLM experiments, it was not possible in the context of this experiment to further investigate this phenomenon and validate or dismiss the hypothesis above. Further testing of English-to-Arabic translation by top-tier LLMs, with MT-specific objective metrics like METEOR, is needed to shed further light on this.

Conclusions

We have reported our experience privately hosting and serving DeepSeek-R1 671B, an open source top-tier LLM with hundreds of billions of parameters, on top-of-the-line hardware made available as a Cloud instance by Seeweb through their CLOUD GPU MI300X commercial offering.

We showed that this is possible in a simple, out-of-the-box way, by leveraging the Ollama LLM hosting framework, even for the largest variant of DeepSeek-R1 671B, the fp16 precision one. Besides, the basic performance when running inference on the LLM hosted with this particular setup is adequate for most batch tasks, as well as many interactive ones.

With this setting, we were able to run some GenAI tasks seamlessly. We focused on English ←→ Arabic Machine Translation tasks, and benchmarked the translation quality of DeepSeek-R1 671B in both directions. We observed that the performance of the open source DeepSeek-R1 671B is on par with, and in some cases may even surpass, that of other top-tier LLMs, which are not readily available for private hosting. We also observed that the translation quality differs considerably between the Arabic-to-English and English-to-Arabic directions, being significantly lower for the latter.

This should be taken as a cautionary tale of the pitfalls of working on Generative AI tasks involving low-resource languages, even when employing top-tier LLMs.

 

This article was written by Giuseppe Valetto and Matteo Mendula of Deep Learning Italia Srl and AI Venture Builder.


[1] A distilled LLM is a fine-tuned version of another, larger LLM. The distillation process makes it possible to capture the vast majority of the linguistic and reasoning patterns learned by the original LLM during its extensive training in a smaller neural network with an analogous architecture. There is always some price to pay in terms of quality loss for the distilled models, but it is often small, and the results of distillation are typically better than what would be achieved by training the smaller network from scratch on the same corpus used for the original, non-distilled LLM.
