Cline: Use Local Workstation


My son recently upgraded his computer, so I inherited a five-year-old machine with an RTX 3060. Let's see whether it can be used with common tools and AI agents.

Allow me to skip all the boring details of the system configuration, such as:

  • Installing a second operating system (Ubuntu 24.04 Server)
  • Installing NVIDIA drivers and configuring Docker
  • Configuring user access and remote access to the console

And instead talk through:

Set Up the Ollama Service

Now that my server is up and running in the basement, I can enjoy a quiet environment and remote access, just like with any other "I-don't-know-where-you-are" machine. First, a test of the Docker setup and drivers:

~$ sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060        Off |   00000000:08:00.0 Off |                  N/A |
|  0%   45C    P8             14W /  170W |      32MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
~# 

NVIDIA System Management Interface Output

As you can see from the output, all drivers and plugins are functioning properly, Docker has access to the GPU, and the output details match my hardware configuration.

There are multiple tools for running local models, and for large language models (LLMs) the most obvious choice is Ollama. It's really easy to get started on an Ubuntu server. Practically all you need to do is open remote access to the default service port and start the Ollama container with GPU access.

~$ sudo ufw allow 11434
Rule added
Rule added (v6)
~$ sudo docker run --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -d \
 -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
65cf24258b70796ae946500806cc3e5a642d327eb5004f96f5e8f7500c744016
~$ sudo docker ps
CONTAINER ID   IMAGE           COMMAND               CREATED         STATUS         PORTS                                             NAMES
65cf24258b70   ollama/ollama   "/bin/ollama serve"   7 seconds ago   Up 6 seconds   0.0.0.0:11434->11434/tcp, [::]:11434->11434/tcp   ollama

Now, we have a working service that accepts connections on port 11434 locally and from the network. Let's pull one of the smallest models to test the configuration.

Pulling the new model.
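If you want to reproduce the screenshot from the command line, the pull is a single docker exec call; a minimal sketch, assuming embeddinggemma (roughly 620 MB) as the small test model:

~$ sudo docker exec -it ollama ollama pull embeddinggemma   # small embedding model, quick to download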

You can run other CLI commands from the ollama container to make sure everything works as expected.
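A couple of quick checks, executed inside the container with docker exec (these mirror the regular Ollama CLI, so treat the exact flags as a sketch):

~$ sudo docker exec -it ollama ollama list   # locally available models
~$ sudo docker exec -it ollama ollama ps     # models currently loaded into memory

Now, let's try to use the Ollama API to access models.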

Please keep in mind that this is a home setup with default settings and non-secure parameters. For business configurations, please consider other tools and encrypted communications.
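Even at home, you can at least scope the firewall rule to your LAN instead of exposing the port to anything that can reach the machine; a minimal sketch, assuming a 192.168.1.0/24 home subnet:

~$ sudo ufw delete allow 11434                                     # drop the wide-open rule
~$ sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp  # allow only the local subnet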

Access Ollama Remotely

Most of the commands you run with the Ollama CLI are also available through the RESTful API. To avoid hard-coding the IP address in configurations, I made the server reachable by FQDN for the applications on my laptop. The ollama CLI uses the OLLAMA_HOST environment variable to reach the service API, which lets me control the remote machine from my laptop.

MacBook-Air-3 ~ % export OLLAMA_HOST=mylocalai.local:11434
MacBook-Air-3 ~ % ollama ls
NAME                            ID              SIZE      MODIFIED       
embeddinggemma:latest           85462619ee72    621 MB    19 minutes ago    
llava:latest                    8dd30f6b0cb1    4.7 GB    4 hours ago       
hhao/qwen2.5-coder-tools:14b    8897bf4e1dc7    9.0 GB    7 hours ago       
gpt-oss:latest                  aa4295ac10c3    13 GB     22 hours ago      
qwen2.5vl:7b                    5ced39dfa4ba    6.0 GB    24 hours ago      

CLI with Remote Ollama

The same output is available with the curl command: the /api/tags endpoint returns all local models, and piping the reply through jq gives you a formatted JSON response.

Available Models on the Server.
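For reference, the request behind that screenshot is a one-liner, assuming OLLAMA_HOST is still exported as shown above:

 ~% curl -s http://$OLLAMA_HOST/api/tags | jq                        # full model metadata
 ~% curl -s http://$OLLAMA_HOST/api/tags | jq -r '.models[].name'    # just the model names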

I have a few models handy and can run a request:

 ~% curl -s http://$OLLAMA_HOST/api/generate  -H "Content-Type: application/json" \
      -d '{ "model": "gpt-oss", "prompt": "Compose a short poem about technology.", "stream": false }' | jq
{
  "model": "gpt-oss",
  "created_at": "2025-09-24T20:04:02.119233552Z",
  "response": "Pixels pulse, a silent heartbeat,  \nWires weave the world’s new song.  \nIn silicon dreams we find our paths—  \nA future born from code and light.",
  "thinking": "User wants a short poem about technology. Should be concise. No other constraints. Just produce a poem. Could be 4-8 lines. Let's deliver.",
  "done": true,
  "done_reason": "stop",
  "context": [
    200006,
    ....
  ],
  "total_duration": 3269309710,
  "load_duration": 117729695,
  "prompt_eval_count": 74,
  "prompt_eval_duration": 51269408,
  "eval_count": 77,
  "eval_duration": 3099838121
}

Although you can accomplish a lot with the command line and curl, my goal is to use my local service from the comfort of well-known tools.

Applications and Tools

To learn about model capabilities and master API usage, I recommend starting with the Postman Ollama REST API project. Fork it to your account and make the necessary adjustments. The cloned project has two variables, {{baseURL}} and {{model}}. Set them to the appropriate values and run a request. As expected, the request fails: the Postman web application does its best to route your requests, but it cannot reach a host on your local network, so you need either the Postman desktop application or the Postman Desktop Agent. With the agent up and running, the standard web application automatically detects its presence and reroutes all API calls through your local network.

Successful Call through the Postman Desktop Agent.

Now, I'm ready to try my local system with a challenging workload.

Some time ago, I installed the VS Code Cline extension to try different cloud providers and services. Of course, the built-in Copilot offers some room and opportunities, but Cline provides unparalleled capabilities for fine-tuning models and a full spectrum of available providers. To start using my local server:

  1. Open the Cline bot panel
  2. Click on the provider link and select Ollama from the list of providers
  3. Click on "Use custom base URL" and enter your Ollama server name and port
  4. Skip the API key and enter the name of the model that you want to use. For demo purposes, I selected the marvelous Qwen 2.5 Coder with added tools capabilities (hhao/qwen2.5-coder-tools:14b).
  5. Set the context size to 16K, as recommended for this model (a Modelfile alternative is sketched after the screenshot below)
  6. Close the configuration dialog, and your very private Cline agent is ready for work.
Set Up the Remote Ollama Provider
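If you prefer to bake the larger context window into the model itself rather than rely on the client-side setting, Ollama can do that through a Modelfile. A minimal sketch, assuming the ollama CLI still points at the remote server via OLLAMA_HOST and using a hypothetical qwen2.5-coder-16k name for the variant:

MacBook-Air-3 ~ % cat > Modelfile <<'EOF'
FROM hhao/qwen2.5-coder-tools:14b
PARAMETER num_ctx 16384
EOF
MacBook-Air-3 ~ % ollama create qwen2.5-coder-16k -f Modelfile   # builds the variant on the server
MacBook-Air-3 ~ % ollama ls                                      # the new model shows up next to the original

The variant is created on the remote server, so it is immediately available to Cline as well.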

With the newly configured provider, I tasked the agent with updating my boilerplate Python app and replacing the print output. The task was completed with no issues.

Cline Rides Ollama

Advantages and Trade-offs

It's only natural to ask myself: is it worth the bother at all? You may find your own answer to that question. For me, it's more entertainment than real use, yet there are plenty of reasons to have your very own agent. To finish this long read on a positive note, I'll start with the trade-offs.

✗ Size of the models. To run the most effective LLMs, you need gigabytes and gigabytes of expensive GPU RAM. You have to juggle quantization, context size, and parameter count to find a model that runs effectively on the available hardware.
✗ To get better results, the cost of hardware could be prohibitive, unless you find an unlimited source of cheap electricity and very inexpensive accelerators.
✗ Prices for hosted LLM calls keep dropping, and you may get better results for less money by using Cline with providers like OpenRouter.
✗ Even though you run it locally, the response time of big reasoning models will teach you patience.

And why you may want it:

✓ It is yours and yours only. No credit card information required.
✓ You can get good results even with a modest GPU accelerator; just remember that more GPU RAM is better than more system memory.
✓ Over time, a local model may be a less expensive alternative to hosted internet models, especially if you didn't pay much for the hardware.
✓ Even if your internet is down, you still have access to the model. Well, as long as your Wi-Fi is up and running.
✓ You will learn a lot about models, parameters, optimizations, and other local tools like LM Studio. (Yes, Cline supports it too.)