Running my business on local AI: six months in, what actually broke
15 June 2026
Six months ago I set up an AI system to help run my businesses locally, on hardware in my office, with the explicit constraint that no business data leaves the machine. Here is what actually happened.
Why local
The short version: my businesses handle tenant data, bank reconciliation records, and property transaction documents. Sending those through a cloud AI service means the data lives on someone else’s server, processed by a system I do not control, stored for a period I cannot fully audit.
I am not opposed to cloud AI for general use. For my specific situation, where the data contains people’s addresses, payment histories, and contract terms, local was not a philosophical preference. It was a practical requirement.
The setup
The hardware is a single machine: an RTX 3060 with 12 GB of VRAM, 64 GB of system RAM, and a fast NVMe drive for the model weights and document store. The models run quantized, which means I am trading some capability for the ability to fit them in the available VRAM.
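As a rough illustration of that trade-off, the memory needed for model weights scales with parameter count times bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only. The context cache, activations, and runtime overhead come on top, and real quantization formats carry some extra per-block metadata, so treat the numbers as a lower bound rather than a measurement of this setup.

```python
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    Ignores the context cache, activations, and quantization metadata,
    so the real footprint will be somewhat higher.
    """
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Assumption: a 7B-parameter model, chosen only for illustration.
fp16_gb = weight_memory_gb(7, 16)  # 14.0, more than a 12 GB card holds
q4_gb = weight_memory_gb(7, 4)     # 3.5, leaves headroom for context

print(f"16-bit: {fp16_gb:.1f} GB, 4-bit: {q4_gb:.1f} GB")
```

On these assumptions, the unquantized weights alone would not fit in 12 GB, while a 4-bit version leaves several gigabytes for everything else.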
For the past six months I have been running a mixture of Mistral-based models for document analysis and a smaller model for the retrieval layer that handles document search and question answering over my own files.
What worked
Document retrieval worked immediately and kept working. I can ask the system a question about a lease and it will pull the relevant clauses from across hundreds of PDFs with reasonable accuracy. This alone saves several hours a week.
Financial reconciliation assistance worked well for rule-based tasks: identifying mismatches between bank statements and accounting records, flagging transactions above certain thresholds, and generating summary reports in a format I can review quickly.
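To make "rule-based" concrete, here is a minimal sketch of the kind of check described above: matching bank lines against ledger lines and flagging large transactions. It is not the pipeline used in this setup. The tuple layout, the matching key (date plus amount), and the `reconcile` name are all assumptions chosen for illustration, and a real reconciliation would likely need fuzzier matching on dates and descriptions.

```python
from collections import Counter
from decimal import Decimal

# Each record is a hypothetical (date, amount, description) tuple.
Record = tuple[str, Decimal, str]


def reconcile(bank: list[Record], ledger: list[Record], threshold: Decimal) -> dict:
    """Compare bank and ledger entries by (date, amount).

    Counters are used so that duplicate entries on the same day
    are matched one-for-one instead of collapsing into a set.
    """
    bank_keys = Counter((date, amount) for date, amount, _ in bank)
    ledger_keys = Counter((date, amount) for date, amount, _ in ledger)

    return {
        # In the bank statement but not recorded in the books.
        "missing_from_ledger": sorted((bank_keys - ledger_keys).elements()),
        # Recorded in the books but never seen at the bank.
        "missing_from_bank": sorted((ledger_keys - bank_keys).elements()),
        # Bank transactions at or above the review threshold.
        "above_threshold": [r for r in bank if abs(r[1]) >= threshold],
    }


if __name__ == "__main__":
    bank = [
        ("2026-01-05", Decimal("1200.00"), "Rent"),
        ("2026-01-07", Decimal("-89.50"), "Plumber"),
    ]
    ledger = [
        ("2026-01-05", Decimal("1200.00"), "Rent unit 3"),
        ("2026-01-08", Decimal("-45.00"), "Supplies"),
    ]
    report = reconcile(bank, ledger, threshold=Decimal("1000"))
    for section, rows in report.items():
        print(section, rows)
```

Using `Decimal` rather than floats avoids rounding surprises when comparing currency amounts.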
The system is also faster than I expected for most tasks. Inference on the RTX 3060 is quick enough that responses feel immediate for short queries.
What broke
Three things broke in ways I did not anticipate.
First, the quantized models are noticeably weaker than the cloud equivalents for anything requiring complex reasoning across multiple documents simultaneously. Asking the system to synthesize information from five different lease agreements and identify a potential conflict takes significantly more prompt engineering than it would on a larger model.
Second, context length is a real constraint at 12 GB VRAM. For long documents, I have to chunk them, and the chunking creates edge cases where relevant information spans a chunk boundary. I have spent more time on chunking strategy than I expected.
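One common way to soften the chunk-boundary problem is to let consecutive chunks overlap, so a sentence that straddles a boundary appears whole in at least one chunk. The sketch below is a generic illustration, not the chunking strategy used here. It splits on characters for simplicity, whereas a real system would more likely count tokens and respect paragraph or clause boundaries, and the `chunk_size` and `overlap` defaults are arbitrary assumptions.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks where each chunk repeats
    the last `overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, to avoid a trailing
        # chunk that only repeats already-covered text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2):
        print(piece)
```

The cost of overlap is more chunks to index and some duplicated text in retrieval results, so the overlap size is itself something to tune.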
Third, model updates are a maintenance burden I underestimated. When a better base model comes out, upgrading the local system is not a simple download. It requires testing, adjusting the system prompts, and recalibrating the retrieval layer. I have done this three times in six months. Each time took a full day.
What I would do differently
I would have started with a narrower scope. The first mistake was trying to do too much at once. The retrieval and reconciliation use cases were ready for production within two weeks. The more complex reasoning tasks were not, and I wasted time on those early when I should have consolidated around the working use cases first.
I would also have invested earlier in a proper test set. I now have a set of document queries with known correct answers that I use to evaluate model changes before deploying them. I built this after the second model upgrade broke some edge cases I had not noticed. Building it first would have saved several hours of debugging.
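A test set like this can be very small and still catch regressions. The following is a minimal sketch under stated assumptions: `answer_fn` stands in for whatever function sends a query to the local system and returns its answer as text, and each case lists substrings the answer must contain. The names and the substring check are illustrative choices, not a description of the actual harness.

```python
from typing import Callable


def evaluate(answer_fn: Callable[[str], str], cases: list[dict]) -> dict:
    """Run every query through answer_fn and report which cases
    are missing one or more expected substrings (case-insensitive)."""
    failures = []
    for case in cases:
        answer = answer_fn(case["query"]).lower()
        missing = [s for s in case["must_contain"] if s.lower() not in answer]
        if missing:
            failures.append({"query": case["query"], "missing": missing})
    return {
        "passed": len(cases) - len(failures),
        "total": len(cases),
        "failures": failures,
    }


if __name__ == "__main__":
    # Hypothetical cases; real ones would come from known documents.
    cases = [
        {"query": "What is the notice period in lease A?", "must_contain": ["60 days"]},
        {"query": "Who pays for repairs in lease B?", "must_contain": ["landlord"]},
    ]

    def stub_answer(query: str) -> str:
        # Stand-in for the real model call.
        return "The notice period is 60 days."

    print(evaluate(stub_answer, cases))
```

Running the same cases before and after a model upgrade, and comparing the failure lists, gives a quick signal about which edge cases changed.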
The system is now genuinely useful, but in a narrower way than I originally imagined, and it is useful because I scoped it down to the things it does reliably.
Nothing here is advice on how you should build your systems. This is an account of my own experience. Every deployment is different.