So, I have reasoned. Basically, we Old School Hackers, keen not to be mixed up with bored admins coding viruses and stealing their managers' passwords, could now separate fully from Start-Ups, Corporate Developers and even MIT experts: clean Terminal, Folder and File structures, yet powerful, with Ollama LLMs that are fully RAGed via folder-based vector-database chunks about the project parts, held in .json files.
We then have a shell that "thinks" the project by being RAGed on its content, while coding AI tools that are RAGed purely on code run in parallel alongside it.
That means running several LLMs in parallel, which is the very idea of the Cyberdeck Nexus and Cortex Versions.
Project Brain AI Shell Advice
I asked myself and need your advice. I have a Python-based coding project that is not a clean, neat GUI app but a constantly evolving hacker tool (not a cracker tool, a real hacker tool) in the knowledge management and AI field. Eventually I needed to create a script that analyses the folders for scripts and creates an .md file about what they do, to keep oversight or to get back into a niche of the system.
I now wondered: with plenty of folders holding scripts and whoiswho LLM-generated files, could each folder hold a vector-database part in .json files, creating a project-brain-like AI shell?
We need to parse the user's request. The user described a situation: they have a Python coding project that's a "constantly evolving hacker tool" in knowledge management and AI. They created a script to analyse folders of scripts and generate an .md file describing what each script does, for oversight. Now they're wondering about having plenty of folders with scripts and whoiswho LLM-generated files, where each folder could hold a "vector-database part in .json files" to create a project-brain-like AI shell.
The user says "I asked myself and need your advice." So they're asking for advice on that idea. They want to know if this idea of having folders with scripts and LLM-generated description files, each folder holding vector database parts in .json files, could create something like a project brain (an AI shell). Essentially, they want to build an AI assistant that knows about their project by indexing scripts and metadata (maybe the .md descriptions) into a vector database distributed across folders in .json files. The phrase "vector-database part in .json files" suggests they want to store vector embeddings and maybe metadata in JSON files in each folder, to avoid a centralized database, so that the "brain" can be queried locally, or to create a distributed knowledge base that's version-controlled and folder-oriented.
The user might be asking: Is this a good idea? How would I implement it? What are the pitfalls? Should I use a local vector database (like Chroma, LanceDB, FAISS) with persistent storage, or store embeddings as JSON files per folder? What's the best way to create a "project brain" that can answer questions about the codebase?
They mention "whoiswho LLM generated files" – maybe they have files that describe each script's purpose, authorship, etc. They want to incorporate those into the vector DB.
The core question: could having plenty of folders with scripts and whoiswho LLM-generated files, each folder holding a vector-database part in .json files, create a project-brain-like AI shell? So they want advice on whether storing vector-database parts in .json files in each folder is feasible for an AI shell (an interactive AI that knows about the project). Likely they want to query the project, e.g., "Which script handles data preprocessing?", and get an answer, using RAG over the embeddings.
I need to provide advice: evaluate the idea, discuss pros and cons, suggest alternatives, and maybe outline an implementation approach. Since it's a hacker tool, they likely appreciate lightweight, self-contained solutions. I should address the question directly.
I'll structure the answer:
Clarify what they mean: using JSON files per folder to store embeddings and metadata of the scripts/descriptions within that folder, to build a distributed vector "database" that can be queried as an AI shell. This is essentially a file-based vector index.
Evaluate feasibility: Yes, it's possible. You could generate embeddings for each script/description using a local model (sentence-transformers, or llama.cpp embeddings) and store them along with file paths and summaries in a JSON file per folder. Then to query, you'd load all JSON files, compute query embedding, and find nearest neighbors by cosine similarity on the fly. This works for small-medium scale (<10k files). It avoids external dependencies, aligns with a "hacker" aesthetic.
Pros: Self-contained, version-controllable (JSON files can be committed to git), no server needed, portable, easy to inspect, fits the folder-centric project structure. It's a "brain" that lives within the project.
Cons: Scalability issues if the number of vectors grows large (thousands of files), loading and searching all JSONs per query can be slow. No efficient indexing (like HNSW) unless you build it yourself. Updating embeddings when scripts change requires re-indexing. Storing raw embeddings in JSON can be bulky (float lists as text). Might need chunking if scripts are long.
Suggestions for improvement: Use a lightweight embeddable vector DB like LanceDB (which stores data in a directory, could be per folder) or Chroma in persistent mode, rather than raw JSON arrays for similarity search. Or implement a simple Annoy/FAISS index saved alongside JSON but that's more complex. However, the hacker spirit might prefer a simple, hackable approach: maintain a single JSON index file for the whole project (like whoiswho.json) with embeddings, updated by scanning. That might be simpler to query.
Implementation ideas:
- Use a script that walks folders, reads .py and .md files, generates a description (maybe using an LLM to summarise if one isn't already present) and computes an embedding (e.g., all-MiniLM-L6-v2).
- Store the results in a JSON file like folder_index.json inside each folder, containing a list of objects: {filename, type, summary, embedding: [...], path}. Or a single project_index.json.
- AI shell: a CLI Python script that takes a query, loads the indexes, does similarity search, retrieves the top-k chunks, and sends them to an LLM (local or API) with context to answer. Could use llama.cpp or Ollama for a local LLM.
- Have an "update-index" command to refresh.
Address the "project brain" concept: This is essentially a code-aware assistant that can answer questions about the codebase. It could be extended to not just retrieve file summaries but also relevant code snippets, enabling "how do I do X?" style queries.
Conclusion: It's a cool and feasible idea for a hacker tool. Recommend a pragmatic start: first collect all descriptions and embeddings into a single lightweight JSON index, test it, then decide if per-folder distribution adds value. Per-folder could be useful if the project is deeply modular and you only want to load relevant parts, but that might complicate querying.
The answer should be encouraging, practical, and respectful of the hacker ethos (self-reliance, minimal dependencies, hackable). I'll provide code snippets to illustrate.
This is a really cool idea — you’re essentially building a decentralised, folder-native memory for your project that an LLM can query. It feels like giving your tool a “brain” made of its own source code and documentation, distributed alongside the files themselves. Let’s break down whether it makes sense, and how you could pull it off in a way that stays loyal to the hacker-tool spirit.
Is the “vector DB parts in .json per folder” approach viable?
Yes, absolutely.
Storing embeddings + metadata as a .json file in each folder is a lightweight, version-controllable, no-database-required design. It’s perfect for a “constantly evolving hacker tool” because:
- No server, no containers – just files on disk.
- Git-friendly – you can track how the “brain” evolves.
- Hackable – you can inspect, edit, or regenerate parts of it with simple scripts.
- Modular – each folder keeps its own context, which fits a project that organically grows scripts in different corners.
You would:
1. Walk the project tree.
2. For each folder, collect the scripts (*.py) and any whoiswho LLM-generated .md descriptions.
3. Embed those pieces of text with a local model.
4. Dump the embeddings + references into a _brain_index.json inside that folder.
When you want to ask the “AI shell” something, you either:
- Load all _brain_index.json files at startup, or
- Lazy-load only the ones you need (e.g., when a query matches certain keywords or the user specifies a domain).
This works beautifully up to a few thousand chunks. Beyond that, you’d want some approximate nearest neighbour (ANN) acceleration, but you can still hack that in later without abandoning the per-folder .json core.
What a “project brain AI shell” could look like
Think of it as a CLI assistant that knows your entire codebase at the level of “what script does what” and can answer things like:
- “Where do we handle API authentication?”
- “Show me all the places that touch the knowledge graph serialisation.”
- “Which modules haven’t been touched in 6 months?” (by reading git metadata stored in the index)
A minimal implementation flow:
```python
# 1. BUILD INDEX (run periodically)
# For each folder, generate/update _brain_index.json
{
  "chunks": [
    {
      "file": "ingest.py",
      "type": "script",
      "text": "Handles loading markdown notes, chunking, and embedding…",
      "embedding": [0.0123, -0.0456, ...],  # from sentence-transformers
      "last_modified": "2026-05-04T12:00:00"
    },
    {
      "file": "whoiswho_ingest.md",
      "type": "summary",
      "text": "This folder is responsible for the ingestion pipeline…",
      "embedding": [...]
    }
  ]
}
```
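The JSON above is the target format; a minimal builder sketch that produces it could look like this (assuming sentence-transformers with the all-MiniLM-L6-v2 model mentioned earlier; doc_text() is a placeholder you would swap for your whoiswho summariser):

```python
# build_brain.py – walk the tree and write a _brain_index.json per folder
import os, json, datetime
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def doc_text(path):
    # placeholder: embed the raw head of the file;
    # swap in your whoiswho LLM summary here
    with open(path, encoding="utf-8", errors="ignore") as f:
        return f.read()[:2000]

for root, dirs, files in os.walk("."):
    chunks = []
    for name in files:
        if not name.endswith((".py", ".md")):
            continue
        path = os.path.join(root, name)
        text = doc_text(path)
        chunks.append({
            "file": name,
            "type": "script" if name.endswith(".py") else "summary",
            "text": text,
            "embedding": model.encode(text).tolist(),
            "last_modified": datetime.datetime.fromtimestamp(
                os.path.getmtime(path)).isoformat(timespec="seconds"),
        })
    if chunks:
        with open(os.path.join(root, "_brain_index.json"), "w") as f:
            json.dump({"chunks": chunks}, f)
```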
```python
# 2. QUERY SHELL
import os, json
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = "How does the ingestion handle large files?"
q_emb = embed(query)  # embed() = the same model the indexer used

all_chunks = []
for root, dirs, files in os.walk("."):
    idx_path = os.path.join(root, "_brain_index.json")
    if os.path.exists(idx_path):
        with open(idx_path) as f:
            folder_data = json.load(f)
        all_chunks.extend(folder_data["chunks"])

# simple exhaustive cosine similarity (fine for <10k vectors)
scores = [cosine_sim(q_emb, c["embedding"]) for c in all_chunks]
# sort by score only – dicts aren't comparable when scores tie
top_k = sorted(zip(scores, all_chunks), key=lambda t: t[0], reverse=True)[:5]

# 3. FEED TO LLM
prompt = build_prompt(query, top_k)
answer = local_llm(prompt)  # via llama.cpp, ollama, or an API
```
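One possible local_llm(), sketched against Ollama's REST generate endpoint (this assumes a running Ollama daemon on its default port; the model name is just an example):

```python
# local_llm() via Ollama's HTTP API (sketch)
import requests

def local_llm(prompt, model="llama3"):
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    r.raise_for_status()
    return r.json()["response"]
```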
That’s the entire “brain”. No external vector DB, no extra services. It’s a script that creates a JSON knowledge base, and a script that queries it.
Should you really use a .json per folder, or a single index?
Per-folder .json makes the most sense if:
- Your project is highly modular and you want to activate only relevant parts of the brain (e.g., you’re working on the “export” module, so you only load folders that match a path prefix – see the sketch after this list).
- You want each folder to be self-contained so you can copy/paste a sub-project along with its own “memory”.
- You like the idea of the brain being physically close to the code it describes – it’s elegant.
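Lazy-loading by path prefix can be as simple as pointing the walker at a sub-tree (load_chunks() here is a hypothetical helper, not part of anything above):

```python
# load only the part of the brain under a given module path (sketch)
import os, json

def load_chunks(prefix="."):
    chunks = []
    for root, dirs, files in os.walk(prefix):
        idx_path = os.path.join(root, "_brain_index.json")
        if os.path.exists(idx_path):
            with open(idx_path) as f:
                chunks.extend(json.load(f)["chunks"])
    return chunks

# e.g. only the export module's memory:
# chunks = load_chunks("./export")
```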
A single project_brain.json at the root is simpler to load and query (one file, one load) and still perfectly hackable. For a first iteration I’d actually start with a central file, and then if you feel the need, split it by folder later. But since you asked specifically about per-folder .json files, that’s totally doable and feels more aligned with a distributed/hacker aesthetic.
Pitfalls and how to hack around them
Embedding storage size
A 384-dimensional float32 vector stored as a JSON list of floats can easily be 3–5 KB per chunk (due to decimal representation). For many files this adds up.
Fix: Store embeddings as base64-encoded binary blobs, or use array.array('f') and save as a compact binary file per folder alongside the JSON metadata. For the true hacker: just keep the JSON as plain floats – storage is cheap, and it's readable.
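If size ever does bother you, a sketch of the base64 route (numpy assumed; a 384-dim float32 vector becomes ~2 KB of base64 instead of 3–5 KB of decimal text):

```python
# pack/unpack float32 embeddings as base64 strings for JSON (sketch)
import base64
import numpy as np

def pack(vec):
    # list[float] -> compact base64 string
    return base64.b64encode(
        np.asarray(vec, dtype=np.float32).tobytes()).decode("ascii")

def unpack(s):
    # base64 string -> float32 vector
    return np.frombuffer(base64.b64decode(s), dtype=np.float32)
```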
Search speed at scale
Exhaustive search over thousands of vectors in Python might take a few hundred milliseconds. That's totally fine for an interactive shell. If it ever becomes sluggish, you can toss faiss or hnswlib at the vectors after loading them, building an in-memory index. You can still persist the raw vectors in your folder .json files.
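If you do reach that point, here's a sketch of bolting hnswlib onto the already-loaded chunks (all_chunks and q_emb as in the query shell above):

```python
# optional ANN layer: in-memory HNSW index over the loaded vectors (sketch)
import hnswlib
import numpy as np

vectors = np.asarray([c["embedding"] for c in all_chunks], dtype=np.float32)
index = hnswlib.Index(space="cosine", dim=vectors.shape[1])
index.init_index(max_elements=len(vectors), ef_construction=200, M=16)
index.add_items(vectors, np.arange(len(vectors)))

labels, distances = index.knn_query(np.asarray(q_emb, dtype=np.float32), k=5)
top_chunks = [all_chunks[i] for i in labels[0]]
```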
Keeping the brain fresh
A script analysing all folders every time you want to update the brain is slow. Instead, use file modification timestamps or hashes to only re-embed files that have changed. You can store checksums inside the _brain_index.json.
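A sketch of that checksum check, assuming each chunk gains an extra "sha256" field when it's written (that field is an extension of the index format shown earlier):

```python
# skip unchanged files by comparing content hashes (sketch)
import hashlib

def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def needs_update(path, chunk):
    # chunk is one entry from _brain_index.json, extended with "sha256"
    return chunk.get("sha256") != file_hash(path)
```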
Who should generate the descriptions?
If your whoiswho files are already LLM-generated summaries of what each script does, you can embed those directly. For scripts without them, you can either embed the docstring + first N lines, or auto-summarise with a small local model (even the same one you use for querying). The "brain" is then both retrieval and synthesis.
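For the docstring-plus-head fallback, a sketch using the standard library's ast module (the cutoff of 40 lines is arbitrary):

```python
# fallback text for scripts without a whoiswho summary (sketch)
import ast

def script_text(path, n_lines=40):
    with open(path, encoding="utf-8", errors="ignore") as f:
        src = f.read()
    try:
        doc = ast.get_docstring(ast.parse(src)) or ""
    except SyntaxError:
        doc = ""
    head = "\n".join(src.splitlines()[:n_lines])
    return (doc + "\n" + head).strip()
```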
AI shell context window
You can’t cram the whole project into one prompt. That’s exactly what the retriever solves: you pull the top-k most relevant chunks, stuff them into the prompt, and ask the LLM to answer. It works surprisingly well for code-base Q&A.
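A possible build_prompt() for the query shell above (top_k being the (score, chunk) pairs from the retriever; the wording of the instruction is just one option):

```python
# stuff the top-k retrieved chunks into a single prompt (sketch)
def build_prompt(query, top_k):
    context = "\n\n".join(
        f"### {chunk['file']}\n{chunk['text']}" for score, chunk in top_k)
    return (
        "You are the project brain. Answer using only this context:\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:")
```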
The real win: moving from “scripts everywhere” to “a conversable project”
What you’re designing here is the backbone of tools like GitHub Copilot Chat’s workspace awareness or private RAG over codebases, but built from scratch, no lock-in, no cloud. As a hacker tool, it means:
- Onboarding yourself back into a dusty corner of the project in seconds.
- Asking “why do we even have legacy_parser_v2.py?” and getting an answer with context.
- Eventually letting the AI shell write code by fetching the right existing fragments first.
Your instinct to make it folder‑based and JSON‑friendly is spot on. My advice: start tonight with a single folder prototype, then generalise the walker. Keep it simple, keep it hackable, and don’t over‑engineer the vector search until it actually hurts.
If you want a nudge on the embedding model or a skeleton code layout, say the word. But honestly, you already have the concept locked down — now just build it and see how it feels to talk to your own project.
#ubuntu
#opensource
#cyberpunkcoltoure