# cultguard-agents

`cultguard-agents` is a local, TypeScript-based OSINT investigation workbench.

It is built for collecting, storing, and reviewing evidence about online influence
operations and related social-media activity, with a current focus on Facebook
pages plus manually captured evidence.

Today, the repo gives you:

- a **PostgreSQL 16 + pgvector** investigation database
- **TypeScript CLIs** for scraping, ingestion, and search
- a **Facebook scraper** that drives a real Chromium session via CDP
- **manual evidence ingestion** for screenshots, images, text, profile JSON, and HTTP captures
- **Metabase** for local dashboards and ad-hoc analysis
- a reproducible local environment via **devenv.sh**

> Important: the public project name is `cultguard-agents`, but the internal
> PostgreSQL database name is still `cultguard` for compatibility.

---

## What this project does

At a high level, the project helps you:

1. **Collect evidence**
   - scrape Facebook page metadata, posts, comments, and ads
   - ingest screenshots, images, text, HTTP captures, and profile JSON by hand

2. **Store it in a structured investigation schema**
   - investigations
   - entities
   - content
   - media
   - relationships
   - annotations
   - claims
   - LLM audit tables

3. **Query and review it**
   - full-text search from the CLI
   - entity dumps
   - raw SQL against PostgreSQL
   - Metabase dashboards/views for exploration

4. **Extend it**
   - add new Facebook extractors
   - add new ingestion paths
   - add new schema fields/views
   - finish the embedding + semantic-search pipeline

---

## Current state of the codebase

### Working now

- `fb-scrape` can connect to a running Chromium instance over CDP
- Facebook page metadata scraping is implemented
- feed scrolling and post extraction are implemented
- optional comment scraping is implemented
- ad library scraping is implemented
- manual ingestion CLIs are implemented
- PostgreSQL full-text search is implemented
- entity inspection and raw SQL querying are implemented
- the repo currently **typechecks cleanly** with `typecheck` inside `devenv shell`

### Not finished yet

- `src/embed.ts` is still a **stub**
  - it finds rows that need embeddings
  - it does **not** yet generate real vectors
- semantic query search in `src/lib/search/semantic.ts` is also a **stub**
  - use `search --type fts` today
- individual reaction persistence is **not fully wired**
  - reactions can be collected during scraping
  - but `fb-scrape` does not currently call `storeReactions()` during save

### Legacy / archival parts

- `ingestion/` contains older Python migration scripts for legacy SQLite data
- the active runtime in this repo is the **TypeScript code under `src/`**
- `data/schema.sql` is the authoritative database schema

---

## Architecture

```text
               ┌──────────────────────────────┐
               │      Chromium (logged in)    │
               │   remote debugging :9222     │
               └──────────────┬───────────────┘
                              │ CDP
                              ▼
┌──────────────┐    ┌──────────────────────────────┐
│  fb-chrome   │    │          fb-scrape           │
│ starts local │    │ page flow + network capture  │
│ Chromium     │    │ GraphQL extraction + save    │
└──────────────┘    └──────────────┬───────────────┘
                                   │
                                   ▼
                         ┌──────────────────────┐
                         │ PostgreSQL: cultguard│
                         │ investigations,      │
                         │ entities, content,   │
                         │ media, claims, etc.  │
                         └──────────┬───────────┘
                                    │
              ┌─────────────────────┼─────────────────────┐
              ▼                     ▼                     ▼
      ┌──────────────┐      ┌──────────────┐      ┌──────────────┐
      │    ingest    │      │    search    │      │   Metabase   │
      │ manual files │      │ FTS / SQL /  │      │ dashboards & │
      │ into DB      │      │ entity dump  │      │ exploration  │
      └──────────────┘      └──────────────┘      └──────────────┘
```

All local runtime state stays inside `.devenv/`, including:

- PostgreSQL socket/data
- Chromium profile used for scraping
- Metabase state
- model cache location reserved for future embedding work

---

## Quick start

### Prerequisites

- [devenv.sh](https://devenv.sh/getting-started/)
- Nix enabled on the machine
- enough disk for local services and investigation data
- a Facebook-capable Chromium session if you want to use `fb-scrape`

### First-time setup

Use two terminals.

#### Terminal 1: start services

```bash
devenv up
```

This starts the local services declared in `devenv.nix`, including PostgreSQL
and Metabase.

#### Terminal 2: enter the pure dev shell

```bash
devenv shell
```

Node dependencies are linked from the Nix store using `package-lock.json`, so no `npm install` is required.

Then initialize the database schema:

```bash
db-setup
```

Sanity-check the environment:

```bash
typecheck
db-stats
```

### Optional: project-local privacy desktop runtime

The repo now vendors the privacy-desktop infrastructure that used to live in
`~/dotfiles`, including:

- Pi `0.65.0` packaging
- a pinned privacy-browser ref, now sourced from `cultguard-chrome`
- unprivileged Mullvad via `wireproxy`
- fingerprint-guard browser extension
- Xvfb + Openbox + VNC desktop runtime
- SOPS-encrypted secrets and `.sops.yaml`

The runtime currently defaults to the browser package exposed by `cultguard-chrome`.
The local ref file keeps its historical `cultguard-chromium.ref` name for
compatibility with existing tooling.

Typical flow:

```bash
devenv shell
privacy-profile-create analyst-1
privacy-profile-attach analyst-1 1
privacy-browser-test-all
privacy-browser-up --instance 1 --wg-config /path/to/mullvad.conf
```

The `privacy-browser-up` command starts a single configurable privacy desktop by
default, rather than forcing every desktop to run in parallel.

Useful helpers:

```bash
privacy-profile-list
privacy-profile-show analyst-1
privacy-profile-attach analyst-1 1
privacy-browser-status --instance 1
privacy-browser-down --instance 1
privacy-browser-test ~/.local/state/mullvad-wg/wg-mullvad-1.conf
privacy-browser-test-all
privacy-secrets-render rustdesk-password
```

The privacy desktop runtime no longer exposes CDP. It packages the browser,
Mullvad tunnel, and remote desktop surface only.

The active path is now fully unprivileged: `wireproxy` imports a Mullvad
WireGuard config and exposes a local SOCKS5 proxy, and the packaged Chromium is
forced to use that proxy automatically.
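
Conceptually, the wiring amounts to the sketch below: a local SOCKS5 endpoint
from `wireproxy` plus a Chromium `--proxy-server` flag. This is an illustration
only; the binary name, SOCKS5 port, and profile path are assumptions, and the
packaged launcher handles all of this for you.

```typescript
import { spawn } from "node:child_process";

// Illustration only: the packaged launcher does this wiring for you.
// The SOCKS5 port must match wireproxy's configured bind address; 1080 is an assumption.
const SOCKS5 = "socks5://127.0.0.1:1080";

spawn("chromium", [
  `--proxy-server=${SOCKS5}`, // route all browser traffic through the tunnel
  "--user-data-dir=.devenv/state/privacy-stack/profiles/analyst-1", // persistent per-profile state
], { stdio: "inherit" });
```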

Browser profiles are now self-contained under `.devenv/state/privacy-stack/profiles`.
You can create named profiles and attach them to instance numbers so a given
instance reuses the same persistent Chromium state without touching host-level
browser directories.

---

## Main workflows

### 1) Scrape a Facebook page

Start Chromium with remote debugging enabled:

```bash
fb-chrome
```

Log into Facebook in that Chromium window if needed.

Then, from another shell inside the repo:

```bash
fb-scrape "https://www.facebook.com/people/لبنان-يتحرر/61585153052901/" \
  --depth 25 \
  --comments \
  --ads \
  --save \
  -i lebanon-liberates-2026
```

Useful flags:

- `--depth <n>`: max posts to collect
- `--since YYYY-MM-DD`: stop collecting once posts are older than the given date
- `--comments`: collect comment threads
- `--reactions`: collect reaction lists in memory
- `--ads`: scrape the Meta Ad Library for that page
- `--save`: persist structured results to PostgreSQL
- `-i, --investigation <id>`: target investigation ID
- `-p, --port <n>`: CDP port, default `9222`

#### What gets saved when `--save` is used

- page → `entities`
- posts/comments/ads → `content`
- page identifiers like phone/email → `identifiers`
- transparency metadata → merged into `entities.raw_json`
- downloaded media → `media`
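
To spot-check what a `--save` run actually wrote, a quick count per table
works well. A minimal sketch using `pg`, assuming each table has an
`investigation_id` column (check `data/schema.sql` for the real names):

```typescript
import { Pool } from "pg";

// Counts rows per core table for one investigation.
// Assumes an `investigation_id` column on each table; verify against data/schema.sql.
const pool = new Pool(); // picks up PGHOST/PGPORT from the devenv shell

async function saveSummary(investigationId: string) {
  for (const table of ["entities", "content", "identifiers", "media"]) {
    // `table` comes from the fixed list above, so interpolation is safe here
    const { rows } = await pool.query(
      `SELECT COUNT(*) AS n FROM ${table} WHERE investigation_id = $1`,
      [investigationId],
    );
    console.log(`${table}: ${rows[0].n}`);
  }
  await pool.end();
}

saveSummary("lebanon-liberates-2026");
```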

#### Current caveats

- `--reactions` is not fully persisted yet; the save flow does not currently wire
  reaction lists into `storeReactions()`.
- `fb-scrape` still downloads media after collection. If you want a truly pure
  dry-run mode, `src/fb-scrape.ts` is the place to tighten that behavior.
- scraper stability depends on current Facebook UI and GraphQL response shapes.

---

### 2) Manually ingest evidence

The `ingest` CLI is the fastest way to add evidence that was captured outside
of the scraper.

#### Ingest a screenshot

```bash
ingest --type screenshot \
  --file screenshots/ad_library_about.png \
  --note "Ad Library screenshot" \
  --investigation lebanon-liberates-2026
```

#### Ingest an image and attach it to an entity

```bash
ingest --type image \
  --file media/profile_pic.jpg \
  --entity-id fb_page_61585153052901 \
  --investigation lebanon-liberates-2026
```

#### Ingest raw text for an entity

```bash
ingest --type text \
  --entity-id fb_page_61585153052901 \
  --text "النص هنا" \
  --lang ar \
  --investigation lebanon-liberates-2026
```

#### Ingest a profile JSON file

```bash
ingest --type profile \
  --file raw_data/mounetbeyti_entity.json \
  --investigation lebanon-liberates-2026
```

`profile` ingestion expects a JSON file containing at least one of the
following identifier fields:

- `id`
- `uid`
- `entity_id`
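
A minimal sketch of the accepted shape as a TypeScript type; everything beyond
the three identifier fields is illustrative:

```typescript
// At least one of id / uid / entity_id must be present.
// The other fields are illustrative; the full object is stored as raw profile data.
interface ProfileJson {
  id?: string;
  uid?: string;
  entity_id?: string;
  name?: string;
  [key: string]: unknown;
}
```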

#### Ingest an HTTP capture JSON file

```bash
ingest --type http \
  --file raw_data/capture.json \
  --note "Captured GraphQL response" \
  --investigation lebanon-liberates-2026
```

For HTTP captures, large response bodies are truncated in `http_log` and the
full JSON payload is archived under `archives/`.
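
The split works roughly like the sketch below; the cutoff size and names are
assumptions, not the actual implementation in `src/lib/ingest/http.ts`:

```typescript
import { writeFileSync } from "node:fs";

// Sketch only: the cutoff and names are assumptions.
const MAX_LOGGED_BYTES = 64 * 1024;

function splitCapture(captureId: string, body: string): string {
  if (body.length <= MAX_LOGGED_BYTES) return body; // small bodies go to http_log as-is
  writeFileSync(`archives/${captureId}.json`, body); // full payload archived on disk
  return body.slice(0, MAX_LOGGED_BYTES);            // truncated copy goes to http_log
}
```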

---

### 3) Search and inspect the data

#### Database stats

```bash
search --stats
```

#### Full-text search

```bash
search "hezbollah" --type fts --limit 10
search "Google Form" --type fts
search "تعليقات مشبوهة" --type fts
```

#### Dump everything known about one entity

```bash
search --entity fb_page_61585153052901
```

#### Run raw SQL

```bash
search --sql "SELECT id, name, post_count, media_count FROM v_entity_summary LIMIT 5"
search --sql "SELECT * FROM v_network_edges ORDER BY strength DESC LIMIT 20"
```

#### Open psql directly

```bash
db-shell
```

> Use `--type fts` for real search today. Semantic search via embeddings is not
> finished yet.

---

### 4) Use Metabase for dashboards

After `devenv up`, Metabase is available at:

- `http://localhost:3100`

Recommended workflow:

1. complete the first-time Metabase setup
2. connect it to the local `cultguard` PostgreSQL database
3. explore the provided views in `data/schema.sql`

Useful views:

- `v_entity_summary`
- `v_network_edges`
- `v_content_timeline`
- `v_claims_summary`
- `v_embedding_coverage`
- `v_llm_cost_summary`

#### JDBC connection string

Use an absolute path for the socket directory. Inside the repo shell:

```bash
printf '%s\n' "$DEVENV_ROOT/.devenv/state/postgres"
```

Then use that path in a JDBC URL like:

```text
jdbc:postgresql://localhost/cultguard?host=/absolute/path/to/repo/.devenv/state/postgres&port=5433
```

Username can usually be your local system user; password is typically blank for
this local socket-based setup.

---

## Repository guide

### Active TypeScript entrypoints

- `src/fb-scrape.ts` — Facebook page scraper CLI
- `src/ingest.ts` — manual ingestion CLI
- `src/search.ts` — FTS, entity dump, stats, raw SQL
- `src/embed.ts` — embedding pipeline scaffold
- `src/db.ts` — PostgreSQL pool/transaction helpers (see the sketch after this list)
- `src/types.ts` — shared TypeScript types for the DB + Facebook objects
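
For orientation, a typical shape for such pool/transaction helpers, using `pg`
(a sketch, not the actual contents of `src/db.ts`):

```typescript
import { Pool, type PoolClient } from "pg";

// Sketch of a conventional pool + transaction helper; not the actual src/db.ts.
export const pool = new Pool(); // PGHOST/PGPORT come from the devenv shell

export async function withTransaction<T>(
  fn: (client: PoolClient) => Promise<T>,
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    const result = await fn(client);
    await client.query("COMMIT");
    return result;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```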

### Facebook scraper internals

- `src/lib/fb/cdp.ts` — connect to Chromium over CDP (see the sketch after this list)
- `src/lib/fb/network.ts` — intercept GraphQL responses
- `src/lib/fb/flows/` — page/feed/post-detail/ad-library flows
- `src/lib/fb/extract/` — response parsers/extractors
- `src/lib/fb/store.ts` — persist scraped data into PostgreSQL
- `src/lib/fb/media.ts` — download and deduplicate media
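
For a feel of the CDP side, here is a minimal sketch using the
`chrome-remote-interface` package; the actual client and filtering logic in
`cdp.ts` and `network.ts` may differ:

```typescript
import CDP from "chrome-remote-interface";

// Minimal sketch: attach to the Chromium started by `fb-chrome` and log
// GraphQL responses. The URL filter is an assumption for illustration.
async function watchGraphql(port = 9222) {
  const client = await CDP({ port });
  const { Network, Page } = client;

  Network.responseReceived(({ response }) => {
    if (response.url.includes("/api/graphql")) {
      console.log("GraphQL response:", response.status, response.url);
    }
  });

  await Network.enable();
  await Page.enable();
}

watchGraphql().catch(console.error);
```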

### Manual ingestion internals

- `src/lib/ingest/http.ts`
- `src/lib/ingest/image.ts`
- `src/lib/ingest/text.ts`
- `src/lib/ingest/profile.ts`

### Search internals

- `src/lib/search/fts.ts` (see the sketch after this list)
- `src/lib/search/entity.ts`
- `src/lib/search/semantic.ts`
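
As a rough illustration of the FTS approach (the real query, column names, and
text-search configuration in `src/lib/search/fts.ts` may differ):

```typescript
import { Pool } from "pg";

// Illustrative only: the `body` column and the 'simple' config are assumptions.
// 'simple' avoids English-specific stemming, which matters for Arabic content.
const pool = new Pool();

export async function ftsSearch(query: string, limit = 10) {
  const { rows } = await pool.query(
    `SELECT id, body,
            ts_rank(to_tsvector('simple', body), plainto_tsquery('simple', $1)) AS rank
       FROM content
      WHERE to_tsvector('simple', body) @@ plainto_tsquery('simple', $1)
      ORDER BY rank DESC
      LIMIT $2`,
    [query, limit],
  );
  return rows;
}
```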

### Schema and environment

- `data/schema.sql` — authoritative PostgreSQL schema + views
- `devenv.nix` — local services, scripts, environment variables
- `package.json` — Node scripts and dependencies

### Legacy / supporting directories

- `ingestion/` — old Python migration helpers for SQLite → PostgreSQL
- `raw_data/` — raw JSON and source captures
- `analysis/` — generated analysis notes/reports
- `screenshots/` — screenshots used as evidence
- `media/` — downloaded or ingested media files
- `archives/` — archived oversized HTTP payloads

---

## Data model summary

The schema is investigation-centric.

Key tables:

- `investigations` — top-level case namespace
- `entities` — pages, profiles, orgs, phones, domains, etc.
- `content` — posts, comments, ads, bios, captions
- `media` — images, videos, screenshots, thumbnails
- `relationships` — links between entities
- `identifiers` — phone, email, username, domain, etc.
- `annotations` — analyst/LLM notes
- `claims` — structured assertions with confidence + evidence
- `llm_runs` — audit trail for model calls
- `timeline_events` — denormalized timeline for analysis tools
- `text_embeddings` / `image_embeddings` — vector tables reserved for the future embedding pipeline

`data/schema.sql` also defines Metabase-friendly views for summary tables,
network edges, timelines, embedding coverage, and LLM cost tracking.
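
To get a feel for how the tables fit together, a query like the following
walks from one investigation to its entities and content. Column names here
are assumptions based on the table list above; `data/schema.sql` is
authoritative:

```typescript
// Illustrative join across the core tables for one investigation.
// Column names are assumptions; check data/schema.sql for the real ones.
const recentContentSql = `
  SELECT e.name, c.content_type, c.created_at
    FROM content c
    JOIN entities e ON e.id = c.entity_id
   WHERE c.investigation_id = $1
   ORDER BY c.created_at DESC
   LIMIT 20`;
```

The equivalent SQL (with a literal investigation ID in place of `$1`) can be
run via `search --sql` or from `db-shell`.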

---

## Debugging guide

### Fast sanity checks

Run these before debugging anything deeper:

```bash
devenv shell
typecheck
db-stats
search --sql "SELECT COUNT(*) FROM investigations"
```

If you just changed the schema:

```bash
db-setup
```

---

### Common failure modes

| Problem | What to check |
|---|---|
| `db-shell` / `search` cannot connect to PostgreSQL | Make sure `devenv up` is running, then enter `devenv shell`, then run `db-setup`. The shell should provide `PGHOST` and `PGPORT`. |
| `fb-scrape` cannot connect to Chromium | Start Chromium with `fb-chrome`. Confirm it is listening on port `9222`. |
| Chromium opens but scraping returns very little | Make sure the target page is actually visible in the logged-in session. Start with `--depth 5` and no extras. |
| Feed/ad/comment extraction suddenly stops working | Facebook likely changed a GraphQL shape or UI behavior. Inspect `src/lib/fb/extract/*` and `src/lib/fb/flows/*`. |
| `search --type text` or `--type image` gives placeholder output | Expected for now. Semantic query embedding is not implemented yet. Use `--type fts`. |
| `db-embed` prints stub messages | Expected. `src/embed.ts` still needs real ONNX / transformers wiring. |
| Metabase cannot connect | Use an absolute socket path in the JDBC URL and confirm PostgreSQL is up. |

---

## Where to change code when extending the system

### Add or fix scraped fields from Facebook

Touch these in order:

1. `src/lib/fb/extract/*` — parse the GraphQL/DOM payload
2. `src/types.ts` — add/update TS types
3. `src/lib/fb/store.ts` — persist into the DB
4. `data/schema.sql` — only if the DB schema must change
5. `v_*` views in `data/schema.sql` — if the new field should appear in Metabase

### Change scraper behavior

- navigation / click flow: `src/lib/fb/flows/*`
- CDP behavior: `src/lib/fb/cdp.ts`
- network capture: `src/lib/fb/network.ts`
- human-like timing/scrolling: `src/lib/fb/human.ts` (see the sketch after this list)
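
The pacing idea is simple: randomized waits between actions. A tiny sketch in
the spirit of `src/lib/fb/human.ts` (names and delay ranges are illustrative):

```typescript
// Illustrative pacing helpers; names and delay ranges are assumptions.
const jitterMs = (min: number, max: number) => min + Math.random() * (max - min);

export const humanPause = (min = 800, max = 2500) =>
  new Promise<void>((resolve) => setTimeout(resolve, jitterMs(min, max)));

// Usage: `await humanPause()` between scrolls or clicks.
```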

### Add a new ingestion mode

- CLI flags: `src/ingest.ts`
- implementation: `src/lib/ingest/*`
- schema/types if needed: `data/schema.sql`, `src/types.ts`

### Change search behavior

- FTS: `src/lib/search/fts.ts`
- entity dump: `src/lib/search/entity.ts`
- semantic search scaffold: `src/lib/search/semantic.ts`

---

## Recommended next improvements

If you are continuing development, these are the highest-value next steps:

1. **Finish embeddings** (a combined sketch for items 1 and 2 follows this list)
   - wire `@huggingface/transformers` + `onnxruntime-node` into `src/embed.ts`
   - write actual vectors into `text_embeddings` and `image_embeddings`

2. **Implement semantic query search**
   - generate query embeddings in `src/lib/search/semantic.ts`
   - search against `pgvector` indexes

3. **Persist reaction details properly**
   - call `storeReactions()` from `fb-scrape.ts`
   - decide whether to store only aggregates or individual reactors too

4. **Make scraper debugging easier**
   - add optional debug logging for GraphQL operation names
   - save representative payload fixtures for extractor regression testing

5. **Add a true dry-run mode**
   - right now, media download behavior is not fully separated from DB writes

6. **Add automated tests**
   - there is no real automated test suite yet
   - today, `typecheck` + manual smoke tests are the main guardrails
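
For items 1 and 2, the rough shape could look like the sketch below. It uses
`@huggingface/transformers` for feature extraction and a `pgvector` distance
query; the model choice, column names, and distance operator are assumptions
to check against `data/schema.sql`:

```typescript
import { pipeline } from "@huggingface/transformers";
import { Pool } from "pg";

// Sketch only: model choice, column names, and the `<=>` operator are assumptions.
const pool = new Pool();

async function embed(text: string): Promise<number[]> {
  // In real use, cache the extractor instead of rebuilding it per call.
  const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}

async function semanticSearch(query: string, limit = 5) {
  const vector = JSON.stringify(await embed(query)); // pgvector accepts '[...]' text input
  const { rows } = await pool.query(
    `SELECT content_id, embedding <=> $1::vector AS distance
       FROM text_embeddings
      ORDER BY embedding <=> $1::vector
      LIMIT $2`,
    [vector, limit],
  );
  return rows;
}
```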

---

## Useful commands reference

```bash
# environment

devenv up
devenv shell
typecheck

# database

db-setup
db-shell
db-stats
db-dump
db-restore data/cultguard-YYYYMMDD-HHMMSS.sql

# ingestion / scraping

ingest --type screenshot --file screenshots/example.png --investigation demo
fb-chrome
fb-scrape "https://www.facebook.com/people/..." --depth 10 --save -i demo

# search

search --stats
search "keyword" --type fts
search --entity fb_page_123456789
search --sql "SELECT * FROM v_entity_summary LIMIT 10"

# embeddings (currently scaffold only)

db-embed
```

---

## Notes

- This repo is designed to run **locally**, not as a hosted multi-user service.
- The operational center of gravity is the database schema in `data/schema.sql`.
- If the README and the code ever disagree, prefer the code under `src/` and the
  schema under `data/schema.sql`.
