Direct answer
A serious AI studies-research pipeline harvests from four sources (PubMed via E-utilities, Europe PMC, OpenAlex, iCite), deduplicates by DOI/PMID, scores each study by RCR impact, study type, and retraction status, and enables semantic search via embeddings. Because every result is a harvested record with a real identifier, hallucinated studies (common with vanilla ChatGPT) cannot show up as search hits.
Deep dive
The 4 data sources
A complete pipeline combines four complementary sources — each on its own has blind spots:
| Source | Strength | Coverage | Blind spot |
|---|---|---|---|
| PubMed (E-utils API) | peer-reviewed medicine, MeSH indexing | ~36M | non-medical fields |
| Europe PMC | preprints (bioRxiv/medRxiv), open-access full text | ~45M | same medical focus as PubMed |
| OpenAlex | interdisciplinary, concept tagging, author IDs | ~250M | weaker quality filter |
| iCite (NIH) | Relative Citation Ratio per PMID | PMIDs only | biomedical only |
Dedup strategy: DOI is the primary key, PMID as fallback. For OpenAlex hits without DOI/PMID we use title hash + author list as a tertiary key.
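The key hierarchy described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the record fields (`doi`, `pmid`, `title`, `authors`) and all function names are assumptions, and SHA-1 is just one reasonable choice for the title hash.

```python
import hashlib
import re
import unicodedata


def normalize_doi(doi):
    """Lowercase a DOI and strip URL or 'doi:' prefixes; return None if empty."""
    if not doi:
        return None
    doi = doi.strip().lower()
    doi = re.sub(r"^(https?://(dx\.)?doi\.org/|doi:)\s*", "", doi)
    return doi or None


def title_author_hash(title, authors):
    """Tertiary key: hash of the normalized title plus the sorted author list."""
    text = unicodedata.normalize("NFKD", title).lower()
    text = re.sub(r"[^a-z0-9]+", " ", text).strip()
    names = "|".join(sorted(a.strip().lower() for a in authors))
    return hashlib.sha1(f"{text}#{names}".encode("utf-8")).hexdigest()


def dedup_key(record):
    """DOI first, PMID as fallback, title hash + authors as last resort."""
    doi = normalize_doi(record.get("doi"))
    if doi:
        return ("doi", doi)
    pmid = record.get("pmid")
    if pmid:
        return ("pmid", str(pmid).strip())
    return ("hash", title_author_hash(record.get("title", ""), record.get("authors", [])))


def deduplicate(records):
    """Keep the first record seen for each key."""
    seen = {}
    for rec in records:
        seen.setdefault(dedup_key(rec), rec)
    return list(seen.values())
```

One limitation of this sketch: a record that carries only a PMID will not merge with a copy of the same study that carries a DOI. A real pipeline would likely need an alias table that maps PMIDs to DOIs before keying.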
What happens after harvesting?
Four automated enrichments per study:
- Study-type classification: regex-based on the abstract (looking for "randomized controlled trial", "meta-analysis", "cohort study", "case report") plus validation against PubMed's structured Publication Type field.
- Retraction check: weekly reconciliation against the Retraction Watch Database. Flagged studies get a visible ⚠️ marker.
- Impact scoring: RCR score from iCite. Studies with RCR >5 get the "landmark" badge (about 7% of all studies).
- Embedding: the abstract is run through a sentence-embedding model (today: local Ollama; previously OpenAI ada-002). Result: a 768-dim vector per study, stored in a vector DB.
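As an illustration of the first and third enrichment, here is a rough sketch of regex classification with a cross-check against PubMed's Publication Type values, plus the RCR threshold for the landmark badge. The patterns, label names, and function names are assumptions made for this example. Only three Publication Type values are mapped, because PubMed has no publication type that corresponds one-to-one to "cohort study".

```python
import re

# Ordered from most to least specific; the first matching pattern wins.
STUDY_TYPE_PATTERNS = [
    ("meta-analysis", re.compile(r"\bmeta-?\s?analys[ie]s\b", re.I)),
    ("rct", re.compile(r"\brandomi[sz]ed\s+controlled\s+trial\b", re.I)),
    ("cohort", re.compile(r"\bcohort\s+study\b", re.I)),
    ("case-report", re.compile(r"\bcase\s+report\b", re.I)),
]

# PubMed Publication Type values (lowercased) mapped to the labels above.
PUBTYPE_TO_LABEL = {
    "meta-analysis": "meta-analysis",
    "randomized controlled trial": "rct",
    "case reports": "case-report",
}


def classify_abstract(abstract):
    """Return the first study-type label whose pattern matches, else None."""
    for label, pattern in STUDY_TYPE_PATTERNS:
        if pattern.search(abstract):
            return label
    return None


def classify_study(abstract, publication_types=()):
    """Return (label, validated); validated means PubMed's field backs the label."""
    guess = classify_abstract(abstract or "")
    structured = [PUBTYPE_TO_LABEL[p.lower()] for p in publication_types
                  if p.lower() in PUBTYPE_TO_LABEL]
    if guess in structured:
        return guess, True           # regex and structured field agree
    if structured:
        return structured[0], True   # trust the structured field over the regex
    return guess or "unknown", False


def is_landmark(rcr):
    """Landmark badge: RCR strictly above 5, as stated in the text."""
    return rcr is not None and rcr > 5
```

The cross-check matters because an abstract often mentions a study type it does not belong to, for example a trial that cites an earlier meta-analysis. In that case the structured field overrides the regex guess.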
Semantic vs. keyword search
Keyword search finds what you typed. Semantic search finds what you meant.
Example: with keyword search, the query "method against oxidative stress" returns only hits that contain those exact words. Semantic search (cosine similarity between the query embedding and each study embedding) also returns:
- studies on antioxidants (vitamin C, E, NAC)
- studies on glutathione synthesis
- studies on mitochondrial function
- studies on hormesis and adaptation
- studies on astaxanthin, quercetin, resveratrol
That's the difference between "I get what I already know" and "I discover what I didn't know".
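The ranking step behind semantic search can be sketched in a few lines. The three-dimensional vectors below are made-up toy values standing in for real 768-dim embeddings, and the function names are assumptions for this example. A production system would normally delegate this to the vector DB instead of scanning every study.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, study_vecs, top_k=3):
    """Return the top_k (study_id, score) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, vec), study_id)
              for study_id, vec in study_vecs.items()]
    scored.sort(reverse=True)
    return [(study_id, score) for score, study_id in scored[:top_k]]


# Toy embeddings (hypothetical values, not output of a real model).
studies = {
    "glutathione-synthesis": [0.9, 0.1, 0.0],
    "antioxidants": [0.8, 0.3, 0.1],
    "knee-surgery": [0.0, 0.1, 0.9],
}
query = [1.0, 0.2, 0.0]  # stands in for "method against oxidative stress"

for study_id, score in rank_by_similarity(query, studies):
    print(f"{study_id}: {score:.3f}")
```

With these toy vectors the two oxidative-stress topics rank far above the unrelated one, even though none of the study IDs share a word with the query.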
Where the hype lies — or the methodology is weak
Three common failures in AI studies-research tools:
- Pure LLM with no tool use. When ChatGPT cites a study without a PubMed API call, the study is potentially hallucinated. Test: ask the bot for the DOI and verify it manually on doi.org.
- Stale snapshots. Some tools refresh PubMed only quarterly. You miss fresh meta-analyses. Ask: "when was your dataset last updated?"
- Missing retraction filter. The Wakefield vaccine paper (The Lancet, 1998) was retracted in 2010, yet generative-AI models without a retraction check still cite it.
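Two of these checks are easy to script for a list of DOIs a chatbot has given you. The sketch below is only a starting point: the DOI pattern is a common heuristic rather than a formal validator, the function names are assumptions, and the retracted-DOI set is assumed to come from a retraction database export that you load yourself. A DOI that passes the syntax check can still be invented, so the generated doi.org link must actually be opened.

```python
import re
from urllib.parse import quote

# Heuristic shape of a DOI: "10.", a 4-9 digit registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_resolver_url(doi):
    """Return the doi.org URL for a syntactically plausible DOI, else None."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        return None
    return "https://doi.org/" + quote(doi, safe="/")


def split_by_retraction(cited_dois, retracted_dois):
    """Split cited DOIs into (ok, flagged) using a case-insensitive lookup."""
    retracted = {d.lower() for d in retracted_dois}
    ok, flagged = [], []
    for doi in cited_dois:
        (flagged if doi.lower() in retracted else ok).append(doi)
    return ok, flagged
```

For example, `doi_resolver_url("10.1234/example.5678")` (a placeholder DOI) yields a link to check by hand, while a string such as `"not-a-doi"` returns `None`, which is an immediate red flag.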
Methodology — how we judge this
On biohacking-ai.com/studien-karte you'll find the end product: a 3D visualisation of ~300 000 studies, cluster layout by topic embedding, filters for RCR/year/study type. The map is intentionally noindex (crawl budget), but the methodology you're reading here describes exactly what runs underneath.
For the concrete research how-to — i.e. "how do I use PubMed myself" — see the linked answer.
Sources
- Hutchins et al. 2016, Relative Citation Ratio (RCR), PMID 27599104: the methodology behind iCite's RCR score
- Priem et al. 2022, OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts, arXiv:2205.01833: background on the OpenAlex database
- Reimers & Gurevych 2019, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, arXiv:1908.10084: the architecture that makes semantic study search possible