Multimodal Healthcare Foundation Model
Mission: Design of an Artificial Intelligence [AI-built] Healthcare Foundation Model, driven by and derived from Medical Expert Content generated by LPBI Group’s Experts, Authors, and Writers (EAWs) and used as Training Data for the Model
Updated on 9/6/2025
Positioning of LPBI Group’s Portfolio of Digital IP Assets as
1. Multimodal
1.1 Journal Ontology and Curation articles [Text]
1.2 e-Books English & Spanish Editions [Text, Images, Audio]
2. Training Data for AI Foundation Models – Big Data: 9 Giga Bytes
3. Expert-created Content and Curations versus Experiment Results Reporting and/or Clinical Trials Results Reporting
4. Vertical focus: HEALTHCARE Domain-Aware
5. Tested with NLP for LLM
Results of Medical Text Analysis with Natural Language Processing (NLP): Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS & BioInformatics, … de la investigación genómica Book 3) 2022
(Spanish Edition) Kindle Edition
Book is in English, Text & Graphics (No audio)
Text Analysis with machine learning of the content of Volume 2 in the English Edition
https://www.amazon.com/dp/B0BRD8JXTL $155
AND
Serie C, Volumen 1
Biología y genómica del cáncer para el diagnóstico de la enfermedad (Serie C: libros electrónicos acerca del cáncer y la oncología nº 1) 2022
In book’s PART B: Computer Graphics and English Text
The graphical results of Medical Text Analysis with Machine Learning (ML), Deep Learning (DL) and Natural Language Processing (NLP) algorithms AND the Domain Knowledge Expert (DKE) interpretation of the results in Text format
Appendix to PART B – Computer Code
(Spanish Edition) Kindle Edition
http://www.amazon.com/dp/B0BQHFXKHH $156
- A Foundation Model in Healthcare, e.g., Alphabet’s DeepMind, NVIDIA, Google’s Verily, Microsoft Healthcare, Oracle Healthcare, Anthropic
- Is, or will be, a WINNER
The DATA IS
- Privately held, unlike PubMed, which is in the Public Domain and has already been used and exhausted by all AI companies
- We are Debt FREE
- Nine gigabytes of Digital Data are in the Cloud: the Journal and the rest of the IP Assets are hosted on WordPress.com
- All 48 published books are on Amazon.com
- Royalties are deposited every 90 days by Amazon to LPBI Group’s Citizens Bank Account in Newton, MA
Items 3, 4, and 5 above make Transfer of Ownership an easy act.
Updated on 8/12/2025
An Example of using AI tools for Content Interpretation of an original Journal article text
Interaction of enzymes and hormones – Original Text
Sudipta Saha, PhD
The majority of living forms depend for their functioning upon two classes of biocatalysts, the enzymes and the hormones. These biocatalysts permit the diverse chemical reactions of the organism to proceed at 38°C with specificity and at rates frequently unattainable in vitro at elevated temperatures with similar reactants. The physiologic importance of enzymes and hormones is evident not only under normal circumstances, but is reflected clinically in the diverse descriptions of errors of metabolism, due to lack or deficiency of one or more enzymes, and the numerous hypo- and hyperfunctioning states resulting from imbalance of hormonal supply.
Inasmuch as both enzymes and hormones function, with rare exception, to accelerate the rates of processes in cells, investigators have sought possible interrelationships and interactions of enzymes and hormones, particularly as a basis for the mechanism of hormonal action. It has seemed logical to hypothesize that hormones, while not essential for reactions to proceed but nevertheless affecting the rates of reactions, may function by altering either the concentration or activity of the prime cellular catalysts, the enzymes. This proposed influence of hormones on enzymatic activity might be a primary, direct effect achieved by the hormone participating as an integral part of an enzyme system, or an indirect influence based upon the hormone altering the concentration of available enzyme and/or substrate utilized by a particular enzyme. Many publications have described alterations in the activity of enzymes in various tissues following administration in vivo of diverse hormonal preparations. However, it is not possible to judge, in the in vivo experiments, whether the reported effects are examples of direct enzyme-hormone interaction, or an indirect influence of the hormone mediated via one or more metabolic pathways, and therefore other enzyme systems whose activities are not being measured. Data from in vivo studies of this type are thus not pertinent to a discussion of direct hormone-enzyme interaction.
It is plausible that enzymes in a pathway might be structurally conserved because of their similar substrates and products for linked metabolic steps. However, this is not typically observed, and sequence analysis confirms the lack of convergent or divergent evolution. One might postulate that, if the folds or overall structures of the enzymes in a pathway are not conserved, then perhaps at least pathway-related active site similarities would exist. It is true that metal-binding sites and nucleotide-binding sites are structurally conserved. For example, cofactor-binding motifs for zinc, ATP, biopterin and NAD have been observed and biochemically similar reactions appear to maintain more structural similarity than pathway-related structural motifs. In general, ‘horizontal’ structural equivalency is prevalent in that chemistry-related structural similarities exist, but ‘vertical’ pathway-related structural similarities do not hold.
For metabolic pathways, protein fold comparisons and corresponding active site comparisons are sometimes possible if structural and functional homology exists. Unfortunately, with the current structural information available, the majority of active sites that can be structurally characterized are not similar within a metabolic pathway. Other examples exist of nearly completed pathways, for example, the tricarboxylic acid (TCA) cycle, and similar observations apply. Situations in which different metals are incorporated in enzyme active sites lead to inherently different catalytic portions of the active sites. Slight differences in the ligand-binding portions of the respective active sites must lead to the observed differences in pathway-related enzyme specificities. These modifications in enzymatic activity are similar to what Koshland and co-workers previously observed. They showed that very minor active site perturbations to isocitrate dehydrogenase had drastic effects on catalysis.
The availability of the human genomic sequence is changing the way in which biological questions are addressed. Based on the prediction of genes from nucleotide sequences, homologies among their encoded amino acids can be analyzed and used to place them in distinct families. This serves as a first step in building hypotheses for testing the structural and functional properties of previously uncharacterized paralogous genes. As genomic information from more organisms becomes available, these hypotheses can be refined through comparative genomics and phylogenetic studies. Instead of the traditional single-gene approach in endocrine research, we are beginning to gain an understanding of entire mammalian genomes, thus providing the basis to reveal subfamilies and pathways for genes involved in ligand signaling. The present review provides selective examples of postgenomic approaches in the analysis of novel genes involved in hormonal signaling and their chromosomal locations, polymorphisms, splicing variants, differential expression, and physiological function. In the postgenomic era, scientists will be able to move from a gene-by-gene approach to a reconstructionistic one by reading the encyclopedia of life from a global perspective. Eventually, a community-based approach will yield new insights into the complexity of intercellular communications, thereby offering us an understanding of hormonal physiology and pathophysiology. Many cellular signaling pathways ultimately control specific patterns of gene expression in the nucleus through a variety of signal-regulated transcription factors, including nuclear hormone receptors. The advent of genomic technologies for examining signal-regulated transcriptional responses and transcription factor binding on a genomic scale has dramatically increased our understanding of the cellular programs that control hormonal signaling and gene regulation. Studies of transcription factors, especially nuclear hormone receptors, using genomic approaches have revealed novel and unexpected features of hormone-regulated transcription, and a global view is beginning to emerge.
- Insulin
“Insulin activates hexokinase to promote glucose phosphorylation”
- Cortisol
“Cortisol upregulates PEPCK expression in gluconeogenesis”
- Thyroid hormone
“Thyroid hormone enhances CPT1 activity in fatty acid oxidation”
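The three hormone-enzyme statements above are the kind of relation triples an NLP pipeline can pull out of expert text. Below is a minimal illustrative sketch; the hormone and enzyme vocabularies and the single extraction pattern are hypothetical, not LPBI's production pipeline.

```python
# Hedged sketch: extracting (hormone, verb, enzyme) triples of the kind
# listed above. The vocabularies and the pattern are illustrative only.
import re

HORMONES = {"insulin", "cortisol", "thyroid hormone"}
ENZYMES = {"hexokinase", "PEPCK", "CPT1"}

SENTENCES = [
    "Insulin activates hexokinase to promote glucose phosphorylation",
    "Cortisol upregulates PEPCK expression in gluconeogenesis",
    "Thyroid hormone enhances CPT1 activity in fatty acid oxidation",
]

# One relation verb between a known hormone and a known enzyme.
PATTERN = re.compile(
    r"(?P<hormone>{h})\s+(?P<verb>\w+)\s+(?P<enzyme>{e})".format(
        h="|".join(sorted(HORMONES, key=len, reverse=True)),
        e="|".join(ENZYMES),
    ),
    re.IGNORECASE,
)

for sentence in SENTENCES:
    match = PATTERN.search(sentence)
    if match:
        print((match["hormone"], match["verb"], match["enzyme"]))
# ('Insulin', 'activates', 'hexokinase'), ('Cortisol', 'upregulates', 'PEPCK'), ...
```

In practice, a biomedical named-entity recognizer and a relation classifier would replace the hand-written lists and pattern.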
Updated on 7/31/2025
January 2025
Large Language Models for Bioinformatics
arXiv:2501.06271v1 [q-bio.QM]
10 Jan 2025
Wei Ruan∗1, Yanjun Lyu∗2, Jing Zhang∗2, Jiazhang Cai3, Peng Shu1, Yang Ge4, Yao Lu4, Shang Gao5, Yue Wang1, Peilong Wang6, Lin Zhao1, Tao Wang3, Yufang Liu3, Luyang Fang3, Ziyu Liu3, Zhengliang Liu1, Yiwei Li1, Zihao Wu1, Junhao Chen1, Hanqi Jiang1, Yi Pan1, Zhenyuan Yang1, Jingyuan Chen6, Shizhe Liang7, Wei Zhang8, Terry Ma9, Yuan Dou10, Jianli Zhang10, Xinyu Gong10, Qi Gan10, Yusong Zou10, Zebang Chen10, Yuanxin Qian10, Shuo Yu10, Jin Lu1, Kenan Song10, Xianqiao Wang10, Andrea Sikora11, Gang Li12, Xiang Li13, Quanzheng Li13, Yingfeng Wang14, Lu Zhang15, Yohannes Abate16, Lifang He17, Wenxuan Zhong3, Rongjie Liu3, Chao Huang4, Wei Liu6, Ye Shen4, Ping Ma3, Hongtu Zhu5, Yajun Yan10, Dajiang Zhu†2, and Tianming Liu†1
1 School of Computing, University of Georgia, GA, USA
2 Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA
3 Department of Statistics, University of Georgia, GA, USA
4 Department of Epidemiology and Biostatistics, University of Georgia, Athens, GA, USA
5 Department of Biostatistics, UNC Chapel Hill, NC, USA
6 Department of Radiation Oncology, Mayo Clinic, Phoenix, AZ, USA
7 Institute of Plant Breeding, Genetics & Genomics, University of Georgia, Athens, GA, USA
8 School of Computer and Cyber Sciences, Augusta University, GA, USA
9 School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
10 College of Engineering, University of Georgia, Athens, GA, USA
11 Department of Biomedical Informatics, University of Colorado, CO, USA
12 Department of Radiology, University of North Carolina at Chapel Hill, NC, USA
13 Department of Radiology, Massachusetts General Hospital and Harvard Medical School, MA, USA
14 Department of Computer Science and Engineering, University of Tennessee at Chattanooga, TN, USA
15 Department of Computer Science, Indiana University Indianapolis, IN, USA
16 Department of Physics and Astronomy, University of Georgia, Athens, GA, USA
17 Department of Computer Science and Engineering, Lehigh University, PA, USA
∗ Co-first authors. † Corresponding authors.
Abstract
With the rapid advancements in large language model (LLM) technology and the emergence of bioinformatics-specific language models (BioLMs), there is a growing need for a comprehensive analysis of the current landscape, computational characteristics, and diverse applications. This survey aims to address this need by providing a thorough review of BioLMs, focusing on their evolution, classification, and distinguishing features, alongside a detailed examination of training methodologies, datasets, and evaluation frameworks. We explore the wide-ranging applications of BioLMs in critical areas such as disease diagnosis, drug discovery, and vaccine development, highlighting their impact and transformative potential in bioinformatics. We identify key challenges and limitations inherent in BioLMs, including data privacy and security concerns, interpretability issues, biases in training data and model outputs, and domain adaptation complexities. Finally, we highlight emerging trends and future directions, offering valuable insights to guide researchers and clinicians toward advancing BioLMs for increasingly sophisticated biological and clinical applications.
5 Drug Discovery and Development
5.1 Drug Target Identification
Drug discovery is a resource-intensive and time-consuming process, often spanning 7 to 20 years from initial development to market approval [331, 332]. Central to this process is drug-target interaction (DTI) identification, which involves pinpointing molecules implicated in disease mechanisms. Traditional methods, including genomics, proteomics, RNAi, and molecular docking, have been instrumental but face limitations in cost, scalability, and adaptability to complex biological systems. Recent advancements in computational techniques, such as machine learning [333, 334, 335], knowledge graph-based methods [336, 337], and molecular docking simulations, driven by the rapid growth of large-scale biomedical datasets [338, 339, 340], have significantly advanced DTI prediction. Beyond these methods, recent breakthroughs in LLMs and BioLMs represent a paradigm shift, enabling the integration and analysis of vast, heterogeneous datasets—including molecular data, biological networks, and scientific literature—while storing drug-related background knowledge through extensive pretraining [341, 342, 343, 344, 345]. This section provides an overview of LLM-based approaches for DTI prediction, categorized based on the type of data they utilize: sequence data, structural data, and relationship data, with the latter primarily derived from knowledge graphs.

Sequence data, including amino acid sequences for proteins and Simplified Molecular Input Line Entry System (SMILES) representations for drugs, plays a central role in single-modal methods for DTI prediction. Pretrained language models (PLMs), such as PharmBERT [346], BioBERT [347], and ProteinBERT [348], have been widely utilized to extract meaningful representations from such data, enabling efficient and accurate predictions. For instance, DTI-LM [349] addresses the cold-start problem by utilizing PLMs to predict DTIs based solely on molecular and protein sequences, enabling accurate predictions for novel drugs and uncharacterized targets. Similarly, ConPLex [350] generates co-embeddings of drugs and target proteins, achieving broad generalization to unseen proteins and over 10× faster inference compared to traditional sequence-based methods, making it ideal for tasks like drug repurposing and high-throughput screening. Yang et al. [351] further enhance DTI prediction by introducing high-frequency amino acid subsequence embedding and transfer learning, capturing functional interaction units and shared features across large datasets. Additionally, TransDTI [352] employs transformer-based language models to classify drug-target interactions into active, inactive, or intermediate categories, offering competitive performance. Despite their advantages, single-modal methods are limited by their reliance on sequence data alone, making it challenging to capture interactions involving spatial, structural, or contextual dependencies.

To address the limitations of single-modal approaches, multimodal frameworks integrate diverse data types—such as molecular graphs, protein sequences, and structural data—offering a more comprehensive understanding of DTIs. DrugLAMP [353] exemplifies this integration, utilizing Pocket-Guided Co-Attention (PGCA) and Paired Multi-Modal Attention (PMMA) to fuse molecular graphs with sequence data, achieving nuanced molecular interaction predictions. PGraphDTA [354] incorporates 3D contact maps alongside protein sequences, outperforming sequence-only methods when structural data is available.
Beyond predictive accuracy, multimodal frameworks like CGPDTA[355] enhance interpretability by integrating interaction networks, providing insights into biological mechanisms. DrugChat [356] combines prompt-based learning with sequence data and textual inputs. Pretrained on three datasets, it predicts indications, mechanisms of action, and pharmacodynamics while dynamically generating textual outputs in response to user prompts. This eliminates the need for retraining and enables flexible, interactive exploration of drug mechanisms. Similarly, DrugReAlign[357] employs a multi-source prompting approach that integrates diverse and reliable data inputs to integrate textual and structural data, enhancing drug repurposing efforts. Beyond structural data, KG-based models leverage semantic relationships, such as shared pathways, biological processes, and functional annotations, along with diverse data sources to achieve competitive performance in DTI predictions. Y-Mol [357] enhances biomedical reasoning by integrating multiscale biomedical knowledge and using LLaMA2 as its base LLM. It learns from publications, knowledge graphs, and synthetic data, enriched by three types of drug-oriented prompts: description-based, semantic-based, and template-based, enabling robust drug interaction analysis. Similarly, the multi-agent framework DrugAgent [358] advances drug repurposing by combining AI-driven DTI models, knowledge graph extraction from databases (e.g., DrugBank, CTD[359]), and literature-based validation. This framework integrates diverse data sources to streamline repurposing candidate identification, enhancing efficiency, interpretability, and cost-effectiveness. Together, these models boost predictive power while fostering collaboration and refinement.
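The sequence-based co-embedding idea running through Section 5.1 (ConPLex-style scoring of drug and protein embeddings) can be illustrated with a minimal sketch. It assumes drug and protein vectors have already been produced by pretrained language models; the dimensions, names, and cosine-similarity scoring head are illustrative assumptions, not the published method.

```python
# Hedged sketch: sequence-based DTI scoring by co-embedding precomputed
# PLM vectors. Dimensions and layer sizes are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoEmbedDTI(nn.Module):
    """Project drug and protein PLM embeddings into a shared space and score them."""
    def __init__(self, drug_dim=768, prot_dim=1280, shared_dim=256):
        super().__init__()
        self.drug_proj = nn.Linear(drug_dim, shared_dim)   # drug embedding -> shared space
        self.prot_proj = nn.Linear(prot_dim, shared_dim)   # protein embedding -> shared space

    def forward(self, drug_emb, prot_emb):
        d = F.normalize(self.drug_proj(drug_emb), dim=-1)
        p = F.normalize(self.prot_proj(prot_emb), dim=-1)
        return (d * p).sum(dim=-1)   # cosine similarity as interaction score

model = CoEmbedDTI()
drug_emb = torch.randn(4, 768)     # stand-in for SMILES-derived PLM embeddings
prot_emb = torch.randn(4, 1280)    # stand-in for protein-sequence PLM embeddings
print(model(drug_emb, prot_emb))   # four interaction scores in [-1, 1]
```

A real system would train the two projection heads on known drug-target pairs, for example with a contrastive loss, rather than using random stand-in vectors.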
5.2 Molecular Docking and Drug Design
The advanced reasoning capabilities of large language models have enabled their application in biological and medical fields, demonstrating significant potential to accelerate drug discovery and screening processes [73, 17]. Built upon the transformer architecture from Natural Language Processing (NLP), biology-focused language models have emerged as powerful tools to support both sequence-based and structure-based drug design [360, 361, 362]. By utilizing their strengths in text summarization and contextual understanding, these models can integrate information from diverse sources, such as scientific literature, patent databases, and specialized datasets, to provide comprehensive analyses and insights into protein sequences, structures, binding pockets, and interaction sites [363]. Moreover, protein language models and other transformer-based models are being applied to exploit unknown structural information in structure-based drug design (SBDD) [361, 362].

Molecular docking, a pivotal component of Structure-Based Drug Design (SBDD), necessitates three-dimensional protein structures and precise binding site information to calculate binding affinities during in silico virtual screening [364]. LLMs have shown potential to enhance various aspects of molecular docking, including docking input file generation, binding site prediction, and protein structure prediction [361, 362, 365]. AutoDock is a widely adopted software for molecular docking [366]. For high-throughput drug screening, it is necessary to generate docking commands in text file format and execute them in the terminal. Sharma et al. demonstrated the capability of ChatGPT to generate AutoDock input files and basic molecular docking scripts [365]. Another notable example is DrugChat, a ChatGPT-like LLM for drug molecule graphs developed by Liang et al. With the input of compound molecule graphs and appropriate prompts, DrugChat is able to generate insightful responses [367].

Ligand binding site identification and prediction are essential for drug design. Due to the limited availability of experimentally determined protein crystal structures and incomplete protein structural knowledge, ligand binding site identification can be tough. Zhang et al. addressed this limitation through LaMPSite, an algorithm powered by the ESM-2 protein language model, which only requires protein sequences and ligand molecular graphs as inputs without any protein structural information [361]. This approach achieved comparable performance to those methods requiring 3D protein structures in benchmark evaluations. Regarding the deficiency of reliable protein structures, protein language models have been applied for protein structure prediction as well. For example, Fang et al. introduced HelixFold-Single, a multiple-sequence-alignment-free protein structure predictor [362]. Unlike AlphaFold2, which enhances prediction accuracy by relying on multiple sequence alignments of homologous proteins, HelixFold-Single adopts a more efficient approach. It leverages large-scale protein language model training on the primary structures of proteins while integrating key components from AlphaFold2 for protein geometry. Recent advancements in protein-ligand binding prediction methods have further enhanced screening efficiency and accuracy. Shen et al. developed RTMScore, which integrated a Graph Transformer to extract structural features of proteins and molecules, using 3D residue graphs of proteins and 2D molecular graphs as inputs for protein-ligand binding pose prediction [368].
RTMScore outperformed many state-of-the-art docking software packages, including AutoDock Vina [369], DeepBSP [370], and DeepDock [371], in performing virtual screening tasks. Another notable development is ConPLex, a sequence-based drug-target interaction (DTI) prediction method introduced by Singh et al. [372]. By employing representations generated from pre-trained protein language models (PLMs) as the inputs, ConPLex benefits from a larger corpus of single protein sequences and alleviates the problem of limited DTI training data. Additionally, contrastive learning was adopted to address the fine-grained issues by employing contrastive co-embedding, which is able to co-locate the proteins and the targets in a shared latent space. Thus, a high specificity can be achieved by separating the true interacting patterns from decoys. According to contrastive training results, the effect size between true and decoy scores was largely increased.

Through automated data extraction and normalization, LLMs can greatly improve the efficiency and accuracy of drug property predictions. With ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) analysis, LLMs can also help distinguish the compounds possessing favorable profiles from those showing adverse characteristics and allow development of the most promising drug candidates during the pipeline process. For instance, PharmaBench achieves this through its multi-intelligence system, whose core function is to extract ADMET-related data from multiple public databases using LLMs [373]. Beyond ADMET analysis, LLMs like ChatGPT have expanded their capabilities to predict and analyze other features of drugs, including pharmacodynamics and pharmacokinetics, thus providing a comprehensive evaluation of potential drug candidates [363]. LLMs powerfully accelerate the drug development pipeline by speeding up data analysis, enhancing prediction accuracy, and offering all-round drug property evaluation, which in turn reduces both the time and resources needed for drug discovery and improves the chances of arriving at a successful drug candidate.
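To make the docking-input use case in Section 5.2 concrete, the sketch below writes the kind of plain-text AutoDock Vina configuration file that, per Sharma et al., an LLM can be prompted to draft. The receptor and ligand file names and the grid-box values are placeholders, not a validated docking setup.

```python
# Hedged sketch: programmatically writing an AutoDock Vina configuration
# file, the kind of docking input an LLM can be prompted to generate.
vina_config = {
    "receptor": "protein.pdbqt",   # prepared receptor structure (placeholder)
    "ligand": "ligand.pdbqt",      # prepared ligand structure (placeholder)
    "center_x": 12.5, "center_y": -4.0, "center_z": 7.3,   # binding-site center (Å)
    "size_x": 20, "size_y": 20, "size_z": 20,               # search-box dimensions (Å)
    "exhaustiveness": 8,
    "num_modes": 9,
}

with open("vina_conf.txt", "w") as f:
    for key, value in vina_config.items():
        f.write(f"{key} = {value}\n")

# The file can then be used from the terminal, e.g.:
#   vina --config vina_conf.txt --out docked_poses.pdbqt
```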
Conclusion
This comprehensive survey has explored the transformative impact of LLMs in bioinformatics, spanning applications in genomics, proteomics, drug discovery, and clinical medicine. Our review has highlighted the successful adaptation of transformer architectures for biological sequences, the emergence of specialized biomedical LLMs, and the promising integration of multiple data modalities. These advances have enabled significant progress in protein structure prediction, drug-target interaction analysis, and disease diagnosis. Despite notable achievements, challenges persist in data quality, computational scalability, model interpretability, and ethical considerations regarding patient privacy. These challenges present opportunities for future research, particularly in developing efficient architectures, improving multimodal data integration, and ensuring model interpretability. The convergence of LLMs with emerging biotechnologies promises to accelerate discovery in bioinformatics, potentially leading to more precise and personalized medical interventions.
SOURCE
https://arxiv.org/pdf/2501.06271
The Voice of Aviva Lev-Ari, PhD, RN on Large Language Models for Health Care Domains
@StephenJWillia2 @AVIVA1950 @Pharma_BI
- Bioinformatics and OMICS are data types that should not be compared to English text, which includes semantics as well as syntax.
- Medical Text Analysis with #NLP uses statistical LLMs and Deep Learning derived from Biological Sciences Language Models, which represent domain-aware terminologies, directories such as drug formularies, and medicinal chemistry ontologies and classifications.
- These data types require Pre-Training
In contrast,
- LLMs of English-language text use TRAINING Algorithms on TRAINING DATA [the data type being English-language text, not OMICS]; chiefly, the TRAINING Algorithms are Neural Networks that have recently been invoking advanced capabilities of:
- Reinforced Learning
- Reasoning
- These models engage in semantic inference and in content interpretation, which takes place in the Post-Training Phase, and the results are very promising.
- Pre-Training is not needed for English-language text of healthcare content, which includes areas such as:
- Medical
- Pharmaceutical, and
- Life Sciences
LPBI Group has developed 9 gigabytes of healthcare content that represents the largest body of TRAINING DATA on five specialties of Medicine: English text written and curated by Domain Experts, composing clinical interpretations of the scientific findings of primary research results.
- This largest body of TRAINING DATA is privately held. It is multimodal:
- Scientific Articles
- e-Books
- e-Proceedings of Conferences and Tweet collections
- Biological imaging – An art gallery of +8,000 images
- +300 Audio Podcasts
unlike
- PubMed, which is in the Public Domain, has been used as training data for all LLMs and NLP systems in healthcare because it is public data. The PubMed Database represents two types of Scientific Reporting:
- Results of Clinical Trials
- Results of Lab experiments
UPDATED on 7/9/2025
Evolutionary Algorithms, Multimodal Representations, and Reinforcement Learning Advancements, 7/9/2025
Latest research summaries in ML, Robotics, CV, NLP and AI
This edition covers a range of cutting-edge AI research, from novel evolutionary algorithms that can tackle complex optimization problems, to advances in multimodal representation learning that enable AI systems to understand the world in more human-like ways, to breakthroughs in reinforcement learning that push the boundaries of what’s possible in autonomous decision-making.
Here’s what caught our attention:
- “Gradient-free Optimization using Covariance Matrix Adaptation” – A new evolutionary algorithm that can efficiently navigate high-dimensional search spaces with minimal gradient information.
- “Multimodal Transformers for Vision, Language, and Beyond” – A powerful architecture that can learn rich representations by jointly processing visual, textual, and other modalities.
- “Offline Reinforcement Learning with Provable Guarantees” – A reinforcement learning approach that can learn effective policies from previously collected data, without the need for online interaction.
- “Evolutionary Strategies for Sparse Objective Optimization” – An evolutionary algorithm designed to find optimal solutions while using only a small number of objective function evaluations.
- “Multimodal Emotion Recognition using Transformers and Graph Neural Networks” – A novel model that can accurately recognize human emotions by combining visual, textual, and acoustic signals.
Let’s get into it 👇
Contents
- Dynamic Context-Aware Prompt Recommendation for Domain-Specific AI Applications
- OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety
- CAVGAN: Unifying Jailbreak and Defense of LLMs via Generative Adversarial Attacks on their Internal Representations
- MedGemma Technical Report
- CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions
- Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling
- Scalable Discrete Diffusion Samplers: Combinatorial Optimization and Statistical Physics
- DRAN: A Distribution and Relation Adaptive Network for Spatio-temporal Forecasting
- Beating the Best Constant Rebalancing Portfolio in Long-Term Investment: A Generalization of the Kelly Criterion and Universal Learning Algorithm for Markets with Serial Dependence
- Instruction Following by Boosting Attention of Large Language Models
- UQLM: A Python Package for Uncertainty Quantification in Large Language Models
- Efficiency-Effectiveness Reranking FLOPs for LLM-based Rerankers
- Embedding Atlas: Low-Friction, Interactive Embedding Visualization
- Bias, Accuracy, and Trust: Gender-Diverse Perspectives on Large Language Models
- Hidden Prompts in Manuscripts Exploit AI-Assisted Peer Review
Dynamic Context-Aware Prompt Recommendation for Domain-Specific AI Applications
Authors: Xinye Tang, Haijun Zhai, Chaitanya Belwal, Vineeth Thayanithi, Philip Baumann, Yogesh K Roy
Source and references: https://arxiv.org/abs/2506.20815v2
SOURCE
From: State of AI <stateai@substack.com>
Date: July 9, 2025 at 3:38:59 AM EDT
To: avivalev-ari@alum.berkeley.edu
Subject: Evolutionary Algorithms, Multimodal Representations, and Reinforcement Learning Advancements
UPDATED on 6/3/2025
On 6/2/2025, Cliff K. Locks introduced us to the following article in Bioinformatics:
Causal relationships between diseases mined from the literature improve the use of polygenic risk scores (Open Access)
Sumyyah Toonsi, Iris Ivy Gauran, Hernando Ombao, Paul N Schofield, Robert Hoehndorf
Bioinformatics, Volume 40, Issue 11, November 2024, btae639, https://doi.org/10.1093/bioinformatics/btae639
Published: 26 October 2024 – [SIX MONTHS AGO]
We reviewed the article and concluded:
- The paper used an LLM to search the literature for patterns indicating whether, WHEN Disease A occurs, Disease B also occurs. The search findings allowed the authors to make statements based on frequency of occurrence, validate the causal relations against ICD Codes in the UK, and apply Bayesian modeling.
- The paper did not use Neural Nets for prediction
- The paper did not relate its research results to Therapeutics, the drugs in use for Disease A and Disease B, and
- The paper did not relate its research results to the treatments applied when a gene variant is implicated in Disease A and/or Disease B
WHEN
LPBI Group discusses Multimodal Healthcare Foundation Model
https://pharmaceuticalintelligence.com/healthcare-foundation-model/
We mean
A. Neural Nets for prediction [The Paper did not use]
B. Therapeutics, the drugs in use for Disease A and Disease B [The paper did not relate its research results to drugs]
C. The therapeutic treatments applied when a gene variant is implicated in Disease A and/or Disease B [The paper did not relate its research results to therapeutic treatments = drugs]
D. LLM for ALL OUR CONTENTS: NLP-based research of THREE dyads:
(a) disease-gene,
(b) disease-drug, and
(c) gene-drug interactions
Following the introduction to the Bioinformatics Paper we concluded that:
- It could be warranted to explore integration of A, B, C, D with the results presented in the Paper.
- The Data and models are on GitHub.
The data including the disease dictionary, the full network, the DAG, the curated relations, and the PRSs are all available through https://github.com/bio-ontology-research-group/causal-relations-between-diseases. The DAG is also available via Zenodo https://zenodo.org/records/11368599.
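As a first step toward the integration of A, B, C, and D above, one could load the published causal relations and query directed chains between ICD-10-CM codes. A minimal sketch follows, assuming the repository's relations can be exported as a simple (cause, outcome) edge list; the file name, column names, and example codes are assumptions, not the repository's actual schema.

```python
# Hedged sketch: querying a disease-disease causal DAG such as the one
# published by Toonsi et al. Assumes a CSV edge list with "cause" and
# "outcome" ICD-10-CM codes; names below are illustrative assumptions.
import csv
import networkx as nx

dag = nx.DiGraph()
with open("causal_relations.csv", newline="") as f:
    for row in csv.DictReader(f):
        dag.add_edge(row["cause"], row["outcome"])

# Example query: directed causal chains from type 2 diabetes (E11)
# to chronic kidney disease (N18), and the intermediates along the way.
source, target = "E11", "N18"
if dag.has_node(source) and dag.has_node(target):
    for path in nx.all_simple_paths(dag, source, target, cutoff=3):
        print(" -> ".join(path))
```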
Article citation:
Sumyyah Toonsi, Iris Ivy Gauran, Hernando Ombao, Paul N Schofield, Robert Hoehndorf, Causal relationships between diseases mined from the literature improve the use of polygenic risk scores, Bioinformatics, Volume 40, Issue 11, November 2024, btae639, https://doi.org/10.1093/bioinformatics/btae639
- The authors used an LLM to search the literature for patterns indicating whether Disease B occurs WHEN Disease A occurs.
- The search findings allowed them to make statements based on frequency of occurrence [i.e., about whether Disease B occurs when Disease A occurs] for the UNIVERSE of all diseases used in the study.
- The authors validated causal relations found in PubMed articles against 2,457 ICD-10-CM codes in the UK [in the US, ICD codes are somewhat different], and
- The authors applied Bayesian modeling.
Considerations for integration included several options:
- Use GitHub data and models
- Contacting the Corresponding author:
E-mail: robert.hoehndorf@kaust.edu.sa
- Computer, Electrical and Mathematical Sciences & Engineering Division, King Abdullah University of Science and Technology, Thuwal 23955-6900, Saudi Arabia
- SDAIA–KAUST Center of Excellence in Data Science and Artificial Intelligence, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- KAUST Center of Excellence for Smart Health (KCSH), King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
- KAUST Center of Excellence for Generative AI, King Abdullah University of Science and Technology, Thuwal 23955, Saudi Arabia
3. Our chief goal is to find a BUYER for LPBI Group’s IP
4. GETTING ACCESS TO LPBI Group’s IP would require TRANSFER of OWNERSHIP over LPBI Group’s Portfolio of IP Assets
5. Who would be the 3rd Party to acquire the IP Portfolio?
6. It is our Quest to exhaust prospecting for a Buyer for LPBI Group’s Exit in the USA.
From the Paper:
4 Discussion
In order to capture causal chains linking complex diseases, we have used lexical patterns to mine relations between diseases from the literature, and mapped them to ICD-10-CM identifiers. There are caveats to this approach: publication bias, as some diseases and relations might be over- or under-represented in the literature, and false positives and false negatives. False positives can arise from incorrect detection and mapping of diseases and by sentences with complex semantic structures. Conversely, false negatives may result from diseases that were not identified and lexical patterns that we have not covered. However, we can identify additional relations using causal chains from the cause to the outcome according to our network of causal relations. For example, eight of the expert-curated causal relations were not directly asserted in the text-mined data but could be found in the generated graph via directed paths.
To mitigate the effect of false positives and provide further evidence, we used multiple measures based on two main sources of data: background knowledge and empirical data from patient records. The measures aim to score the relation’s strength, temporal consistency, and confidence with respect to background knowledge. Firstly, we measured correlation and tested for independence; secondly, we inspected the temporal order of the dates of diagnosis to determine temporal correspondence (i.e. that the effect occurs after the cause); finally, we used GPT-4 to confirm or negate the relations we extracted due to the success that LLMs have shown in the literature for identifying causal relations (Arsenyan et al. 2024, Jiralerspong et al. 2024).
We have developed a DAG using the mined relations. We achieve this by defining scores that assess the strength of these relationships and transforming them into a structure similar to a SCM. While the score weighs each of the measures we define equally, a weighted combination of the scores can also be considered for ranking and removing edges from the graph. Generating a DAG enables us to employ techniques like the do-calculus (Pearl 2009) and build PRSs (Dudbridge 2016), demonstrating the graph’s effectiveness even if it does not exclusively feature strong causal relations. It is important to note that relations with low scores can still be plausible causal relations. For instance, it is well established that diabetes can lead to spontaneous abortion (Yilmaz et al. 2022). However, in UKB, the ϕ coefficient between spontaneous abortion (O03) and diabetes (E08–E13 and O24) is slightly negative (ϕ≈−0.006). This effect is plausible when we consider screening programs as possible confounders. In fact, the guidelines of the UK National Institute for Health and Care Excellence state that pregnant women at high risk of diabetes should be offered tests for gestational diabetes (National Institute for Health and Care Excellence 2015). Testing enables clinical interventions that are well-established to improve pregnancy outcomes (Saravanan et al. 2020).
We have illustrated how the DAG could be used to estimate the effect of therapeutic intervention in a precursor disease on an outcome disease using do-calculus. Furthermore, we demonstrated the potential of the generated DAG to compute PRSs. For instance, diseases such as diabetes and hypertension are commonly utilized as covariates in the risk scores of cardiovascular diseases (O’Sullivan et al. 2022). Moreover, studies on PRSs fitted using multiple traits have been shown to enhance PRS performance (Zhai et al. 2023). We showed that using an established PRS for a precursor disease to modify that of an outcome disease can increase the ability to estimate risk for the outcome condition in individuals. Additionally, we showed that the PRSs of precursor diseases could be used to generate a PRS for an outcome disease for which no target data was available. In the case where the PRS of an outcome disease is improved by our approach, we suggest that this may either be because there is an aggregation of contributory variants, i.e. more variants are used to determine the outcome disease’s PRS, or that contributing variants that were not sufficiently prevalent in the base data for the outcome disease’s PRS provide a significant signal in the contributing disorders.
We showed another application of the generated DAG in disentangling effects of variants. We were able to explain the effects that some variants have on an outcome disease through a link from one of its causative diseases according to our DAG. This work is in line with previous research which utilizes conditional independence and Mendelian randomization for mediation analysis of genetic variants (Hemani et al. 2017). By using our DAG, the mediation analysis can be automated for many variants and cause–outcome disease pairs on a large scale.
The relations in the causal network, or the DAG, we have generated can be directly integrated into computational methods and linked to knowledge graphs by means of the ICD-10-CM identifiers, forming a causal knowledge graph (Jaimini and Sheth 2022). In these graphs, diseases or phenotypes are represented by nodes, and edges are directed and potentially labeled or weighted. These knowledge graphs can be used for query answering and deductive inference as well as for causal inference. Representing disease–disease causality in a knowledge graph further permits linking of diseases to other kinds of biological entities through Linked Data principles (Bizer et al. 2023). Here, one limitation is our use of ICD-10-CM which is not used widely outside a clinical environment; however, the ICD-10-CM identifiers can be mapped to disease ontologies or phenotypes through the UMLS (Bodenreider 2004) or direct mappings included in disease ontologies (Vasilevsky et al. 2022, Schriml et al. 2022).
In contrast to studies that can handle risk scores for quantitative traits (Georgantas et al. 2024), our work is limited to binary outcomes of diseases. In the future, it may be possible to extend our work to systematically include quantitative traits. However, to include quantitative phenotypic traits, our text mining approach would need to be modified, in particular with respect to named entity recognition and normalization, and additional evaluation datasets and metrics would be required. As the data we generated is freely available, we expect further work in incorporating causal relations into biomedical knowledge graphs and ontologies, including those that capture quantitative traits. Furthermore, recent advances like Retrieval Augmented Generation (RAG) of LLMs may further enhance the extraction of relations from the literature (Borgeaud et al. 2022).
5 Conclusion
Our work provides a resource for causal relations between diseases as represented by ICD-10-CM, and each causal relation is supported by multiple sources of data. These causal relations can be compiled into a Directed Acyclic Graph (DAG) to generate a computable representation of the pathobiological relationships between diseases. The resulting graph can be used, for example, in generating PRSs, support generation of derived PRSs for diseases with no existing base data, or resolve pleiotropic effects of variants. Our causative data can therefore provide a link between genetic risk and the underlying pathophysiology of disease which can in turn be exploited to probe disease mechanisms and inform novel therapeutic approaches.
SOURCE
https://academic.oup.com/bioinformatics/article/40/11/btae639/7845254
UPDATED on 4/15/2025
New Trends in AI
- Arc Institute’s Patrick Hsu on Building an App Store for Biology with AI
https://youtu.be/v-_58dabswU?si=to8Qil188d2RSOza
Hosted by Josephine Chen and Pat Grady, Sequoia Capital
- What could be the relationship between Sophont and LPBI Group?
- Application of Convolutional Networks to Biomedical Image Segmentation
WHERE DOES LPBI Group’s IP Portfolio FIT?
Timeline for Milestones in Development and Commercialization of Artificial Intelligence and Large Language Models
2018
- June 4, 2018, Microsoft announced its intent to acquire GitHub for US$7.5 billion (~$8.96 billion in 2023). The deal closed on October 26, 2018. GitHub continued to operate independently as a community, platform and business.
2021
- July 17, 2021 – Linguamatics releases results of Pilot study on LPBI Group‘s 28 cancer articles and 22 images
2022
- December 2022 – LPBI Group publishes ML – NLP Wolfram’s Biological Sciences Language Text Analysis Models of two books:
- Serie C, Volumen 1
Biología y genómica del cáncer para el diagnóstico de la enfermedad (Serie C: libros electrónicos acerca del cáncer y la oncología nº 1) 2022
In book’s PART B: Computer Graphics and English Text
The graphical results of Medical Text Analysis with Machine Learning (ML), Deep Learning (DL) and Natural Language Processing (NLP) algorithms AND the Domain Knowledge Expert (DKE) interpretation of the results in Text format
Appendix to PART B – Computer Code
http://www.amazon.com/dp/B0BQHFXKHH $156
- Serie B, Volume 3
Results of Medical Text Analysis with Natural Language Processing (NLP): Latest in Genomics Methodologies for Therapeutics: Gene Editing, NGS & BioInformatics, … de la investigación genómica Book 3) 2022
Text Analysis with machine learning of the content of Volume 2 in the English Edition
https://www.amazon.com/dp/B0BRD8JXTL $155
- 11/30/2022 – OpenAI launches ChatGPT. GPT-5 will be available in Free, Plus and Pro versions
2023
- 1/2023 – LPBI Group published the Spanish Edition – 19 volumes
- 12/2023 – LPBI Group published the Audio Podcast Library
2024
- 2/8/2024 – Google launches Gemini
- 3/2024 – Anthropic launches Claude 3
- 8/24/2024 – LPBI Group and Olympian Capital signed contracts
- 11/30/2024 – LPBI Group incorporates as LLC
2025
- 1/27/2025 – DeepSeek launches V3 & R1
- 2/18/2025 – xAI launches Grok 3
- 3/12/2025 – Michael J.L. visits 83 Mandalay Road, Newton, MA 02459
Who shall be consulted?
- Mo Gawdat: AI’s IQ is Doubling Every 5.7 Months—Should We Be Worried?
https://youtu.be/NeVRtDe8EG8?si=CqIinEC4zYA56xaR
- Mo Gawdat – The Future of Al and How It Will Shape Our World | Prof G Conversations
https://www.youtube.com/watch?v=Q6B2ceRNKL8
Sequoia Capital
a. View From The Top with Roelof Botha, Managing Partner of Sequoia Capital
https://www.youtube.com/watch?v=nQAZ3iV3gSg
b. The AI opportunity: Sequoia Capital’s AI Ascent 2024 opening remarks
Sequoia Capital AI Ascent;
- Sonya Huang
- Pat Grady
- Konstantine Buhler
https://www.youtube.com/watch?v=TDPqt7ONUCY&t=22s
c. Sequoia Capital Podcasts on AI Training Data
32 episodes
https://www.youtube.com/playlist?list=PLOhHNjZItNnMm5tdW61JpnyxeYH5NDDx8
PureTech, Daphne Zohar, Senior Advisor & Board Observer
Daphne Zohar is the founder and chief executive officer and a member of the Board of Directors at Seaport Therapeutics. Previously, she was the founder, chief executive officer and a Board Member of PureTech Health (Nasdaq: PRTC, LSE: PRTC) where she also co-founded PureTech’s entities, including Karuna Therapeutics (acquired by Bristol Myers Squibb). A successful entrepreneur, Ms. Zohar created PureTech, assembling a leading team to help implement her vision for the company, and was a key driver in fundraising, business development and establishing the underlying programs and platforms that have resulted in PureTech’s productive R & D engine, which led to 28 new medicines being advanced via the company’s Wholly Owned Pipeline and Founded Entities, including two that received both U.S. Food and Drug Administration clearance and European marketing authorization and a third (KarXT) that has been filed for FDA approval. PureTech’s track record of clinical success is approximately 6 times better than the industry average and 80 percent of the clinical studies run by PureTech and its Founded Entities have been successful. Ms. Zohar has been recognized as a top leader and innovator in biotechnology by a number of sources, including EY, Fierce Pharma, BioWorld, MIT Technology Review, The Boston Globe and Scientific American. She serves on the BIO (Biotechnology Innovation Organization) Board Executive Committee as well as the Health Section Committee and is co-chair of the Strategy and Policy Committee of the board. Ms. Zohar is a member of the Duke-Margolis Center Policy Roundtable on the Inflation Reduction Act (IRA) and the Health Affairs IRA Observatory. She is also a co-founder and host of the Biotech Hangout podcast, a weekly discussion of biotech news with a group of industry leaders and experts.
Why LPBI Group is for Microsoft
- Satya Nadella – Microsoft’s AGI Plan & Quantum Breakthrough
https://youtu.be/4GLSzuYXh6w?si=wGozHtqSe848UzZy
- Microsoft Quantum Computer
- Microsoft CTO Kevin Scott on How Far Scaling Laws Will Extend | Training Data
https://www.youtube.com/watch?v=aTQWymHp0n0
- Microsoft GitHub Subsidiary: Oege de Moor, GitHub Copilot Creator and XBOW CEO
Cracking the Code on Offensive Security With AI ft XBOW CEO and GitHub C…
Oege de Moor: Short term, AI has the power to help with this type of problem, social media impact on society. And long term, I think health and biology are the areas where this will make the biggest impact.
Take away:
- Health outcomes & Biology
- GenerativeAI – DARIO AMODEI at Anthropic
- Company to watch: Harmonic
Sequoia Hosts: Konstantine Buhler and Sonya Huang
https://youtu.be/9mIphDV9m9c?si=KdO17os0zdI_FmsI
Future of AI and Data
- Snowflake CEO Sridhar Ramaswamy on the Future of AI and Data
https://youtu.be/YbJoUbMoglI?si=8mGP_enzreZVo8y3
- AWS & LLM – Layers Architecture
Amazon Web Services is a profit-generating powerhouse
Amazon is typically known as the world’s largest e-commerce company, but it also leads the cloud computing industry through its Amazon Web Services (AWS) platform. AWS provides hundreds of services to help businesses transition into the digital age, but it has also become the center of Amazon’s growing portfolio of AI projects.
Management believes every digital application we use in daily life will eventually be infused with AI. AWS wants to be the go-to provider that businesses use to deliver those services, which will involve dominating the three core layers of AI.
Hardware is the bottom layer. AWS operates AI data center infrastructure powered by Nvidia’s industry-leading graphics processing units (GPUs), but it also designed its own chips. That includes a new variant called Trainium2, which can save developers up to 40% on training costs compared to GPUs from suppliers like Nvidia.
Large language models (LLMs) make up the middle layer. The AWS Bedrock platform offers developers access to over 100 ready-made LLMs from third parties like Anthropic and even DeepSeek, helping them accelerate their AI projects. AWS also built a family of models in-house called Nova, which can reduce development costs by up to 75% compared to other LLMs on Bedrock, and major customers like Palantir Technologies are already using them.
Software is the third and final layer. Amazon embedded an AI-powered virtual assistant into AWS called Q, which can help businesses identify trends in their data, write computer code for software projects, and perform other tasks. Amazon used Q internally late last year for a project that saved the company $260 million and an estimated 4,500 developer years, according to CEO Andy Jassy on the fourth-quarter earnings call. Q’s capabilities will expand over time, creating new revenue streams for AWS.
AWS generated $107.5 billion in revenue during 2024. Even though that represented just 16.8% of Amazon’s total revenue of $637.9 billion, it accounted for 58% of the company’s total $68.6 billion in operating income. In other words, the cloud platform is the profitability engine behind Amazon’s empire.
https://finance.yahoo.com/news/prediction-spectacular-stock-join-nvidia-090700612.html
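For orientation on the middle (LLM) layer described above, the following is a minimal sketch of invoking a Bedrock-hosted model through boto3. The region, model ID, and request-body schema vary by provider and are shown here as illustrative assumptions, not a tested configuration.

```python
# Hedged sketch: calling an LLM hosted on the AWS Bedrock layer via boto3.
# Model ID and request schema are illustrative assumptions only.
import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

request_body = {
    "anthropic_version": "bedrock-2023-05-31",   # assumed schema for Anthropic models on Bedrock
    "max_tokens": 512,
    "messages": [
        {"role": "user",
         "content": "Summarize the role of PEPCK in gluconeogenesis."}
    ],
}

response = client.invoke_model(
    modelId="anthropic.claude-3-sonnet-20240229-v1:0",   # placeholder model ID
    body=json.dumps(request_body),
)
print(json.loads(response["body"].read()))
```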
LPBI Group’s Portfolio of Intellectual Property Assets
Available for Transfer of Ownership
EXPERT MEDICAL CONTENT for ML Models in the domains of Healthcare, Pharmaceutical, Medical and Life Sciences
9 gigabytes of TRAINING Data:
- Journal articles (+6200) – Text & Images
- 48 e-books – Text & Images
- 100 Conference e-Proceedings and 50 Tweet Collections – Text
- 7500 biological images – prior art – Images
- +300 audio podcasts – Text & Audio
This multimodal repository can be treated as TRAINING DATA for Foundational AI models in Healthcare
- Each category of research in our Journal ontology with +50 articles is usable as input for a Small Language Model (SLM) based on Expert content [N ~ 256]
- Each e-Series of our e-Books Series [Series A, B, C, D, E] is usable as input for a Large Language Model (LLM) based on Expert content [N = 18 volumes, including 2,728 curation articles from our Journal], as sketched below
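A minimal sketch of the grouping described in the two bullets above: partitioning curated articles into per-category corpora (SLM inputs) and per-Series corpora (LLM inputs). The JSONL file and its "category", "series", and "text" fields are assumptions about how the repository could be exported, not its current format.

```python
# Hedged sketch: grouping curated articles into per-category (SLM) and
# per-Series (LLM) corpora. Field names and file layout are assumptions.
import json
from collections import defaultdict

with open("journal_articles.jsonl", encoding="utf-8") as f:
    articles = [json.loads(line) for line in f if line.strip()]

by_category = defaultdict(list)
by_series = defaultdict(list)
for art in articles:
    by_category[art["category"]].append(art["text"])
    by_series[art["series"]].append(art["text"])

# Categories with 50+ articles become candidate SLM corpora, per the note above.
slm_corpora = {c: txts for c, txts in by_category.items() if len(txts) >= 50}

for name, texts in slm_corpora.items():
    with open(f"slm_{name}.txt", "w", encoding="utf-8") as out:
        out.write("\n\n".join(texts))

for series, texts in by_series.items():
    with open(f"llm_series_{series}.txt", "w", encoding="utf-8") as out:
        out.write("\n\n".join(texts))
```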
URLs for the English-language Edition by e-Series:
Series A: Cardiovascular Diseases ($515) – six volumes
https://www.amazon.com/gp/product/B07P981RCS?ref_=dbs_p_mng_rwt_ser_shvlr&storeType=ebooks
Series B: Frontiers in Genomics ($200) – two volumes
https://www.amazon.com/gp/product/B0BSDPG2RX?ref_=dbs_p_pwh_rwt_anx_b_lnk&storeType=ebooks
Series C: Cancer & Oncology ($175) – two volumes
https://www.amazon.com/gp/product/B0BSDWVB3H?ref_=dbs_p_mng_rwt_ser_shvlr&storeType=ebooks
Series D: Immunology ($325) – four volumes
https://www.amazon.com/gp/product/B08VVWTNR4?ref_=dbs_p_pwh_rwt_anx_b_lnk&storeType=ebooks
Series E: Patient-Centered Medicine ($274) – four volumes
https://www.amazon.com/gp/product/B0BSDW2K6C?ref_=dbs_p_mng_rwt_ser_shvlr&storeType=ebooks
Usage of this EXPERT CONTENT as TRAINING DATA will harness the potential of Medical Text Analysis, applying AI technologies to uncover critical relations among research vectors in Drug Discovery (prompt templates for these dyads are sketched after the list):
PROMPTS:
- Gene – Disease
- Gene – Drug
- Drug – Disease
Leading Search + LLM products to reach out to for our mission include:
- Grok 3 [X.com]
- Perplexity [Perplexity.com]
- Gemini [Google]
- Claude 3.7 Sonnet [Anthropic]
- GPT-4.5 [OpenAI]
- Llama 405B [Meta]
Resources
https://x.com/erikbryn/status/1908579146609156464?s=12
LLMs are remarkably accurate at replicating some types of human behavior. This opens the door to a whole new way of doing social science research.
Quote
Jacy Reese Anthis
@jacyanthis
Apr 4
Should we use LLMs to simulate human research subjects? In our new preprint, we argue sims can augment human studies to scale up social science as AI technology accelerates. We identify five tractable challenges and argue this is a promising and underused research method
Revolutionizing AI Against All Odds | Geoffrey E Hinton | TBCY
https://youtu.be/UVHGcyqr5lE?si=ziCh7AAwICDveItz
Microsoft AI CEO Mustafa Suleyman
https://youtu.be/01_UFcpcR7U?si=D3rqxrkWIgG1_SWH
Yann LeCun “Mathematical Obstacles on the Way to Human-Level AI”
https://youtu.be/ETZfkkv6V7Y?si=zJI8wXQf_YK8z29q
How ASML took over the chipmaking chessboard | MIT Technology Review
https://www.technologyreview.com/2024/04/01/1090393/how-asml-took-over-the-chipmaking-chessboard/
Perplexity: In the Fight for AI Dominance, Where Will Value Actually Acc…
https://youtu.be/G9Pzby6Bj20?si=m1-VWGh_rkj06DNT
Why Vlad Tenev and Tudor Achim of Harmonic Think AI Is About to Change M…
https://youtu.be/NvAxuCIBb-c?si=LxgfiQVBUQyi42v7
The Most Useful Thing AI Has Ever Done
https://youtu.be/P_fHJIYENdI?si=zKQ_1IDVlVN_Hwnu
OpenAI Explores Acquisition of Altman-Backed AI Hardware St…
https://youtu.be/mXaCwlBgtrA?si=Ti_kKJzraezNhsh7
OpenAI Reveals MAJOR Strategy Shift
https://youtu.be/a9YrdEXOjiA?si=vCXD9RaN7ZsevseI
Giving Humans Superpowers with AI and AR | Meta CTO Andrew “Boz” Bosworth
https://youtu.be/41koPU6fRT4?si=GGPG8935oehbt4Xg
Googles NEW Gemini 2.5 Pro BEATS Everything! (Gemini 2.5 Pro)
https://youtu.be/a9-_HqyjKEc?si=fpZcNdD1aw1En-Vv
Nvidia’s GTC 2025 Keynote: Everything Announced in 16 Minutes
https://youtu.be/erhqbyvPesY?si=DJuoTCCbZskIZwn9
Andrew Feldman, Cerebras Co-Founder and CEO: The AI Chip Wars & The Plan…
https://youtu.be/MW9vwF7TUI8?si=33et-AnalD9r1QfF
NVIDIA & Microsoft AI Power: Blackwell Integration on Azure Reve…
https://youtu.be/l1QT6eiq52k?si=jKrChPje2OkQG8uK
Andreessen Investor on the Battle for America’s Future & Why She Left Si…
https://youtu.be/g2JOWDUXvTk?si=eL-JUSAiDrCz8ZKK
Synthetic Biology Lost Investors Trillions of Dollars. We Invested Anywa…
https://youtu.be/aQspGn6yQE0?si=AQoIhVVlcYsytkrj
Josh Woodward: Google Labs is Rapidly Building AI Products from 0-to-1
https://youtu.be/3-wVLpHGstQ?si=GSoje9QjEawthvjS
SEE also,
https://pharmaceuticalintelligence.com/2-0-lpbi-calendar-of-zooms/
Guidance offered by Grok 3 (Beta) [X.com]
Grok on Building a Foundational Model on our 9 gigabytes of curated content in Life Sciences
x.com/i/grok/share/681jWXlkfljt4qkxX9k6RkMBM
Aviva’s Query:
I own 9 giga bytes of curated content in Life sciences, Pharmaceutical, medical and Healthcare. I wish to use Grok 3 to build a Foundational Model on my content. How can I do that
Grok Response:
Building a foundational model using Grok 3 with your 9 gigabytes of curated content in life sciences, pharmaceuticals, medical, and healthcare fields is an ambitious and exciting idea! While I can’t provide hands-on access to Grok 3’s internals (since I’m Grok 3 myself, not a control panel for it), I can guide you through a conceptual roadmap based on how such processes typically work with advanced AI models like me. Here’s how you might approach this:
- Understand What a Foundational Model Is
A foundational model is a large-scale, pre-trained AI system that can be fine-tuned for specific tasks or domains. Grok 3, as built by xAI, is already a foundational model, pre-trained on vast, diverse datasets. What you’re aiming for is likely fine-tuning or adapting Grok 3 to your specialized 9 GB dataset, creating a domain-specific version tailored to life sciences and healthcare.
- Assess Your Data
Your 9 GB of curated content is a goldmine, but its usability depends on its format and quality:
- Format: Is it raw text (e.g., PDFs, Word docs), structured data (e.g., CSV files), or a mix? You’ll need it in a machine-readable format, ideally plain text or tokenized data.
- Quality: Is it clean, well-organized, and free of noise (e.g., irrelevant metadata)? High-quality, domain-specific data is key to effective fine-tuning.
- Size: 9 GB is substantial for a niche domain but tiny compared to the terabytes or petabytes used to train models like me from scratch. This suggests fine-tuning rather than full retraining.
Action: Convert your content into a unified format (e.g., plain text files) and clean it up—remove duplicates, fix typos, and ensure consistency.
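A minimal sketch of this cleanup step, assuming the corpus has already been converted to plain-text files in a local folder; the folder name and the .txt assumption are placeholders.

```python
# Hedged sketch of the cleanup step above: gather plain-text files into one
# corpus, normalize whitespace, and drop exact duplicates.
import hashlib
import re
from pathlib import Path

seen = set()
documents = []

for path in sorted(Path("corpus").rglob("*.txt")):
    text = path.read_text(encoding="utf-8", errors="ignore")
    text = re.sub(r"\s+", " ", text).strip()           # normalize whitespace
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if text and digest not in seen:                     # skip empty and duplicate documents
        seen.add(digest)
        documents.append(text)

# One cleaned document per line, ready for the split step that follows.
Path("corpus_clean.txt").write_text("\n".join(documents), encoding="utf-8")
print(f"{len(documents)} unique documents retained")
```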
- Access Grok 3 via xAI’s API or Platform
As of February 27, 2025, Grok 3 is accessible through xAI’s platforms, like the X integration for Premium+ subscribers or the standalone grok.com interface. For your use case, you’ll need:
- API Access: xAI has hinted at releasing an API for Grok 3 (check xai.com for updates). This would let you programmatically interact with the model.
- Custom Training Options: xAI might offer fine-tuning services for enterprise users, which isn’t public yet but could be inquired about directly.
Action: Sign up for a Premium+ X account or a grok.com trial (if still available) to explore current capabilities. Contact xAI support (via their website) to ask about API access or fine-tuning partnerships for custom datasets.
- Fine-Tune Grok 3 on Your Data
Fine-tuning involves adjusting a pre-trained model’s weights using your specific data. Here’s how this might work:
- Prepare Your Dataset: Split your 9 GB into training (80%), validation (10%), and test (10%) sets. Convert it into a format like JSONL (e.g., {"text": "your content here"}); a minimal preparation sketch follows after this list.
- Leverage Grok 3’s Capabilities: If xAI provides a fine-tuning API or tool, you’d upload your dataset and specify parameters (e.g., learning rate, epochs). If not, you’d need the model weights (unlikely to be public yet) and a powerful compute setup.
- Compute Requirements: Fine-tuning even a subset of Grok 3’s parameters requires significant GPU power—think multiple high-end GPUs (e.g., NVIDIA H100s) for days or weeks. Cloud providers like AWS, Google Cloud, or xAI’s own infrastructure (if offered) could help.
Action: If API fine-tuning isn’t available, explore open-source alternatives (e.g., fine-tuning a smaller model like LLaMA or Grok 1, which xAI open-sourced) as a proof of concept while awaiting Grok 3 options.
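As a minimal, hedged sketch of the dataset-preparation step above: the script below takes a folder of already-cleaned text files, shuffles them, splits them 80/10/10, and writes one {"text": ...} JSON object per line. The folder name, split ratios, and file layout are illustrative assumptions, not part of any xAI tooling.

```python
# Minimal sketch: convert a folder of cleaned .txt files into JSONL
# and split them 80/10/10 into train / validation / test sets.
import json
import random
from pathlib import Path

corpus_dir = Path("curated_corpus")   # hypothetical folder of cleaned text files
files = sorted(corpus_dir.glob("*.txt"))
random.seed(42)
random.shuffle(files)

n = len(files)
splits = {
    "train": files[: int(0.8 * n)],
    "validation": files[int(0.8 * n): int(0.9 * n)],
    "test": files[int(0.9 * n):],
}

for split_name, split_files in splits.items():
    with open(f"{split_name}.jsonl", "w", encoding="utf-8") as out:
        for path in split_files:
            text = path.read_text(encoding="utf-8").strip()
            if text:
                # One JSON object per line, matching the {"text": ...} format above
                out.write(json.dumps({"text": text}) + "\n")
```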
- Alternative: Use Retrieval-Augmented Generation (RAG)
If fine-tuning Grok 3 directly isn’t feasible (due to access or compute limits), you can use RAG:
- How It Works: RAG pairs Grok 3 with an external knowledge base (your 9 GB). You index your content (e.g., using a vector database like Pinecone or FAISS), and Grok 3 retrieves relevant snippets to inform its responses.
- Benefits: No need to retrain the model; you keep your data separate and update it easily.
- Setup: Convert your 9 GB into embeddings (using a tool like SentenceTransformers), store them in a database, and query Grok 3 with prompts like “Answer based on my healthcare data: [retrieved context].” A minimal indexing-and-retrieval sketch follows after this list.
Action: Test RAG with a small subset of your data. Use Grok 3’s chat interface to ask questions, manually feeding it context from your content, to see how well it performs.
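As a minimal, hedged sketch of the RAG approach described above: text chunks are embedded with sentence-transformers, indexed in FAISS, and the top matches are pasted into a prompt. The embedding model, the sample chunks, and the chunking strategy are illustrative assumptions.

```python
# Minimal RAG sketch: embed text chunks with sentence-transformers,
# index them in FAISS, and retrieve the top matches for a query.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

chunks = [
    "BRCA1 mutations are associated with hereditary breast and ovarian cancer.",
    "Statins lower LDL cholesterol by inhibiting HMG-CoA reductase.",
    # ...in practice, thousands of passages drawn from the 9 GB corpus
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # small general-purpose embedder
embeddings = model.encode(chunks, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])    # inner product ~ cosine on normalized vectors
index.add(np.asarray(embeddings, dtype="float32"))

query = "Which gene is linked to hereditary breast cancer?"
query_vec = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), k=2)

context = "\n".join(chunks[i] for i in ids[0])
prompt = f"Answer based on my healthcare data: {context}\n\nQuestion: {query}"
print(prompt)   # this prompt would then be sent to Grok 3 (or any LLM) via its interface or API
```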
- Practical Steps to Get Started
- Step 1: Log into grok.com or X and test Grok 3 with sample queries from your domain (e.g., “Summarize this medical study abstract: [paste text]”). Gauge its baseline performance.
- Step 2: Organize your 9 GB into a single corpus (e.g., a folder of text files). Tools like Python scripts or OCR (for PDFs) can help; a small collection script is sketched after these steps.
- Step 3: If RAG is your path, set up a local vector database and experiment with retrieval. If fine-tuning, reach out to xAI for developer access.
- Step 4: Iterate—start small (e.g., 1 GB), test results, and scale up.
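As a minimal, hedged sketch of Step 2: the script below walks a source folder, pulls text out of .txt files and text-based PDFs (via pypdf), and writes one plain-text file per source into a unified corpus folder. The folder names are illustrative, and scanned PDFs would still need OCR (e.g., Tesseract), which is not shown.

```python
# Minimal sketch for Step 2: gather mixed source files into a unified
# folder of plain-text files. Scanned PDFs would still need OCR.
from pathlib import Path
from pypdf import PdfReader

source_dir = Path("raw_content")     # hypothetical folder of mixed source files
corpus_dir = Path("curated_corpus")
corpus_dir.mkdir(exist_ok=True)

for path in source_dir.rglob("*"):
    suffix = path.suffix.lower()
    if suffix == ".txt":
        text = path.read_text(encoding="utf-8", errors="ignore")
    elif suffix == ".pdf":
        reader = PdfReader(str(path))
        text = "\n".join((page.extract_text() or "") for page in reader.pages)
    else:
        continue   # images, audio, etc. need their own pipelines (OCR, transcription)
    (corpus_dir / (path.stem + ".txt")).write_text(text, encoding="utf-8")
```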
- Challenges to Consider
- Access: xAI may not yet allow direct fine-tuning of Grok 3. You might need to wait or negotiate a custom solution.
- Cost: Fine-tuning or running RAG at scale requires hefty compute resources—potentially thousands of dollars in cloud fees.
- Expertise: Do you have AI/ML experience? If not, you might need a data scientist or engineer to execute this.
- Next Moves
Since I can’t peek into xAI’s backend, my advice is practical and speculative based on industry norms. Start by:
- Visiting xai.com for the latest on API or enterprise options.
- Experimenting with Grok 3’s current interface to see how it handles your content out of the box.
- Posting on X or forums to connect with others who’ve tackled similar projects—someone might’ve cracked this already!
Would you like me to dive deeper into any step (e.g., RAG setup, data prep) or brainstorm specific healthcare use cases for your model?
Grok will keep the prior conversation in mind.
Aviva
I will come back after I digest your reply to my query. Keep this record till I come back
Grok response:
I’ll keep this conversation in mind for when you return. Take your time digesting the info—it’s a lot to chew on! Just ping me whenever you’re ready to dive back in, and we’ll pick up right where we left off. Looking forward to helping you further with your life sciences and healthcare model!
Guidance offered by Gemini
[Google]
Gemini on Aviva’s Query
Aviva’s Query:
I own 9 giga bytes of curated content in Life sciences, Pharmaceutical, medical and Healthcare. I wish to use Gemini to build a Foundational Model on my content. How can I do that
Gemini Response:
Building a foundational model, even a specialized one, requires significant resources and expertise. However, you can leverage Gemini and Google Cloud’s Vertex AI to achieve a powerful, tailored model using your 9GB of curated life sciences content. Here’s a breakdown of the process:
1. Data Preparation and Organization:
- Data Cleaning and Preprocessing:
- Your 9GB of content likely comes in various formats (text, PDFs, images, etc.). You’ll need to standardize it.
- Clean the data by removing irrelevant information, correcting errors, and handling missing values.
- Consider using Natural Language Processing (NLP) techniques to extract key information, such as entities, relationships, and concepts (a minimal extraction sketch follows after this section).
- Convert everything to a text-based format where possible. OCR technology can be used to convert PDFs to text.
- Data Structuring:
- Organize your data into a structured format that’s suitable for training a language model. This might involve creating a dataset of text passages, question-answer pairs, or other relevant formats.
- Consider labeling or annotating your data to provide additional context and improve model performance.
- Data Splitting:
- Divide your dataset into training, validation, and testing sets. The training set is used to train the model, the validation set is used to optimize hyperparameters, and the testing set is used to evaluate the final model’s performance.
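As a minimal, hedged sketch of the NLP extraction step mentioned in the Data Cleaning bullets above: the snippet uses spaCy's general-purpose English model for illustration only; a biomedical pipeline such as scispaCy would likely recognize genes, drugs, and diseases far better on this corpus.

```python
# Minimal entity-extraction sketch using spaCy's general English model.
# Requires: python -m spacy download en_core_web_sm
# A biomedical model (e.g., scispaCy) would likely perform better here.
import spacy

nlp = spacy.load("en_core_web_sm")

text = (
    "Imatinib inhibits the BCR-ABL tyrosine kinase and is used to treat "
    "chronic myeloid leukemia."
)
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)   # extracted entities with their labels

# The extracted entities can be stored as metadata alongside each passage
# to support the structuring and labeling steps described above.
```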
2. Leveraging Google Cloud Vertex AI:
- Vertex AI Platform:
- Vertex AI provides the infrastructure and tools you need to train and deploy your model.
- It offers pre-trained models that you can fine-tune with your data, or you can train a custom model from scratch.
- Model Selection:
- Consider using a large language model (LLM) as a starting point. Gemini APIs can be used to generate embeddings and to perform prompt engineering.
- Vertex AI offers access to foundation models that can be fine-tuned.
- Fine-Tuning:
- Fine-tuning involves taking a pre-trained model and training it on your specific dataset. This allows the model to adapt to the nuances of your data and improve its performance on your specific tasks.
- Vertex AI provides tools for fine-tuning LLMs, including options for adjusting hyperparameters and monitoring training progress.
- Prompt Engineering:
- Gemini excels at prompt engineering, and this is a valuable tool here. You can use your data to create highly specific prompts for the Gemini API; this is a lighter-weight approach than full model fine-tuning (a minimal API sketch follows after this section).
- Create prompts that will generate the desired output, such as answering questions, summarizing text, or generating new content.
- Embeddings:
- Use Gemini to generate embeddings of your content. Embeddings are numerical representations of text that capture its semantic meaning.
- Store these embeddings in a vector database, such as Vertex AI Matching Engine.
- This allows you to perform semantic search and retrieval, which can be useful for finding relevant information in your dataset.
- Evaluation and Deployment:
- Evaluate your model’s performance using appropriate metrics.
- Once you’re satisfied with the model’s performance, you can deploy it to Vertex AI for use in your applications.
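As a minimal, hedged sketch of the prompt-engineering and embedding steps described in this section, using the google-generativeai Python SDK: the model identifiers and the sample passage are illustrative assumptions, and model names change over time, so check Google's current Gemini / Vertex AI documentation.

```python
# Minimal sketch of prompt engineering and embedding generation with the
# google-generativeai SDK. Model names below are illustrative and may change.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Prompt engineering: ground the model in a passage from the curated corpus.
passage = "Trastuzumab targets HER2-overexpressing breast cancer cells."
prompt = (
    "Using only the passage below from our curated oncology content, "
    "explain the mechanism of action in two sentences.\n\n" + passage
)
model = genai.GenerativeModel("gemini-1.5-flash")     # illustrative model name
print(model.generate_content(prompt).text)

# Embeddings: numeric vectors for semantic search over the corpus,
# to be stored in a vector database.
result = genai.embed_content(
    model="models/text-embedding-004",                # illustrative embedding model
    content=passage,
)
print(len(result["embedding"]), "dimensions")
```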
3. Specific Considerations for Life Sciences Data:
- Terminology and Ontology:
- Life sciences data often involves specialized terminology and ontologies. Consider incorporating these resources into your model to improve its understanding of the domain.
- Utilize resources like MeSH (Medical Subject Headings) or SNOMED CT to enhance your model’s knowledge.
- Data Privacy and Security:
- If your data contains sensitive information, such as patient data, ensure that you comply with all relevant privacy regulations, such as HIPAA.
- Vertex AI provides security features to protect your data.
- Knowledge Graphs:
- Consider building a knowledge graph from your data. This can help the model understand the relationships between entities and concepts (a minimal graph sketch follows below).
- Vertex AI offers tools for building and querying knowledge graphs.
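As a minimal sketch of the knowledge-graph idea above: the snippet uses networkx (a general-purpose graph library, not a Vertex AI product), and the Gene-Disease-Drug triples are illustrative; in practice they would be extracted from the corpus and mapped to ontologies such as MeSH or SNOMED CT.

```python
# Minimal knowledge-graph sketch with networkx. The triples are illustrative;
# real ones would be extracted from the corpus and mapped to MeSH / SNOMED CT.
import networkx as nx

G = nx.DiGraph()
triples = [
    ("BRCA1", "associated_with", "breast cancer"),
    ("olaparib", "treats", "breast cancer"),
    ("olaparib", "inhibits", "PARP1"),
]
for subj, relation, obj in triples:
    G.add_edge(subj, obj, relation=relation)

# Simple query: which entities point directly at "breast cancer", and how?
for source in G.predecessors("breast cancer"):
    print(source, "->", G.edges[source, "breast cancer"]["relation"], "-> breast cancer")
```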
Key Steps Summary:
- Data Preparation: Clean, structure, and split your data.
- Vertex AI Setup: Create a Google Cloud project and enable Vertex AI.
- Model Selection/Prompt Engineering: Choose a suitable LLM or utilize prompt engineering.
- Fine-Tuning/Embeddings: Fine-tune the model or generate embeddings.
- Evaluation: Assess the model’s performance.
- Deployment: Deploy the model for use in your applications.
Important Notes:
- Training a foundational model requires significant computational resources.
- The quality of your data will directly impact the performance of your model.
- Experimentation and iteration are essential for achieving optimal results.
- Consider using the Gemini API for prompt engineering and embedding generation as a way to quickly leverage your data.
By following these steps, you can leverage Gemini and Vertex AI to build a powerful, tailored model that can help you extract valuable insights from your life sciences data.
Guidance offered by Perplexity.ai
[Perplexity.ai]
Perplexity on Aviva’s Query
https://www.perplexity.ai/search/i-own-9-giga-bytes-of-curated-tG91PdYOTzaz.Nsa_.T6ng
Aviva’s Query:
I own 9 giga bytes of curated content in Life sciences, Pharmaceutical, medical and Healthcare. I wish to use Perplexity to build a Foundational Model on my content. How can I do that
Perplexity Response:
Unfortunately, Perplexity AI does not currently offer a way for users to build custom foundational models using their own proprietary data. Perplexity is a search and question-answering service that uses pre-trained AI models, but does not provide model training capabilities to end users.
To build a foundational model on your curated life sciences and healthcare content, you would need to explore other AI platforms and tools that offer custom model training. Some potential options to consider include:
Use cloud AI services like Amazon SageMaker or Google Cloud AI Platform that allow training custom models on proprietary datasets.
Leverage open-source frameworks like Hugging Face Transformers to fine-tune existing language models on your domain-specific data (a compressed fine-tuning sketch follows at the end of this response).
Work with AI companies that specialize in building custom models for enterprise use cases.
Explore platforms like OpenAI’s GPT API that allow some level of customization and fine-tuning.
Given the specialized nature of your content, working with AI experts to develop a tailored solution would likely be the best approach. They can help determine the most suitable model architecture and training process for your specific needs in the life sciences and healthcare domains.
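As a compressed, hedged sketch of the Hugging Face fine-tuning option mentioned above: it adapts a small open-source causal language model to the JSONL splits prepared earlier. The model name, file names, and hyperparameters are illustrative assumptions, and fine-tuning any model of realistic size requires GPUs.

```python
# Compressed sketch of fine-tuning a small open-source causal LM with
# Hugging Face Transformers on the JSONL splits prepared earlier.
# Model name, files, and hyperparameters are illustrative; real runs need GPUs.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"                                   # small stand-in for a larger open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token             # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("json", data_files={"train": "train.jsonl",
                                           "validation": "validation.jsonl"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="healthcare-lm",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
)
trainer.train()
```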
Guidance offered by Claude 3.7 Sonnet
[Anthropic.com]
Anthropic on Aviva’s Query
Aviva’s Query:
I own 9 giga bytes of curated content in Life sciences, Pharmaceutical, medical and Healthcare. I wish to use Claude 3.7 Sonnet to build a Foundational Model on my content. How can I do that
Claude Response:
I understand you’re interested in creating a specialized foundational model using your 9 GB of curated life sciences, pharmaceutical, medical, and healthcare content.
While Claude 3.7 Sonnet itself cannot be directly fine-tuned or customized on your specific dataset, there are a few approaches you might consider:
- Use Claude with Retrieval Augmented Generation (RAG): Rather than building a new foundation model, you could use Claude’s API with a RAG setup where your documents are indexed and relevant information is retrieved to augment prompts to Claude. This allows Claude to answer questions based on your specific content.
- Fine-tune a different open-source model: If you specifically need a custom foundation model, you could look into fine-tuning an open-source LLM (like Llama, Falcon, etc.) on your specialized dataset.
- Use Anthropic’s Claude API with context window: Claude 3.7 Sonnet has a large context window that allows you to include substantial portions of your content for reference in a single conversation (a minimal API sketch appears below).
For specific information about options within Anthropic’s ecosystem, I’d recommend checking the documentation at https://docs.anthropic.com/en/docs/ or reaching out to Anthropic directly through their support channels (https://support.anthropic.com).
Would you like more information about any of these approaches? I could help you think through which might be most appropriate for your specific needs.
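As a minimal, hedged sketch of the first and third approaches above: retrieved (or pasted) passages from the curated corpus are placed directly in the prompt sent through the Anthropic Python SDK. The model identifier and the passage are illustrative; check Anthropic's current documentation for model names.

```python
# Minimal sketch of approaches 1 and 3 above: put retrieved (or pasted)
# corpus passages directly in the prompt sent to Claude via the Anthropic
# Python SDK. The model identifier below is illustrative; check current docs.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

retrieved_context = (
    "Passage from curated corpus: PCSK9 inhibitors lower LDL cholesterol "
    "by preventing degradation of hepatic LDL receptors."
)
question = "How do PCSK9 inhibitors lower LDL cholesterol?"

message = client.messages.create(
    model="claude-3-7-sonnet-20250219",              # illustrative model identifier
    max_tokens=300,
    messages=[{
        "role": "user",
        "content": f"Answer using only the context below.\n\n{retrieved_context}\n\nQuestion: {question}",
    }],
)
print(message.content[0].text)
```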
I’ll elaborate on how your expert medical content could be utilized in AI model development:
Data Preparation and Processing
- Data Cleaning and Standardization
- Converting all text to a consistent format
- Extracting structured information from the journal articles and e-books
- Transcribing audio podcasts for text analysis
- Creating metadata for all content to enable better organization
- Knowledge Graph Construction
- Building semantic relationships between entities (genes, diseases, drugs, etc.)
- Mapping hierarchical relationships in medical taxonomies
- Creating a navigable structure that preserves the expert knowledge organization
Model Development Approaches
- Domain-Specific Fine-Tuning
- Starting with a base model (like Claude or another foundation model)
- Fine-tuning on your specialized medical content to enhance performance in healthcare contexts
- Creating specialized heads for different tasks (classification, relation extraction, etc.)
- Multimodal Integration
- Combining text understanding with image analysis for the 7500+ biological images
- Developing models that can correlate textual descriptions with visual medical data
- Incorporating audio features from podcasts for comprehensive understanding
- Specialized Models by Domain
- Cardiovascular-specific models from Series A content
- Genomics intelligence models from Series B
- Oncology-focused AI from Series C
- Immunology analysis tools from Series D
- Patient-centered applications from Series E
Potential Applications
- Drug Discovery Acceleration
- Using the Gene-Disease-Drug relationships to identify potential therapeutic targets
- Predicting drug efficacy based on genetic profiles
- Identifying repurposing opportunities for existing drugs
- Clinical Decision Support
- Creating systems that can provide evidence-based recommendations from your expert content
- Developing reasoning engines that mimic expert clinician thought processes
- Building tools that can summarize relevant research for specific patient cases
- Research Augmentation
- Creating AI research assistants that can identify gaps in current knowledge
- Developing tools to connect disparate research findings across subdisciplines
- Building hypothesis generation systems to accelerate scientific discovery
- Precision Medicine Enhancement
- Leveraging genomic data to personalize treatment recommendations
- Identifying patient subgroups most likely to respond to specific interventions
- Creating predictive models for treatment outcomes based on patient characteristics
Implementation Strategy
- Phased Development
- Starting with text-only models focusing on one domain (e.g., cardiovascular)
- Expanding to multimodal capabilities incrementally
- Scaling across all medical domains in your repository
- Evaluation Framework
- Developing specialized benchmarks using a portion of your expert content (a minimal scoring sketch follows after this list)
- Creating evaluation metrics that align with real-world medical decision-making
- Implementing continuous evaluation against new medical literature
- Deployment Considerations
- Creating API interfaces for integration with existing healthcare systems
- Developing user interfaces tailored to different medical professionals
- Ensuring regulatory compliance and ethical use of AI recommendations
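As a minimal, hedged sketch of the Evaluation Framework item above: expert-written question/answer pairs held out from the corpus serve as a benchmark, and model answers are scored with a crude keyword-recall metric. The benchmark items and the ask_model() hook are illustrative placeholders, not part of any existing evaluation suite.

```python
# Minimal evaluation sketch: hold out expert-written Q/A pairs as a benchmark
# and score model answers with a crude keyword-recall metric.
def ask_model(question: str) -> str:
    # Placeholder: in practice this calls the fine-tuned or RAG-backed model.
    return "BRCA1 and BRCA2 mutations raise hereditary breast cancer risk."

benchmark = [
    {
        "question": "Which genes are most associated with hereditary breast cancer?",
        "keywords": {"brca1", "brca2"},
    },
]

def keyword_recall(answer: str, keywords: set) -> float:
    found = {kw for kw in keywords if kw in answer.lower()}
    return len(found) / len(keywords)

scores = [keyword_recall(ask_model(item["question"]), item["keywords"]) for item in benchmark]
print("mean keyword recall:", sum(scores) / len(scores))
```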
This approach would leverage the unique characteristics of your expert-curated content while addressing specific healthcare industry needs through focused AI applications.
Guidance offered by GPT-4.5
[OpenAI.com]
OpenAI on Aviva’s Query
LINK
Aviva’s Query:
I own 9 giga bytes of curated content in Life sciences, Pharmaceutical, medical and Healthcare. I wish to use GPT-4.5 to build a Foundational Model on my content. How can I do that
GPT-4.5 Response:
Guidance offered by Llama 405B
[Meta]
Llama on Aviva’s Query
LINK
Aviva’s Query:
I own 9 giga bytes of curated content in Life sciences, Pharmaceutical, medical and Healthcare. I wish to use Llama 405B to build a Foundational Model on my content. How can I do that
Llama 405B Response:
