Carlos E. Perez
Feb 5 · 3 tweets · 3 min read
1/n An ontology of Large Language Model (LLM) powered Multi-Agents

- Single LLM-based agents have shown promising capabilities such as planning, tool use, memory, and decision making. This has motivated research into multi-agent systems.
- LLM-based multi-agent (LLM-MA) systems aim to leverage multiple specialized agents working in collaboration, offering more advanced problem solving than single agents.

Existing Issues
- Most existing work focuses on single LLM-based agents. There is a lack of systematic analysis of emergent capabilities and issues in LLM-MA systems.
- Early LLM-MA systems have been developed independently. There is an absence of a unified blueprint and taxonomy to connect different aspects like agent profiling, communication protocols etc.
- There is a gap in benchmarks and evaluation methods tailored for assessing collaborative intelligence of LLM-MA systems. Metrics focused on individual agents may overlook emergent group behaviors.
- Open challenges remain in scaling LLM-MA systems, managing collective capabilities, mitigating issues like hallucination, and expanding applications to complex real-world problems.

In summary, while single LLM-agents have made strides, there are open questions regarding formulating, analyzing, evaluating, and advancing collaborative multi-agent systems for sophisticated tasks. Establishing a unified blueprint can accelerate progress.
2/n The LLM-based multi-agent (LLM-MA) ontology consists of four key components:

1. Agents-Environment Interface:
This refers to how agents perceive and interact with the environment (sandbox, physical, or none). Environments could be software applications, embodied robot systems, gaming simulations etc.

2. Agent Profiling:
It deals with characterizing distinct agents based on aspects like roles, capabilities, constraints etc. Common methods are pre-defined profiles, model-generated profiles, or data-derived profiles.

3. Agent Communication:
This encompasses communication paradigms (cooperative, debate, competitive), structures organizing agent interactions (layered, decentralized, centralized), and actual content exchanged (typically textual).

4. Agent Capability Acquisition:
It focuses on how agents obtain feedback to enhance their skills over time via memory, self-evolution by modifying goals/strategies, or dynamic agent generation.

In a nutshell, this ontology systematically connects the agents themselves, the environments they operate in, how they interact to solve problems collectively, and how they acquire knowledge. According to the paper, positioning LLM-MA systems in this framework can enable more structured analysis. The ontology provides a blueprint for continued research and applications.
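As a rough illustration (not from the paper), the four components of the ontology can be captured as a small set of types. All class and field names below are hypothetical — the survey defines the concepts, not this API:

```python
from dataclasses import dataclass, field
from enum import Enum

class Paradigm(Enum):            # Agent Communication: paradigms
    COOPERATIVE = "cooperative"
    DEBATE = "debate"
    COMPETITIVE = "competitive"

class Structure(Enum):           # Agent Communication: structures
    LAYERED = "layered"
    DECENTRALIZED = "decentralized"
    CENTRALIZED = "centralized"

@dataclass
class AgentProfile:              # Agent Profiling: roles, capabilities, constraints
    role: str
    capabilities: list[str]
    constraints: list[str] = field(default_factory=list)

@dataclass
class LLMMASystem:               # ties the four components together
    environment: str             # Agents-Environment Interface: "sandbox", "physical", or "none"
    agents: list[AgentProfile]
    paradigm: Paradigm
    structure: Structure
    feedback_log: list[str] = field(default_factory=list)  # Capability Acquisition: memory hook

    def acquire_capability(self, feedback: str) -> None:
        # stand-in for memory / self-evolution / dynamic agent generation
        self.feedback_log.append(feedback)

system = LLMMASystem(
    environment="sandbox",
    agents=[AgentProfile("planner", ["planning"]), AgentProfile("coder", ["tool use"])],
    paradigm=Paradigm.COOPERATIVE,
    structure=Structure.CENTRALIZED,
)
```

Framing a concrete system this way makes the paper's point tangible: each design choice (paradigm, structure, profiling method) becomes an explicit, comparable dimension.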
3/n Here's the survey paper:
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
DOI: 10.13140/RG.2.2.36311.85928
researchgate.net/publication/37…

More from @IntuitMachine

Feb 3
1/n Discovered this book (h/t @Extended_Brain). Let's look into some nuggets of wisdom!
2/n In his chapter on "Personal Knowledge," Michael Polanyi argues that all knowledge involves personal participation and commitment on the part of the knower. He introduces the concept of "tacit knowing" to describe the process by which personal knowledge is accumulated. Tacit knowing stands in contrast to the ideals of detached objectivity and value neutrality often associated with scientific knowledge.

At the heart of tacit knowing is subsidiary awareness—attending to one thing by focusing on another related or connected thing. For example, we may identify a person by his clothes, or we attend to the weight of a hammer in our palm as we focus on driving the nail. What we are focally aware of and what we are subsidiarily aware of mutually depend on each other in tacit knowing. Our subsidiary awareness of clues, instruments, and context allows us to comprehend the focal target, while the target itself determines what counts as clues or instruments relevant to discerning its nature.

Tacit knowing pervades multiple forms of skillful achievement, including practical skills like cycling and swimming but also more abstract capabilities like reading comprehension or facial recognition. It has a from-to structure—we go from perception of subsidiaries to comprehension of a coherent whole. This always involves our active shaping and organizing of subsidiaries to integrate them for meaning.

Polanyi identifies three key aspects to tacit knowing: functional, phenomenal, and semantic. The functional aspect is the from-to relation itself and how we dwell in the particulars to attend to the whole. The phenomenal aspect is that through integrative acts like binocular vision or reading, we achieve a new phenomenal experience beyond what direct inspection of the parts would indicate. Finally, the semantic aspect is the meaning-giving relationship where subsidiaries acquire their sense by bearing on the focus.

An important implication is that all knowledge depends on personal judgment to turn clues into comprehension. There are no explicit rules determining what coheres or what is meaningful. As Polanyi puts it, "into every act of knowing there enters a tacit and passionate contribution of the person knowing what is being known." While aiming at an external reality, our understanding relies fundamentally on internal processes of integration that connect knower and known. Tacit knowing is an inescapable and universal feature of human knowledge.
3/n The Reconstruction chapter explores how Polanyi's theory of personal knowledge and tacit integration can help reconstruct our understanding of science and knowledge after the damage done by positivism and radical skepticism. He wants to show how personal participation and intuition are essential to science.

A major target is the mistaken ideal of detached objectivity. Polanyi argues that all knowledge depends on commitments, beliefs, and personal judgments that shape how we integrate clues into coherence and meaning. Even facts of science rely on scientists skillfully reading instruments in ways that involve unspecifiable personal elements. There is no totally explicit, impersonal kind of knowing.

Imagination and intuition play crucial roles in this view of science:

1. Intuition guides recognition of a problem and assessment of whether it is promising to pursue based on subtle clues. This "strategic intuition" shapes the basic vision scientists have for where truth may lie hidden.

2. Imagination then drives persistent efforts to search for clues and piece together patterns toward a possible solution, in a quest guided broadly by intuition about what seems plausible and meaningful.

3. Finally, intuition spontaneously offers an integrative vision that may solve the problem, often after unconscious incubation of ideas mobilized by the imagination. This "concluding intuition" provides the fulfillment of meaning.

So intuition sets the direction, imagination does the hard work, and intuition synthesizes the fruits of inquiry. This cycle of imagination and intuition can lead to moments of discovery and insight that scientifically reveal reality.

For Polanyi, imagination creatively anticipates realities that may manifest themselves in the future. Intuition senses possibilities for systematic meaning by dwelling in the implications of existing knowledge. Together, they push science to expand into the unknown guided by a slope of deepening meaning, rather than just accumulating facts.

This theory of discovery opposes mechanical views of scientific reasoning. It requires accepting non-explicit personal judgments of coherence at the heart of science. Appreciating the role of imagination and intuition allows restoring a richer understanding of scientific inquiry as an open-ended human process aiming to articulate hidden realities.

The key is recognizing that all knowledge involves skillful integration of particulars guided by ideals of coherence and purpose. Science is inescapably shaped by the personal participation of dedicated thinkers seeking meaningful truths about the world.
Feb 2
1/n A Taxonomy for Multi-Modal Large Language Models

Architecture
The architecture consists of 5 key components:

1. Modality Encoder: Encodes inputs from modalities like image, video, audio into feature representations. Common options include NFNet-F6, ViT, CLIP ViT, C-Former, etc.

2. Input Projector: Aligns non-text modality features to the text feature space of the LLM. This uses cross-attention, Q-Former, P-Former, or simple MLPs/linear layers.

3. LLM Backbone: Core large language model that processes aligned multi-modal representations and generates textual outputs + signal tokens for conditional generation. Popular choices are Flan-T5, Vicuna, OPT, LLaMA, etc.

4. Output Projector: Maps signal token representations into features that can be understood by the Modality Generator. Uses a Tiny Transformer or MLP.

5. Modality Generator: Generates outputs in modalities like image, video, audio conditioned on the mapped features. Typically uses off-the-shelf latent diffusion models like Stable Diffusion, AudioLDM, etc.

Training Pipeline:
The training pipeline has 2 key stages -

1. Multi-Modal Pre-Training: trains the Input and Output Projectors using image-text, video-text, audio-text datasets to align modalities. May fine-tune small trainable parameters in LLM backbone using methods like prefix tuning.

2. Multi-Modal Instruction Tuning: further trains the model on instruction-formatted datasets using reinforcement learning from human feedback. This enhances the model's alignment with human preferences and its interaction capabilities.
2/n The input process flow

1. Modality Encoder:
- Encodes inputs from modalities like image, video, audio into feature representations.
Example:
- Input: An image of a cat
- CLIP ViT encoder encodes it into a 768-d feature vector representing the visual concepts in the image

2. Input Projector
- Projects non-text modality features into the textual feature space of LLM
Example:
- The 768-d cat image feature from CLIP ViT
- A linear layer projects it into a 1024-d vector aligned with text vector space
- Other options like cross-attention, Q-Former can also achieve this alignment

3. LLM Backbone
- Core large language model that processes the aligned multi-modal representations
Example:
- The 1024-d projected cat image feature vector
- Textual caption describing the image: "A cute cat playing with a ball of yarn"
- These text and image features are fed into the LLM backbone like OPT or LLaMA
- The LLM encodes them into a joint representation in its latent space and generates relevant outputs

So in summary, the modality encoders create non-text representations, input projectors transform them into an LLM-compatible space, and the LLM backbone fuses information from all aligned modalities to understand concepts across modalities. The flow enables the fusion of multi-modal knowledge into the LLM.
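The input flow above can be sketched numerically. This is a minimal stand-in, not a real model: the random vectors fake a CLIP-ViT encoder's output, and the random matrix fakes a trained linear projector:

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Modality Encoder (stand-in): a CLIP-ViT-like encoder would map the
#    cat image to a 768-d feature vector; here we fake that output.
image_feature = rng.standard_normal(768)

# 2. Input Projector: a linear layer aligning the 768-d visual feature
#    with the LLM's 1024-d text embedding space. The weight matrix here
#    is random -- in practice it is learned during multi-modal pre-training.
W = rng.standard_normal((1024, 768)) / np.sqrt(768)
projected = W @ image_feature                       # shape (1024,)

# 3. LLM Backbone (stand-in): the projected image vector is prepended to
#    the text token embeddings of the caption and fed to the LLM jointly.
text_embeddings = rng.standard_normal((12, 1024))   # e.g. a 12-token caption
llm_input = np.vstack([projected[None, :], text_embeddings])

print(llm_input.shape)   # one image token + 12 text tokens, all 1024-d
```

The point of the sketch is the shape bookkeeping: after projection, image and text tokens live in the same 1024-d space, so the backbone can attend over them uniformly.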
3/n The output process flow

Here is an explanation of the flow from LLM backbone to output projector to modality generator with examples:

1. LLM Backbone
- Core large language model that processes aligned multi-modal representations
- Can generate text describing desired outputs in other modalities

Example:
- Encoded features of an image of a dog along with textual caption
- LLM backbone (like PaLM) generates text: "generate a 1920x1080 image morphing the dog into a cat"

2. Output Projector
- Maps the text encoding from LLM backbone into features compatible with target modality

Example:
- The text encoding from LLM backbone representing the "morph dog into cat" instruction
- An MLP output projector transforms it into a latent feature vector

3. Modality Generator
- Generates outputs in target modalities conditioned on the projected features

Example:
- Latent vector representation of "morph dog into cat" instruction
- Stable Diffusion image generator uses that conditioning vector
- Generates a 1920x1080 image morphing the dog image into a cat by neural rendering

So in summary, the LLM backbone generates descriptive texts of desired outputs, output projectors transform those text features into compatible latent spaces, and modality generators use those features to synthesize novel outputs. This enables multi-modal generative capabilities via text-conditioning.
Feb 1
1/n Introducing RAPTOR

Existing RAG methods suffer from a major limitation: they can only retrieve short, contiguous passages of text. This restricts their capacity to represent cross-document discourse structure and leverage thematic information scattered across lengthy corpora. As a result, performance suffers on complex questions requiring multi-step inference or synthesis of knowledge from multiple sections.

Fixed language models also face challenges staying up-to-date, as baking vast world knowledge into model parameters makes it arduous to edit or append facts. Yet relying on outdated embedded knowledge severely impairs real-world reliability and accuracy.

This paper introduces RAPTOR, a novel recursive abstraction paradigm that overcomes both issues through hierarchical multi-document representation. RAPTOR segments text, then recursively clusters, summarizes, and embeds passages. This structures corpora into multi-layer trees encoding information at varying levels of abstraction.

Querying this rich tree representation allows integrating details and high-level themes simultaneously. Controlled experiments exhibit consistent improvements over baseline retrievers across several QA datasets. Moreover, by augmenting powerful readers like GPT-4, RAPTOR reaches new state-of-the-art results on multifaceted reasoning tasks requiring nuanced understanding of lengthy narratives.

Modularizing knowledge into RAPTOR’s index also facilitates updating world facts. As corpus contents evolve, the reader persists unaltered, flexibly adapting to current information needs. This crucial agility makes RAPTOR invaluable for dynamic real-world deployments.

In summary, RAPTOR provides a sorely lacking solution for multi-document reasoning and updatable retrieval-based QA. Leveraging recursive summarization and abstraction, it encodes corpora with sufficient semantic depth for complex queries. RAPTOR delivers substantial gains; its strong empirical performance confirms the merits of tree-based hierarchical retrieval augmentation.
2/n The RAPTOR process:

1. Text Segmentation
- Split retrieval corpus into short, contiguous chunks of 100 tokens, similar to traditional methods
- Keep sentences intact even if over 100 tokens to preserve coherence

2. Text Embedding
- Embed text chunks using SBERT to get dense vector representations

3. Clustering
- Employ soft clustering using Gaussian Mixture Models and UMAP dimensionality reduction
- Vary UMAP parameters to identify global and local clusters
- Use Bayesian Information Criterion for model selection to determine optimal number of clusters

4. Summarization
- Summarize the chunks in each cluster using a language model
- Results in a condensed summary capturing key information

5. Node Creation
- Clustered chunks + corresponding summary = new tree node

6. Recursive Processing
- Repeat steps 2-5: Re-embed summaries, cluster nodes, generate higher level summaries
- Forming a multi-layer tree from the bottom up
- Until clustering is infeasible (final root node summarizes the entire corpus)

7. Retrieval
- Two methods: tree traversal (top-down layer by layer) or collapsed tree (flattened view)
- For each, compute cosine similarity between query and nodes to find most relevant

So in summary, RAPTOR leverages recursive clustering and summarization of text chunks to create a hierarchical tree structure for more effective contextual retrieval.
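The build loop in steps 2-7 can be sketched end to end. Everything here is a toy stand-in for the paper's components: `embed` replaces SBERT, `summarize` replaces the LLM summarizer, and the greedy similarity pass replaces GMM + UMAP soft clustering — only the recursive bottom-up shape and collapsed-tree retrieval are faithful:

```python
import math

def embed(text):                      # toy bag-of-words embedding (stands in for SBERT)
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a.get(k, 0) * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def summarize(texts):                 # stands in for LLM summarization of a cluster
    return " ".join(texts)[:200]

def cluster(nodes, threshold=0.2):    # greedy stand-in for GMM + UMAP soft clustering
    clusters = []
    for node in nodes:
        for c in clusters:
            if cosine(embed(node), embed(c[0])) >= threshold:
                c.append(node)
                break
        else:
            clusters.append([node])
    return clusters

def build_tree(chunks):
    layers = [chunks]                 # layer 0: raw 100-token chunks
    while len(layers[-1]) > 1:
        groups = cluster(layers[-1])
        if len(groups) == len(layers[-1]):        # clustering infeasible ->
            layers.append([summarize(layers[-1])])  # single root summary
            break
        layers.append([summarize(g) for g in groups])  # re-embed & recurse
    return layers                     # bottom-up layers, root last

def collapsed_tree_retrieve(layers, query, k=2):
    all_nodes = [n for layer in layers for n in layer]  # flatten every layer
    q = embed(query)
    return sorted(all_nodes, key=lambda n: cosine(q, embed(n)), reverse=True)[:k]

layers = build_tree(["cats chase mice", "dogs chase cats",
                     "stars emit light", "light travels fast"])
print(collapsed_tree_retrieve(layers, "cats and dogs"))
```

Because retrieval flattens all layers, a query can match either a fine-grained leaf chunk or a higher-level summary — which is exactly how RAPTOR mixes details with themes.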
3/n Summary of key related work

Retrieval Methods
- Use standard chunking to index passages
- RAPTOR creates recursive tree structure with hierarchical summarization

Joint Passage Retrieval
- Tree decoding to handle passage diversity
- RAPTOR clusters semantically related passages

Summarization Models
- Recursive summarization using task decomposition
- RAPTOR allows flexible grouping and keeps intermediate details

Dense Hierarchical Retrieval
- Combines document and passage retrievals
- RAPTOR focuses on passage-level, adds recursive abstraction

Long Context Models
- Expand context lengths models can handle
- RAPTOR provides relevant subsets of text
Jan 30
1/n Exploiting Large Language Models (LLMs), RAG and KGs for Creative Design

A recent paper makes a compelling case for the tremendous yet untapped potential of large language models (LLMs) to transform materials science research. However, the authors thoughtfully acknowledge critical "pain points" in relying solely on the raw capabilities of LLMs in this complex domain. Accuracy, nuance, interpretability, reasoning - on all these fronts, LLMs fall short without a guiding hand.

That's exactly why this paper shines. It outlines strategies to partner with LLMs to elicit their strengths while overcoming weaknesses. Retrieval-augmented generation (RAG) supplies the missing context to ground responses. Knowledge graphs (KGs) organize concepts ontologically to lend structure and meaning. Non-linear prompting channels creativity through critical filters. Diverse model collectives enable cooperative discovery.

What emerges is a vision for a new paradigm - LLMs not as opaque oracles, but as flexible components in an intelligible, distributed materials discovery infrastructure. One where human researchers set the objectives, models rapidly compound knowledge through code and data, and reciprocal feedback loops drive exploration.

This paper thus makes a timely case that to fully actualize the manifest benefits of AI in advancing materials science, we must elevate these powerful models to collaborators in a hybrid intelligence system built on transparency, trust, and shared creativity fueled by human curiosity.
2/n Main strategies covered:

1) Retrieval-augmented generation (RAG) methods to inject additional knowledge into the generative process to improve accuracy. RAG is highlighted as a powerful approach, especially when combined with graph-based methods.

2) Ontological knowledge graphs to provide interpretable structure that captures concepts and relationships. This facilitates mechanistic insights and more detailed responses from the LLM.

3) Nonlinear sampling techniques like tree-of-thought prompting to iteratively refine and improve responses, overcoming limitations of single-shot linear sampling.

4) Multi-agent models where specialized LLMs collaborate and interact autonomously to solve complex multimodal problems. This illustrates promise for advanced applications like automated force-field development.
3/n RAG methods

1) MechGPT is tested on a series of domain-specific questions related to materials failure. It shows reasonable performance and provides accurate answers to these complex questions from its trained knowledge.

2) Retrieval augmented generation (RAG) is then used, where relevant context from a corpus is provided along with the question to MechGPT. This does not significantly improve the answers in this case since MechGPT has already been well-trained on the mechanics failure knowledge required for the questions.

3) An edge case is shown where MechGPT fails to provide accurate info on "molybdenene", a recently published material not in its training data. It incorrectly states molybdenene is theoretical.

4) Using RAG with the molybdenene paper as the knowledge source leads to much improved responses. Multiple detailed Q&A pairs demonstrate MechGPT can now accurately describe key features like the square lattice structure and predicted brittle fracture behavior.

In summary, MechGPT shows reasonable domain-specific question answering performance. RAG improves responses for out-of-training-distribution cases but does not further enhance in-distribution performance. This highlights strengths but also limitations of relying solely on an LLM's parameter-based knowledge.
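The RAG step in the molybdenene experiment boils down to: retrieve the most relevant passage, then prepend it to the question. A minimal sketch, with a hand-rolled word-overlap scorer standing in for a real dense retriever and invented corpus sentences (the molybdenene facts below are for illustration only, not quoted from the paper):

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus; in the paper this is the molybdenene publication.
corpus = [
    "Molybdenene is a recently synthesized 2D material with a square lattice.",
    "Graphene is a 2D material with a hexagonal lattice.",
    "Steel fails by ductile fracture under tension.",
]

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def score(query, passage):
    # word-overlap scoring, length-normalized -- a stand-in for
    # embedding-based cosine similarity in a real RAG pipeline
    q, p = Counter(tokens(query)), Counter(tokens(passage))
    overlap = sum((q & p).values())
    return overlap / math.sqrt(sum(p.values()) or 1)

def rag_prompt(question):
    best = max(corpus, key=lambda p: score(question, p))
    return f"Context: {best}\n\nQuestion: {question}\nAnswer:"

print(rag_prompt("What is the lattice structure of molybdenene?"))
```

This is why RAG fixed the out-of-distribution failure: the model no longer answers from stale parameters — the retrieved passage carries the post-training fact into the prompt.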
Jan 23
1/n The most important civic duty that a nation can instill in its citizens is the importance of life-long learning. This goes beyond access to education for our children. It involves a culture that leans toward healthy collaboration and drives toward sustained innovation.
2/n It is no surprise that so many citizens feel left out in today's system. People have never learned the skills to learn independently. But AI radically remedies this deficit! GPT-like systems are tireless teachers who can adapt their conversations to a student's cognitive biases and limitations.
3/n All learning agents frame their understanding by projecting their observations into perspectives that their minds have previously adopted. We are agents with learned cognitive biases. Furthermore, these biases are encoded and reinforced in language.
Jan 19
1/n Let's talk about the Flow Engineering discussed in the AlphaCodium paper:

The paper introduces the concept of "flow engineering" to characterize their proposed approach of AlphaCodium, and contrasts it with typical "prompt engineering" methods. The use of the term "flow engineering" can be justified in the following ways:

1. Multi-stage iterative process: AlphaCodium involves a structured, test-driven flow with progressive stages - problem analysis, test generation, initial coding, and iterative run-fix cycles. This goes beyond crafting an optimal prompt.

2. Incorporating code execution: The flow deeply integrates execution of the generated code against input-output examples into the modeling process, rather than purely focusing on static prompt tuning. This dynamic run-fix iteration on increasing tests sets it apart.

3. Scaffolding code development: The multi-step methodology provides a scaffolding that mirrors the software development process by incrementally going from specifications to code, resembling test-driven cycles.

4. Code-centric techniques: Several techniques tailor-made for code tasks supplement the basic flow - modular code prompting, test anchors to prevent code divergence, and output validation using test suites.

5. Knowledge accumulation: Each stage in the AlphaCodium flow builds up artifacts, learnings and validated components which are accumulated to aid downstream steps - a departure from one-off prompt engineering.

In summary, the use of the term "flow engineering" underscores the process-centric, execution-backed, and code-aware nature of the methodology going beyond static prompt design. It better captures the iterative, test-driven, development-mimetic essence.
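The run-fix cycle at the heart of the flow can be sketched in a few lines. This is a toy illustration, not AlphaCodium's implementation: `propose_fix` stands in for an LLM repair call, and the buggy/fixed candidates are hard-coded:

```python
# Input-output tests act as the anchors the candidate must pass.
tests = [((2, 3), 5), ((0, 0), 0), ((-1, 4), 3)]

buggy = "def add(a, b):\n    return a - b"   # initial (wrong) candidate
fixed = "def add(a, b):\n    return a + b"

def run_tests(code):
    """Execute the candidate and return the failing (args, want, got) triples."""
    ns = {}
    exec(code, ns)
    return [(args, want, ns["add"](*args))
            for args, want in tests if ns["add"](*args) != want]

def propose_fix(code, failures):
    # Stand-in for the LLM: shown the code and its failures, it would
    # generate a repaired candidate. Here it just returns the known fix.
    return fixed

def run_fix_loop(code, max_iters=3):
    for _ in range(max_iters):
        failures = run_tests(code)
        if not failures:
            return code              # all test anchors pass -> done
        code = propose_fix(code, failures)
    raise RuntimeError("could not repair candidate within budget")

print(run_fix_loop(buggy))
```

The structural point survives the toy: execution feedback, not prompt wording, drives each iteration — which is what distinguishes flow engineering from prompt engineering.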
2/n This paper is fascinating in that it introduces a novel way of viewing reasoning processes that influence both long chains of inference and subsequent retraining.

The paper proposes several code-oriented design concepts and best practices:

1. YAML Structured Output:
- Ask the model to generate output in YAML format conforming to a given Pydantic class definition.
- Eliminates need for complex prompt engineering, allows complex structured answers.
- More suitable than JSON for code due to handling of quotes, special chars etc.

2. Semantic Reasoning via Bullet Points:
- When asking the model to reason about a problem, use bullet point format.
- Forces splitting into logical sections, improves understanding.

3. Modular Code Generation:
- Ask the model to divide code into small sub-functions with meaningful names.
- Results in better code quality, easier iterative fixing.

4. Soft Decisions with Double Validation:
- Avoid strict decisions by the model which lead to hallucinations.
- Double validate potentially erroneous outputs.

5. Postponing Decisions and Exploration:
- Gradually move from easier to harder tasks, avoiding irreversible decisions early on.
- Leave room for exploring multiple possible solutions.

6. Test Anchors:
- When iterating on potentially invalid AI-generated tests, the model may fix code incorrectly.
- Use already-passed tests as anchors to detect erroneous fixes.

It incorporates many of the best practices of agile software development in a machine learning optimization process!
3/n I'm pleasantly surprised to discover insightful principles like "soft decisions with double validation":

1. Motivation:
- Language models often struggle when required to make strict, non-trivial decisions regarding complex issues.
- This leads to hallucinations and erroneous answers.

2. Technique:
- Avoid asking direct yes/no questions about complicated problems.
- Instead, adopt a gradual flow from easier to harder tasks.

- For example, when generating additional tests for a problem:
- First generate tests, then validate them.
- Rather than asking "is this test correct?"

- Use double validation:
- Given a generated output, ask the model to regenerate it while correcting errors.
- Encourages critical reasoning rather than yes/no judgement.

3. Example:
- When generating additional input-output tests for a coding problem:
- Firstly generate tests covering aspects missed by public tests.
- Then show the generated tests back to the model.
- Ask it to regenerate the tests while fixing any errors.

4. Benefits:
- Avoids strict decisions prematurely.
- Allows open-ended exploration first, validates later.
- Double validation improves quality by self-correction.

So in summary, the key ideas are to avoid rigid decisions early on, gradually build knowledge, and validate potentially erroneous outputs by regenerating them while allowing corrections, instead of demanding yes/no judgements. This technique of "soft decisions" coupled with "double validation" improves models' reasoning abilities.
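The two-pass pattern can be sketched concretely. Both calls to `model` are stubs for an LLM; the flawed first draft and its correction are hard-coded to show the shape of the interaction, not real model behavior:

```python
def model(prompt):
    # Stub LLM. A real system would call the model both times; here the
    # responses are canned to illustrate the two-pass pattern.
    if "regenerate" in prompt:
        # second pass: shown its own draft, the model corrects the
        # wrong expected output (2 + 2 is 4, not 5)
        return "assert add(2, 2) == 4"
    return "assert add(2, 2) == 5"    # first pass: flawed generated test

def generate_with_double_validation(task):
    # Soft decision: never ask "is this test correct?" (yes/no).
    # Instead, regenerate the artifact while allowing corrections.
    draft = model(f"Generate a test for: {task}")
    return model(f"Here is a draft test:\n{draft}\n"
                 f"Please regenerate it, fixing any errors.")

print(generate_with_double_validation("an add(a, b) function"))
```

The design choice is subtle but important: regeneration forces the model to re-derive the artifact, engaging its reasoning, whereas a yes/no verdict invites a shallow (and often hallucinated) judgement.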
