@TeraflopAI is excited to help support @caselawaccess and @HarvardLIL in the release of over 6.6 million state and federal court decisions published throughout U.S. history.
In collaboration with Ravel Law, @hlslib digitized over 40 million pages of U.S. court decisions, comprising 6.7 million cases from the last 360 years, into a dataset that is widely accessible. You can bulk download the data using the CAP API: case.law/caselaw/
It is important to democratize fair access to this data for the public, the legal community, and researchers. You can find a processed and cleaned version of the data on @huggingface here: huggingface.co/datasets/Teraf…
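As a rough sketch of loading it with the `datasets` library (the full repository id is truncated in the link above, so the id below is a placeholder):

```python
# Sketch: streaming the cleaned caselaw dataset with Hugging Face `datasets`.
# The repository id is a placeholder -- use the full id from the link above.
from datasets import load_dataset

cases = load_dataset("TeraflopAI/<dataset-id>", split="train", streaming=True)

# Inspect the first few records and their fields without downloading everything.
for case in cases.take(3):
    print(case)
```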
You can find more information about accessing the written common law decisions of state and federal courts through the bulk data service documentation: case.law/docs/
You can learn more about the Caselaw Access Project and all of the phenomenal work done by Jack Cushman, @leppert, and @macargnelutti here: case.law/about/
During the digitization of these texts, OCR errors occurred. To prepare the texts for model training, we post-processed each one to fix encoding, normalization, repetition, redundancy, parsing, and formatting issues.
Teraflop AI’s data engine allows for the massively parallel processing of web-scale datasets into cleaned text. Our one-click deployment allowed us to easily split the computation across thousands of nodes on our managed infrastructure.
Thank you to @nomic_ai for providing us with Atlas research credits to store and visualize each of the jurisdictions in this dataset. You can view a Nomic Atlas map of New York state court decisions here: atlas.nomic.ai/data/teraflop-…
You can access the New York jurisdiction map and all of the other @nomic_ai Atlas maps on @huggingface here: huggingface.co/spaces/Teraflo…
Nomic’s Atlas projection algorithm clusters semantically similar data together, generating a topic hierarchy. You can find more information about @nomic_ai and Atlas here: docs.nomic.ai/atlas/capabili…
@nomic_ai released nomic-embed-text-v1.5, an open-source text embedding model with an 8192-token context length. The embeddings for the Atlas maps are generated by this model. You can find more information about the model release here:
You can find the research paper detailing the methodologies used by @zach_nussbaum, @andriy_mulyar, @jxmnop, and Brandon Duderstadt for the nomic-embed-text-v1.5 model here: static.nomic.ai/reports/2024_N…
The nomic-embed-text-v1.5 model is widely accessible on @huggingface. The model card provides training, usage, and benchmark information about the model. huggingface.co/nomic-ai/nomic…
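As a rough usage sketch (following the pattern on the model card; nomic-embed models expect a task prefix such as "search_document:" on each input):

```python
# Sketch: embedding text with nomic-embed-text-v1.5 via Sentence Transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

docs = ["search_document: The court held that the statute of limitations had run."]
embeddings = model.encode(docs, normalize_embeddings=True)
print(embeddings.shape)  # (1, 768)
```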
@nomic_ai provides a library for training embedding models and reproducing the results from the research paper. The @GitHub repository can be found here: github.com/nomic-ai/contr…
We additionally provide bge-base-en-v1.5 embeddings for the first 512 tokens of each case across the state and federal jurisdictions, alongside the post-processed documents. Mean pooling and normalization were used to produce the embeddings: huggingface.co/datasets/Teraf…
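A minimal sketch of that setup, assuming the standard transformers API (truncate to 512 tokens, mean pool over non-padding tokens, then L2-normalize):

```python
# Sketch: bge-base-en-v1.5 embeddings with mean pooling and normalization.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5").eval()

texts = ["The appellate court reversed the lower court's ruling."]
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, 768)

mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, seq_len, 1)
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean over real tokens
embeddings = F.normalize(pooled, p=2, dim=1)           # unit-length vectors
print(embeddings.shape)  # (1, 768)
```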
We used the Sentence Transformers library maintained by @tomaarsen of @huggingface to distribute the embedding process across multiple GPUs. You can find an example of how to use multiprocessing for embeddings here: github.com/UKPLab/sentenc…
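Roughly, the multi-GPU path looks like this (the device list and batch size are illustrative):

```python
# Sketch: distributing encoding across GPUs with the Sentence Transformers
# multi-process pool.
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    model = SentenceTransformer("BAAI/bge-base-en-v1.5")

    # One worker process per listed device (illustrative).
    pool = model.start_multi_process_pool(target_devices=["cuda:0", "cuda:1"])

    sentences = ["case text one", "case text two", "case text three"]
    embeddings = model.encode_multi_process(sentences, pool, batch_size=32)
    print(embeddings.shape)

    model.stop_multi_process_pool(pool)
```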
We improved the inference throughput of the embedding process by using @tri_dao’s Flash Attention. You can find the Flash Attention repository here: github.com/Dao-AILab/flas…
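For reference, the core kernel is exposed roughly like this (requires a CUDA GPU and fp16/bf16 tensors; shapes here are illustrative):

```python
# Sketch: calling the FlashAttention kernel directly.
# q, k, v have shape (batch, seq_len, num_heads, head_dim) in half precision.
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 512, 12, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 512, 12, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 512, 12, 64, device="cuda", dtype=torch.float16)

# Non-causal attention, as used in bidirectional encoder models.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=False)
print(out.shape)  # (2, 512, 12, 64)
```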
You can read the research paper on the BGE embedding models by Shitao Xiao and @zzzheng_liu here: arxiv.org/pdf/2309.07597…
The code for training BGE embedding models and other great research efforts can be found on @GitHub here: github.com/FlagOpen/FlagE…
The bge-base-en-v1.5 model weights are available on @huggingface. The model card provides news, a list of other available models, training, usage, and benchmark information. huggingface.co/BAAI/bge-base-…
We built a FAISS index over all of the post-processed legal texts using the BGE embeddings. The index consists of ~6.6 million dense vectors, and the average search latency for a query over the entire index is 12.46 milliseconds. huggingface.co/datasets/Teraf…
The FAISS library by @Meta allows you to perform k-nearest neighbor search efficiently and in a scalable way over millions of dense vectors. You can find the FAISS library here: github.com/facebookresear…
The combination of an Inverted File index (IVF), Product Quantization (PQ), and Hierarchical Navigable Small World (HNSW) graphs allows us to run queries across all of the dense vectors in milliseconds. You can find more information about these index types here: github.com/facebookresear…
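A small sketch of composing such an index with the FAISS index factory (the index string and parameters here are illustrative, not the exact configuration of the released index):

```python
# Sketch: IVF with an HNSW coarse quantizer plus product quantization.
import numpy as np
import faiss

d = 768                                             # bge-base-en-v1.5 dimension
xb = np.random.rand(100_000, d).astype("float32")   # stand-in for real embeddings
xq = np.random.rand(5, d).astype("float32")

# "IVF1024_HNSW32" = 1024 inverted lists whose centroids are searched via HNSW;
# "PQ64" = each vector compressed into 64 sub-quantizer codes.
# The embeddings are L2-normalized, so L2 distance ranks like cosine similarity.
index = faiss.index_factory(d, "IVF1024_HNSW32,PQ64")
index.train(xb)
index.add(xb)

index.nprobe = 16                                   # inverted lists scanned per query
distances, ids = index.search(xq, 5)
print(ids)
```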
Thank you to @ShayneRedford and @RobertMahari of the MIT @medialab and Data Provenance Initiative for helping us make this connection. Please check out the post on DPI here:
Additionally, a big thank you to @jonbtow of @StabilityAI, @barry_zyj, @samcwl, @AiEleuther, Daniel Chang, and the many others who have been supportive over these last months.
We plan to release trillions of commercially licensed text tokens, as well as images, audio, video, and other datasets spanning numerous domains and modalities over the coming months. Be sure to follow us or reach out if you need help collecting and processing data at the petabyte scale.
Releasing Yarn-Llama-2-13b-128k, a Llama-2 model trained for 128k context length using YaRN scaling. The model was trained in collaboration with u/bloc97 and @theemozilla of @NousResearch and @Void13950782 of @AiEleuther.
We worked to extend the context length of the Llama-2 13b and 7b models through fine-tuning. The models pass all our evaluations and maintain the same perplexity at 128k extrapolation, surpassing the performance of our other recent methodology, NTK-by-parts scaling.
Releasing LLongMA-2 13b, a Llama-2 model trained at 8k context length using linear positional interpolation scaling. The model was trained in collaboration with @theemozilla of @NousResearch and @kaiokendev1.
We worked directly with @kaiokendev1 to extend the context length of the Llama-2 13b model through fine-tuning. The model passes all our evaluations and maintains the same perplexity at 8k extrapolation, surpassing the performance of other recent methodologies.
Releasing LLongMA-2, a suite of Llama-2 models trained at 8k context length using linear positional interpolation scaling. The models were trained in collaboration with @theemozilla of @NousResearch and @kaiokendev1. huggingface.co/conceptofmind/…
We worked directly with @kaiokendev1 to extend the context length of the Llama-2 7b model through fine-tuning. The models pass all our evaluations and maintain the same perplexity at 8k extrapolation, surpassing the performance of other recent methodologies.
The models have similar performance to LLaMA 2 at context lengths under 4k, scale directly to 8k, and work out-of-the-box with the new version of transformers (4.31), or with `trust_remote_code` for versions <= 4.30.
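As an illustration of the mechanism (not the exact released checkpoint), transformers >= 4.31 exposes linear positional interpolation through the `rope_scaling` config:

```python
# Sketch: linear positional interpolation on a Llama-2 base model with
# transformers >= 4.31. A factor of 2.0 stretches the 4k-trained RoPE
# positions over an 8k window; the LLongMA-2 checkpoints were fine-tuned
# with this scaling applied.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # base model shown for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    rope_scaling={"type": "linear", "factor": 2.0},
)
```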
I worked with @ShayneRedford, the main author of the FLAN collection, to recreate his great work and publicly release high-quality instruction tuning data. We fixed encoding issues and also increased the sequence length to 4096.
Introducing three new open-source PaLM models trained at a context length of 8k on C4. Open-sourcing LLMs is a necessity for the fair and equitable democratization of AI. The models of sizes 150m, 410m, and 1b are available to download and use here: github.com/conceptofmind/…
The models are also compatible with many of Lucidrains' popular repositories such as Toolformer-pytorch, PaLM-rlhf-pytorch, and PaLM-pytorch. Please be sure to sponsor and help support Phil's great work: github.com/lucidrains/PaL…
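For instance, instantiating the PaLM-pytorch architecture looks roughly like this (the hyperparameters below are illustrative, not those of the released 150m/410m/1b checkpoints):

```python
# Sketch: the PaLM-pytorch API from lucidrains' repository.
import torch
from palm_pytorch import PaLM

palm = PaLM(
    num_tokens=20000,  # vocabulary size (illustrative)
    dim=512,
    depth=12,
    heads=8,
    dim_head=64,
)

tokens = torch.randint(0, 20000, (1, 1024))
logits = palm(tokens)  # (1, 1024, 20000)
```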
Our work on Toolformer, PaLM, and related projects is all thanks to the generous sponsorship by @carperai and @StabilityAI.