Enrico Shippole · Mar 8
@TeraflopAI is excited to help support @caselawaccess and @HarvardLIL in the release of over 6.6 million state and federal court decisions published throughout U.S. history.
In collaboration with Ravel Law, @hlslib digitized over 40 million pages of U.S. court decisions, comprising 6.7 million cases spanning the last 360 years, into a dataset that is widely accessible. You can bulk download the data using the CAP API: case.law/caselaw/
Democratizing fair access to this data for the public, the legal community, and researchers is important. You can find a processed and cleaned version of the data available on @huggingface here: huggingface.co/datasets/Teraf…
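As a rough illustration, here is how the processed data might be loaded with the `datasets` library; the exact dataset repo id is behind the truncated link above, so the id below is a placeholder.

```python
from datasets import load_dataset

# Placeholder repo id; use the dataset linked above.
cases = load_dataset(
    "TeraflopAI/caselaw-access-project",  # hypothetical id
    split="train",
    streaming=True,  # stream instead of downloading everything up front
)

for case in cases.take(3):
    print(case.keys())
```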
You can find more information about accessing state and federal written court decisions of common law through the bulk data service documentation: case.law/docs/
You can learn more about the Caselaw Access Project and all of the phenomenal work done by Jack Cushman, @leppert, and @macargnelutti here: case.law/about/
Digitization introduced OCR errors into many of these texts. To prepare them for model training, we post-processed each document to fix encoding, normalization, repetition, redundancy, parsing, and formatting issues.
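Teraflop AI's actual pipeline is not public, so the snippet below is only a sketch of the kind of per-document cleanup described here, assuming ftfy for encoding repair plus standard Unicode and whitespace normalization.

```python
import re
import unicodedata

import ftfy  # encoding / mojibake repair


def clean_case_text(text: str) -> str:
    """Illustrative cleanup pass; not the actual Teraflop AI pipeline."""
    text = ftfy.fix_text(text)                  # repair broken encodings
    text = unicodedata.normalize("NFKC", text)  # normalize Unicode forms
    text = re.sub(r"[ \t]+", " ", text)         # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)      # trim excessive blank lines
    return text.strip()
```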
Teraflop AI’s data engine allows for the massively parallel processing of web-scale datasets into cleaned text form. Our one-click deployment allowed us to easily split the computation across thousands of nodes on our managed infrastructure.
Thank you to @nomic_ai for providing us with Atlas research credits to store and visualize each of the jurisdictions in this dataset. You can view a Nomic Atlas map of New York state court decisions here: atlas.nomic.ai/data/teraflop-…
You can access the New York jurisdiction map and all of the other @nomic_ai Atlas maps on @huggingface here: huggingface.co/spaces/Teraflo…
Nomic’s Atlas projection algorithm clusters semantically similar data together, generating a topic hierarchy. You can find more information about @nomic_ai and Atlas here: docs.nomic.ai/atlas/capabili…
@nomic_ai released nomic-embed-text-v1.5, an open-source, 8192 context text embedding model. The embeddings for the Atlas maps are generated by this model. You can find more information about the model release here:
You can find the detailed research paper of the methodologies used by @zach_nussbaum, @andriy_mulyar, @jxmnop, and Brandon Duderstadt for the nomic-embed-text-v1.5 model here: static.nomic.ai/reports/2024_N…
The nomic-embed-text-v1.5 model is widely accessible on @huggingface. The model card provides training, usage, and benchmark information about the model. huggingface.co/nomic-ai/nomic…
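For reference, a minimal sketch of embedding text with nomic-embed-text-v1.5 via Sentence Transformers, based on the model card's documented usage (task prefixes such as "search_document: " are expected on every input):

```python
from sentence_transformers import SentenceTransformer

# The model ships custom modeling code, hence trust_remote_code=True.
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# nomic-embed expects a task prefix on each input.
docs = ["search_document: The court held that the claim was time-barred."]
embeddings = model.encode(docs)
print(embeddings.shape)
```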
@nomic_ai provides a library for training embedding models and reproducing the results from the research paper. The @GitHub repository can be found here: github.com/nomic-ai/contr…
We additionally provide bge-base-en-v1.5 embeddings for the first 512 tokens of each state-jurisdiction and federal case-law document, alongside the post-processed documents. Mean pooling and normalization were used for the embeddings:
huggingface.co/datasets/Teraf…
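A minimal sketch of computing such embeddings with mean pooling and L2 normalization over the first 512 tokens (the example text is made up):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-base-en-v1.5")
model = AutoModel.from_pretrained("BAAI/bge-base-en-v1.5").eval()

texts = ["The defendant appealed the judgment of the lower court."]
# Truncate to the first 512 tokens, matching the released embeddings.
batch = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state          # (batch, seq, dim)

# Mean pooling over non-padding tokens, then L2 normalization.
mask = batch["attention_mask"].unsqueeze(-1).float()
pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(pooled, p=2, dim=1)
print(embeddings.shape)  # (1, 768)
```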
We used the Sentence Transformers library maintained by @tomaarsen of @huggingface to distribute the embedding process across multiple GPUs. You can find an example of how to use multiprocessing for embeddings here: github.com/UKPLab/sentenc…
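A short sketch of the multi-GPU encoding pattern from the Sentence Transformers docs (the sentences here are placeholders):

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    model = SentenceTransformer("BAAI/bge-base-en-v1.5")
    sentences = ["First case text...", "Second case text..."]  # placeholder documents

    # Starts one worker process per visible GPU (or pass target_devices=[...]).
    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool, batch_size=64)
    model.stop_multi_process_pool(pool)

    print(embeddings.shape)
```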
We improved the inference throughput of the embedding process by using @tri_dao’s Flash Attention. You can find the Flash Attention repository here: github.com/Dao-AILab/flas…
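With Hugging Face Transformers, FlashAttention-2 can usually be enabled at load time as shown below; whether a particular encoder supports this backend depends on your transformers version, so treat this as a sketch rather than the exact setup used here.

```python
import torch
from transformers import AutoModel

# Requires `pip install flash-attn` and an Ampere-or-newer GPU.
model = AutoModel.from_pretrained(
    "BAAI/bge-base-en-v1.5",
    torch_dtype=torch.float16,                # FlashAttention kernels run in fp16/bf16
    attn_implementation="flash_attention_2",  # fall back to "sdpa" if unsupported
).to("cuda")
```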
You can read the research paper on the BGE embedding models by Shitao Xiao and @zzzheng_liu here: arxiv.org/pdf/2309.07597…
The code for training BGE embedding models and other great research efforts can be found on @GitHub here: github.com/FlagOpen/FlagE…
All of the datasets used to train the BGE embedding models are available here: data.baai.ac.cn/details/BAAI-M…
The bge-base-en-v1.5 model weights are available on @huggingface. The model card provides news, a list of other available models, training, usage, and benchmark information. huggingface.co/BAAI/bge-base-…
We built a FAISS index over all of the post-processed legal texts using the BGE embeddings. The index consists of ~6.6 million dense vectors, and the average search time for a query over the entire index is 12.46 milliseconds. huggingface.co/datasets/Teraf…
The FAISS library by @Meta allows you to perform efficient, scalable k-nearest-neighbor search over millions of dense vectors. You can find the FAISS library here: github.com/facebookresear…
The combination of an Inverted File Index (IVF), Product Quantization (PQ), and Hierarchical Navigable Small World (HNSW) graphs allows us to run these queries across all of the dense vectors in milliseconds. You can find more information here: github.com/facebookresear…
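The index parameters used for the released artifact are not stated in the thread, so the values below are placeholders; the sketch only shows how an IVF + HNSW + PQ index is composed and queried with the FAISS index factory.

```python
import faiss
import numpy as np

d = 768  # bge-base-en-v1.5 embedding dimension

# IVF coarse quantizer built on an HNSW graph, with PQ-compressed codes.
# nlist (4096) and the PQ size (64 sub-quantizers) are placeholder values.
index = faiss.index_factory(d, "IVF4096_HNSW32,PQ64")

train_vecs = np.random.rand(200_000, d).astype("float32")  # stand-in for real embeddings
index.train(train_vecs)
index.add(train_vecs)

index.nprobe = 32  # number of IVF lists to visit per query
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)  # top-5 nearest neighbors
print(ids)
```

Since the embeddings are L2-normalized, L2 distance and cosine similarity produce the same ranking, so the factory's default metric works for this kind of search.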
All of the information detailed here can be found in this post: teraflop-ai.notion.site/Caselaw-Access…
You can watch a live stream of the release here: lil.law.harvard.edu/about/cap-cele…
Thank you to @ShayneRedford and @RobertMahari of the MIT @medialab and Data Provenance Initiative for helping us make this connection. Please check out the post on DPI here:
You can also review the research paper on DPI here: arxiv.org/pdf/2310.16787…
Additionally, a big thank you to @jonbtow of @StabilityAI, @barry_zyj, @samcwl, @AiEleuther, Daniel Chang, and the many others who have been supportive over these last months.
We plan to release trillions of commercially licensed text tokens, images, audio, videos, and other datasets spanning numerous domains and modalities over the next months. Be sure to follow us or reach out if you require help collecting and processing data at the petabyte scale.

More from @EnricoShippole

Aug 31, 2023
Releasing Yarn-Llama-2-13b-128k, a Llama-2 model, trained for 128k context length using YaRN scaling. The model was trained in collaboration with u/bloc97 and @theemozilla of @NousResearch and @Void13950782 of @AiEleuther.
The model can be found on @huggingface here: huggingface.co/conceptofmind/…
We worked to extend the context length of the Llama-2 13b and 7b models through fine-tuning. The model passes all our evaluations and maintains the same perplexity at 128k extrapolation, surpassing the performance of our other recent methodology, NTK-by-parts scaling.
Jul 24, 2023
Releasing LLongMA-2 13b, a Llama-2 model, trained at 8k context length using linear positional interpolation scaling. The model was trained in collaboration with @theemozilla of @NousResearch and @kaiokendev1.
The model can be found on @huggingface here: huggingface.co/conceptofmind/…
We worked directly with @kaiokendev1 to extend the context length of the Llama-2 13b model through fine-tuning. The model passes all our evaluations and maintains the same perplexity at 8k extrapolation, surpassing the performance of other recent methodologies.
Jul 20, 2023
Releasing LLongMA-2, a suite of Llama-2 models, trained at 8k context length using linear positional interpolation scaling. The model was trained in collaboration with @theemozilla of @NousResearch and @kaiokendev1. huggingface.co/conceptofmind/…
We worked directly with @kaiokendev1 to extend the context length of the Llama-2 7b model through fine-tuning. The models pass all our evaluations and maintain the same perplexity at 8k extrapolation, surpassing the performance of other recent methodologies.
The model performs similarly to LLaMA 2 below 4k context length, scales directly to 8k, and works out-of-the-box with the new version of transformers (4.31) or with `trust_remote_code` for <= 4.30.
May 25, 2023
Introducing an open-source reproduction of the FLAN V2 dataset. huggingface.co/datasets/conce…
I worked with @ShayneRedford, the main author of the FLAN collection, to recreate his great work and publicly release high-quality instruction-tuning data. We fixed encoding issues and also increased the sequence length to 4096.
Our work on an open reproduction of FLAN V2 and related projects is all thanks to the generous sponsorship by @carperai and @StabilityAI.

A big thank you to @zhansheng and @fabmilo for helping build the dataset as well.
May 8, 2023
Introducing three new open-source PaLM models trained at a context length of 8k on C4. Open-sourcing LLMs is a necessity for the fair and equitable democratization of AI. The models of sizes 150m, 410m, and 1b are available to download and use here: github.com/conceptofmind/…
The models are also compatible with many of lucidrains' popular repositories such as Toolformer-pytorch, PaLM-rlhf-pytorch, and PaLM-pytorch. Please be sure to sponsor and help support Phil's great work: github.com/lucidrains/PaL…
Our work on Toolformer, PaLM, and related projects is all thanks to the generous sponsorship by @carperai and @StabilityAI.

A big thank you to @dmayhem93, @jonbtow, Aman, and @zach_nussbaum as well for providing input on the @huggingface library.
