LLM DATASET

lightsong / 2024-09-21 / original post

Sources of LLM capabilities

https://arxiv.org/pdf/2402.18041

Sources for LLM compliance

https://arxiv.org/html/2402.12193v2

Sources for crime detection with large models

https://www.kaggle.com/datasets/odins0n/ucf-crime-dataset/data

Code and math

https://github.com/mlabonne/llm-datasets

Math & Logic

LLMs often struggle with mathematical reasoning and formal logic, which has led to the creation of specialized datasets. These datasets extend beyond pure mathematics, encompassing a wide range of problems that require systematic thinking and step-by-step reasoning, ultimately enabling LLMs to tackle complex real-world challenges that involve logical deduction and quantitative analysis.

Dataset | # | Authors | Date | Notes
OpenMathInstruct-1 | 5.75M | Toshniwal et al. | Feb 2024 | Problems from GSM8K and MATH, with solutions generated by Mixtral-8x7B.
MetaMathQA | 395k | Yu et al. | Dec 2023 | Bootstraps mathematical questions by rewriting them from multiple perspectives. See MetaMath paper.
MathInstruct | 262k | Yue et al. | Sep 2023 | Compiled from 13 math rationale datasets, six of them newly curated; focuses on chain-of-thought and program-of-thought rationales.
Orca-Math | 200k | Mitra et al. | Feb 2024 | Grade school math word problems generated using GPT-4 Turbo. See Orca-Math paper.
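
To make the "program-of-thought" idea in the MathInstruct row concrete: such a rationale is a small program whose execution yields the answer, as opposed to a prose chain of reasoning. The following is a minimal sketch, not the format of any dataset listed above. The sample problem, the field names (`question`, `program`, `reference`), and the helper names are all hypothetical and chosen only for illustration.

```python
def run_program_of_thought(program: str, result_var: str = "answer"):
    """Execute a program-style rationale and return the value of `result_var`.

    Builtins are removed as a toy precaution only; this is NOT a real sandbox.
    """
    namespace: dict = {}
    exec(program, {"__builtins__": {}}, namespace)
    return namespace.get(result_var)


# Hypothetical sample, invented for illustration (not taken from any dataset).
sample = {
    "question": "A box holds 12 pencils. How many pencils are in 7 boxes "
                "after 5 pencils are removed?",
    "program": (
        "pencils_per_box = 12\n"
        "boxes = 7\n"
        "removed = 5\n"
        "answer = pencils_per_box * boxes - removed\n"
    ),
    "reference": 79,
}


def is_consistent(item: dict) -> bool:
    """Check that executing the rationale reproduces the reference answer."""
    return run_program_of_thought(item["program"]) == item["reference"]


print(is_consistent(sample))
```

A check like this is one plausible way to discard generated rationales whose code does not reproduce the known answer; whether any dataset above applies exactly this check is not stated in the source.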

Code

Code is another challenging domain for LLMs that lack specialized pre-training. Code datasets, containing diverse programming language examples, are used to fine-tune LLMs and enhance their ability to understand, generate, and analyze code, enabling them to serve as effective coding assistants.

Dataset | # | Authors | Date | Notes
CodeFeedback-Filtered-Instruction | 157k | Zheng et al. | Feb 2024 | Filtered version of Magicoder-OSS-Instruct, ShareGPT (Python), Magicoder-Evol-Instruct, and Evol-Instruct-Code.
Tested-143k-Python-Alpaca | 143k | Vezora | Mar 2024 | Collection of generated Python code that passed automatic tests to ensure high quality.
glaive-code-assistant | 136k | Glaive.ai | Sep 2023 | Synthetic data of problems and solutions with ~60% Python samples. Also see the v2 version.
Magicoder-Evol-Instruct-110K | 110k | Wei et al. | Nov 2023 | A decontaminated version of evol-codealpaca-v1. Decontamination is done in the same way as StarCoder (bigcode decontamination process). See Magicoder paper.
dolphin-coder | 109k | Eric Hartford | Nov 2023 | Dataset transformed from leetcode-rosetta.
synthetic_text_to_sql | 100k | Gretel.ai | Apr 2024 | Synthetic text-to-SQL samples (~23M tokens), covering diverse domains.
sql-create-context | 78.6k | b-mc2 | Apr 2023 | Cleansed and augmented version of the WikiSQL and Spider datasets.
Magicoder-OSS-Instruct-75K | 75k | Wei et al. | Nov 2023 | OSS-Instruct dataset generated by gpt-3.5-turbo-1106. See Magicoder paper.
Code-Feedback | 66.4k | Zheng et al. | Feb 2024 | Diverse Code Interpreter-like dataset with multi-turn dialogues and interleaved text and code responses. See OpenCodeInterpreter paper.
Open-Critic-GPT | 55.1k | Vezora | Jul 2024 | Uses a local model to create, introduce, and identify bugs in code across multiple programming languages.
self-oss-instruct-sc2-exec-filter-50k | 50.7k | Lozhkov et al. | Apr 2024 | Created in three steps: seed functions from TheStack v1, self-instruction with StarCoder2, and self-validation. See the blog post.
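
Several rows above mention keeping only code that passed automatic tests or self-validation. The sketch below shows one way such an execution filter could look. It is an assumption-laden illustration, not the pipeline used by any of these datasets: the sample records, the field names (`solution`, `tests`), and the function names are hypothetical, and a real pipeline would need proper sandboxing before running untrusted generated code.

```python
import subprocess
import sys


def passes_tests(solution: str, tests: str, timeout_s: float = 5.0) -> bool:
    """Run a candidate solution followed by its tests in a fresh interpreter.

    Returns True only if the process exits with code 0 within the timeout.
    """
    script = solution + "\n\n" + tests
    try:
        result = subprocess.run(
            [sys.executable, "-c", script],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def filter_samples(samples: list[dict]) -> list[dict]:
    """Keep only the samples whose solution passes its own tests."""
    return [s for s in samples if passes_tests(s["solution"], s["tests"])]


# Hypothetical generated samples: the second one contains a deliberate bug.
samples = [
    {"solution": "def add(a, b):\n    return a + b", "tests": "assert add(2, 3) == 5"},
    {"solution": "def add(a, b):\n    return a - b", "tests": "assert add(2, 3) == 5"},
]

kept = filter_samples(samples)
print(len(kept), "of", len(samples), "samples kept")
```

Running each sample in a separate process keeps a crashing or hanging candidate from taking down the filter itself, at the cost of interpreter start-up time per sample.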