LLM4RE Datasets - Analytics Dashboard

Dataset Distribution by Domain

Dataset Distribution by Task

Dataset Distribution by Artifact Type

Dataset Distribution by RE Stage

Dataset Size Distribution

Language Distribution

Dataset Publication Timeline

License Distribution

Granularity Distribution

RE Stage vs Task Distribution

Key Insights for LLM4RE Tasks

Domain Coverage: The collection spans multiple domains; aerospace and healthcare are the most represented.

Task Diversity: Classification tasks dominate the collection, representing over 40% of all datasets.

Data Granularity: The collection includes datasets at multiple granularity levels, with document-level and sentence-level datasets the most common.

Language Support: English dominates with over 80% of datasets, but multilingual datasets, covering Chinese and other languages, are increasingly represented.

Size Distribution: Dataset sizes vary significantly, from small datasets with under 100 items to large-scale collections with over 400,000 items, providing diverse training scenarios for LLMs.

Temporal Trends: The collection shows increasing activity in recent years, with over 60% of datasets published in the last 3 years, reflecting growing interest in LLM4RE research.

RE Stage Coverage: Datasets cover all major RE stages, with particular emphasis on the analysis and verification stages.

Openness: Over 70% of datasets are available under open licenses, promoting reproducibility and collaborative research in the LLM4RE community.
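
The percentages quoted in the insights above are simple shares over the dataset catalog. The sketch below shows one way such shares could be computed. It is not the dashboard's actual code: the record fields (`task`, `language`, `year`, `license`), the sample rows, the 2022 cutoff, and the rule that a dataset counts as "open" when it has any license recorded are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical catalog records. Field names and values are illustrative only
# and do not reproduce the real LLM4RE dataset collection.
CATALOG = [
    {"task": "classification", "language": "English", "year": 2023, "license": "CC-BY-4.0"},
    {"task": "classification", "language": "English", "year": 2021, "license": "MIT"},
    {"task": "traceability", "language": "Chinese", "year": 2024, "license": None},
    {"task": "extraction", "language": "English", "year": 2019, "license": "CC-BY-4.0"},
]


def share_by(records, field):
    """Return {value: percentage of records} for one metadata field."""
    counts = Counter(r[field] for r in records)
    total = len(records)
    return {value: 100.0 * n / total for value, n in counts.items()}


def share_where(records, predicate):
    """Return the percentage of records for which predicate(record) is true."""
    if not records:
        return 0.0
    matching = sum(1 for r in records if predicate(r))
    return 100.0 * matching / len(records)


# Share of datasets per task (the kind of figure behind "Task Diversity").
task_shares = share_by(CATALOG, "task")

# Share published from an assumed cutoff year onward ("Temporal Trends").
recent_share = share_where(CATALOG, lambda r: r["year"] >= 2022)

# Share with any license recorded, used here as a stand-in for "open" ("Openness").
open_share = share_where(CATALOG, lambda r: r["license"] is not None)

print(task_shares, recent_share, open_share)
```

A real catalog would need a stricter definition of "open" (for example, an allow-list of license identifiers) and a cutoff year derived from the current date rather than a constant.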
