[Charts, not reproduced here: Dataset Distribution by Domain, by Task, by Artifact Type, and by RE Stage; Dataset Size Distribution; Language Distribution; Dataset Publication Timeline; License Distribution; Granularity Distribution; RE Stage vs Task Distribution]
Key Insights for LLM4RE Tasks
Domain Coverage: The collection spans multiple domains; aerospace and healthcare are the most represented.
Task Diversity: Classification is the most common task, accounting for over 40% of all datasets.
Data Granularity: The collection includes datasets at multiple granularity levels; document-level and sentence-level are the most common.
Language Support: English dominates, accounting for over 80% of datasets, but multilingual coverage is growing, with datasets in Chinese and other languages.
Size Distribution: Dataset sizes vary significantly, from small datasets with under 100 items to large-scale collections with over 400,000 items, providing diverse training scenarios for LLMs.
Temporal Trends: The collection shows increasing activity in recent years, with over 60% of datasets published in the last 3 years, reflecting growing interest in LLM4RE research.
RE Stage Coverage: Datasets cover all major RE stages, with particular emphasis on the analysis and verification stages.
Openness: Over 70% of datasets are available under open licenses, promoting reproducibility and collaborative research in the LLM4RE community.
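
Percentages like those above can be derived from a catalog of dataset metadata. The following is a minimal sketch under stated assumptions: the record fields (`task`, `language`, `license`), the sample rows, and the `share_by` helper are hypothetical illustrations, not the collection's actual schema, data, or tooling.

```python
from collections import Counter


def share_by(records, field):
    """Return {value: fraction of records} for one metadata field,
    ordered from most to least common. Empty input yields {}."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: n / total for value, n in counts.most_common()}


# Hypothetical catalog rows, for illustration only (not real data).
catalog = [
    {"name": "ds_a", "task": "classification", "language": "English", "license": "MIT"},
    {"name": "ds_b", "task": "classification", "language": "English", "license": "CC-BY-4.0"},
    {"name": "ds_c", "task": "traceability", "language": "Chinese", "license": "unknown"},
    {"name": "ds_d", "task": "extraction", "language": "English", "license": "MIT"},
]

task_shares = share_by(catalog, "task")
language_shares = share_by(catalog, "language")

for task, share in task_shares.items():
    print(f"{task}: {share:.0%}")
```

The same helper can be applied to any categorical field (domain, RE stage, granularity), so each distribution is computed the same way from one table.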