Robuta

https://arxiv.org/abs/2111.02114?utm_campaign=The%20Batch&utm_source=hs_email&utm_medium=email&_hsenc=p2ANqtz-_SfDl09SZ_4IA1UMOY8ec-lV6bVDd6OPccTJTCoKiUIGXIdsMrYEv0YUO0B9VWQMY1uDXB
Abstract page for arXiv paper 2111.02114: LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs
open dataset, laion, clip, filtered
https://arxiv.org/html/2512.12206v1
open dataset, vision transformer, alert, input, size
https://aclanthology.org/2020.emnlp-main.520/
Niket Tandon, Keisuke Sakaguchi, Bhavana Dalvi, Dheeraj Rajagopal, Peter Clark, Michal Guerquin, Kyle Richardson, Eduard Hovy. Proceedings of the 2020...
open domain, dataset, tracking, entities, procedural
https://zenodo.org/records/5794107
Dataset present on Wikidata used in the frame of the LOTUS Initiative: https://doi.org/10.1101/2021.02.28.433265
natural products research, lotus, initiative, open, frozen
https://www.tensorflow.org/datasets/catalog/waymo_open_dataset?authuser=1&hl=fa
open dataset, tensorflow datasets, waymo
https://www.analyticsvidhya.com/blog/2018/07/facebook-open-sources-dataset-on-nlp-and-navigation-every-data-scientist-should-download/
Facebook's AI team has released an open source dataset, called "Talk the Walk", that combines NLP with navigation data. Get your hands on it now and start...
open sources, facebook, dataset, nlp, navigation
https://zenodo.org/records/999353
This presentation about the motivations and challenges of opening a large dataset was given at the Open Science in Practice Summer School,...
open science, opening, large, audio, dataset
https://speakerdeck.com/mehdidc/laion-5b-an-open-large-scale-dataset-for-training-next-generation-image-text-models
large scale, laion, open, dataset, training
https://www.stlouis-mo.gov/data/datasets/distribution.cfm?id=75
Single dataset distribution detail view
open data, dataset, distribution, found
https://video.linux.it/videos/watch/30f2d678-c2e2-4b7e-9486-8e3f44398dd3
MERGE-it 2021. Saturday, 22 May 2021. Open data, licenses and formats, datasets. Developers Italia, LibreItalia, onData, OpenStreetMap, Wikimedia Italia.
open data, merge, licenses, formats, dataset
https://github.com/ammolytics/cartridges
An open-source dataset of ammunition cartridge information from SAAMI, CIP, and NATO STANAG - ammolytics/cartridges
open source, github, cartridges, dataset, ammunition
https://arxiv.org/abs/1908.06948v2
Abstract page for arXiv paper 1908.06948v2: Deep Learning for Segmentation using an Open Large-Scale Dataset in 2D Echocardiography
deep learning, segmentation, using, open, large
https://www.nesta.org.uk/blog/top-findings-from-the-open-dataset-of-uk-makerspaces/
We've published the open dataset of UK makerspaces. What have we found and how can it be used?
open dataset, top, findings, uk, makerspaces
https://huggingface.co/papers/2310.06786
open dataset, high quality, paper, mathematical
https://openreview.net/forum?id=itBDglVylS&referrer=%5Bthe%20profile%20of%20Siddharth%20Garg%5D(%2Fprofile%3Fid%3D~Siddharth_Garg1)
Large Language Models (LLMs) are being deployed across various domains today. However, their capacity to solve Capture the Flag (CTF) challenges in...
open source, nyu, ctf, bench, scalable
https://arxiv.org/abs/2402.19282
Abstract page for arXiv paper 2402.19282: WanJuan-CC: A Safe and High-Quality Open-sourced English Webtext Dataset
high quality, cc, safe, open
https://deepai.org/publication/b2rl-an-open-source-dataset-for-building-batch-reinforcement-learning
09/30/22 - Batch reinforcement learning (BRL) is an emerging research area in the RL community. It learns exclusively from static datasets (i...
open source, reinforcement learning, dataset, building, batch
https://deepai.org/publication/open-challenges-in-deep-stereo-the-booster-dataset
06/09/22 - We present a novel high-resolution and challenging stereo dataset framing indoor scenes annotated with dense and accurate ground-t...
open challenges, deep, stereo, booster, dataset
https://nestwatch.org/explore-data/nestwatch-open-dataset-downloads/
Jul 10, 2025 - Those seeking to conduct formal analyses using the NestWatch Open Dataset are encouraged to download the raw data available on this page. Note that raw data...
open dataset downloads, nestwatch
https://openreview.net/forum?id=M3Y74vmsMcY&referrer=%5Bthe%20profile%20of%20Robert%20Kaczmarczyk%5D(%2Fprofile%3Fid%3D~Robert_Kaczmarczyk1)
We present LAION-5B, an open, publicly available dataset of 5.8B image-text pairs and validate it by reproducing results of training state-of-the-art CLIP...
large scale, laion, open, dataset, training
https://pubmed.ncbi.nlm.nih.gov/36543828/
Anonymization has the potential to foster the sharing of medical data. State-of-the-art methods use mathematical models to modify data to reduce privacy risks....
statistical, biases, due, anonymization, evaluated
https://www.stlouis-mo.gov/data/datasets/dataset.cfm
Single dataset detail view
open data, dataset, found
https://zenodo.org/records/13218780
Synthetic Lunar Terrain (SLT) is a dataset based on a reconstruction of a typical cratered lunar surface landscape at the EXTERRES Laboratory at University of...
lunar terrain, open dataset, synthetic, multimodal, training
https://zenodo.org/records/4624805
Dataset comprising data from five day-ahead electricity markets: Nord pool: The Nord pool day-ahead electricity market, one of the largest European power...
open access, electricity prices, benchmark, dataset, day