Short version if you only want to see the dataset:

* Navigate here for images and render traces: http://cf-ipfs.com/ipfs/QmdMjH7EsHdd4gGgCUnssDWndf54rVXQANvaSZFnhp5Tnw
  This will be extremely slow, and it may often time out. That's the Cloudflare gateway, which has no interest in speeding up access to my servers. To navigate, do NOT click on the name of the folder you want to enter. Click on the gray link that starts with "Qm" instead. If you install IPFS and connect directly to my servers, you'll be able to navigate much faster and more normally.
* You can find the metadata and /mlp/ comments here: https://drive.google.com/drive/folders/1v-qOV0jUKNKdyxJzunHehMMMeln1d36f?usp=sharing. These are parquet files. You can load them using pandas in Python like this:

```
import pandas

# read_parquet needs either the pyarrow or the fastparquet package installed
dataframe = pandas.read_parquet('/path/to/parquet/file')
```

###### Introduction

This paste explains how to access datasets hosted on IPFS. IPFS is like a torrenting protocol, but it supports many features that make it better suited for hosting large amounts of data.

As of Jun 25, 2022, the following data is available on IPFS:

* derpibooru images - 3,520 GB
* /mlp/ images - 930 GB
* animation render traces - 141 GB

If you want to download only a few examples from each, you can do so easily by navigating the relevant directory in IPFS (instructions below). If you're hoping to download special, large subsets of the data, I'm planning to develop the tools for that in the near future (1-2 months). If you're interested or want to help out, post in the PPP thread.

###### Installing IPFS

You'll want to install two things:

- The IPFS command line utility (executable): https://docs.ipfs.io/install/command-line/
- IPFS Companion (browser extension): https://docs.ipfs.io/install/ipfs-companion/

There's no special installation process for the executable. You can keep it wherever it's convenient. The command line utility is what you'll use to manage your IPFS node and its connections to the network. IPFS Companion is what you'll use to browse files on IPFS. You can technically use the command line to browse files, but it's much more tedious.

Once you've downloaded the IPFS command line utility, open up a terminal or command prompt and run the following command to initialize its datastore:

~~~
ipfs init --profile=badgerds
~~~

This will set up an `.ipfs` directory in your home folder. Any data managed by IPFS will be kept there.

###### Running the IPFS daemon

IPFS requires a daemon running in the background. The daemon is the interface both to your local IPFS datastore and to the rest of the IPFS network, and most commands require it. You can start the daemon with the following command:

~~~
ipfs daemon
~~~

Once you see the `Daemon is ready` message, you'll know it's running. You can kill this process however you normally would on your OS. On Linux, it's Ctrl+C. It's probably Ctrl+C or Ctrl+Z on Windows.
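If you want to check from a script that the daemon is actually up, you can ask it directly. This is a minimal sketch, assuming the daemon is listening on its default API address (127.0.0.1, port 5001) and that you have the third-party `requests` package installed; `ipfs id` on the command line gives you the same information.

```
import requests  # third-party; pip install requests

# The daemon exposes an HTTP API, by default on 127.0.0.1:5001.
# /api/v0/id returns this node's identity; a valid JSON response
# means the daemon is up. Note the API only accepts POST requests.
response = requests.post("http://127.0.0.1:5001/api/v0/id", timeout=5)
response.raise_for_status()

print("Daemon is ready. Peer ID:", response.json()["ID"])
```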
If you're only using IPFS to access my servers, I recommend running the following OPTIONAL commands to get IPFS to check only my servers for data. This will let IPFS find data faster by telling it not to search the public IPFS network.

```
ipfs bootstrap rm --all
ipfs bootstrap add /ip4/135.181.60.95/tcp/4001/p2p/12D3KooWLMr455Va1fH5XxX8EJXHJFQaSgMyaU2YzSryzV8ujBaX
ipfs bootstrap add /ip4/176.9.11.137/tcp/4002/p2p/12D3KooWMHxW4x1Dp3rjXf3UxKpH9u7XTgBfu5gzCCdyMWjHkBCg
ipfs bootstrap add /ip4/176.9.11.137/tcp/4003/p2p/12D3KooWNJCmwFGFNZGzeCeWxasrckMmJxfLEqrG6AxzSPJ4NSWd
```

Restart the daemon for the changes to take effect. You can switch back to the public IPFS network by running the following commands and restarting the daemon again:

```
ipfs bootstrap rm --all
ipfs bootstrap add --default
```

###### Connecting to my servers

IPFS propagates requests throughout its network to try to find content. If you're using the default bootstrap list (i.e., if you did NOT follow the optional step above to modify it), this can be a slow process. You can speed it up by having your IPFS daemon temporarily connect directly to my servers. Since those servers host the datasets, you'll get faster responses when you're looking for content related to them. You can connect to my three servers with these commands:

```
ipfs swarm peering add /ip4/135.181.60.95/tcp/4001/p2p/12D3KooWLMr455Va1fH5XxX8EJXHJFQaSgMyaU2YzSryzV8ujBaX
ipfs swarm peering add /ip4/176.9.11.137/tcp/4002/p2p/12D3KooWMHxW4x1Dp3rjXf3UxKpH9u7XTgBfu5gzCCdyMWjHkBCg
ipfs swarm peering add /ip4/176.9.11.137/tcp/4003/p2p/12D3KooWNJCmwFGFNZGzeCeWxasrckMmJxfLEqrG6AxzSPJ4NSWd
```

You should see a success message after each command. If it worked, you should now be able to access my dataset through IPFS Companion by navigating to this URL in your browser: `ipfs://QmdMjH7EsHdd4gGgCUnssDWndf54rVXQANvaSZFnhp5Tnw`.

If you did not modify the bootstrap list, you'll need to run these `ipfs swarm peering add` commands every time you restart the IPFS daemon.

###### Downloading data

Next to any file you see in IPFS Companion, you'll see a gray link. Usually it will start with the letters `Qm`, but not always. If you copy the link, you'll get something like this:

* `http://localhost:8080/ipfs/QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb?filename=animation`

In this link, `QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb` is the content id. You can use the content id to download the file from the command line:

```
ipfs pin add QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb
ipfs get QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb
```

If you only want to download the files, you can skip the `pin add` command. Pinning is useful if you want to browse the files in IPFS Companion privately (without contacting my server), if you want to mess around with the files in IPFS, or if you want to seed the data to other anons.
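If you'd rather script your downloads than copy content ids into a terminal one at a time, you can also fetch files through the daemon's local HTTP gateway, the same `localhost:8080` address the links above point at. This is a minimal sketch, assuming the daemon is running with its default gateway port (8080), the `requests` package is installed, and the content id points to a single file; the gateway returns an HTML listing for directories, so use `ipfs get` for those.

```
import requests  # third-party; pip install requests

# Content id copied from IPFS Companion; replace with the file you want.
cid = "QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb"

# The daemon serves a local HTTP gateway on 127.0.0.1:8080 by default.
# Stream the response so multi-GB files aren't held in memory.
url = f"http://127.0.0.1:8080/ipfs/{cid}"
with requests.get(url, stream=True, timeout=60) as response:
    response.raise_for_status()
    with open(cid, "wb") as out_file:  # saved under the cid as filename
        for chunk in response.iter_content(chunk_size=1 << 20):
            out_file.write(chunk)
```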
###### Known issues

1. This setup process is a pain in the ass, especially for non-developers. I wrote a downloader that makes this easier, but it's not that good right now.
2. Cloudflare's gateway isn't good for people who aren't Cloudflare customers, which I'm not. I plan to turn my servers into gateways for pony data. Once I do, you won't need to set up IPFS to browse the data at a reasonable speed.
3. IPFS files are immutable. If I make changes to the dataset, which I will, the Qm hashes may become unavailable. IPNS fixes this issue, but it'll take time to find a good setup.
4. The dataset organization is a mess. It works very well for my servers, but it's not good for people who want to browse or download the data. I can fix this once I learn to use IPFS's UnixFS, which I haven't done yet.
5. Some of the derpibooru images are named by image id, and some of them are named by sha512 hash. This will be cleaned up automatically once I fix the dataset organization.
6. The downloads are a lot slower than what the servers can handle. That's because the servers use an IPv4 connection, whereas the bandwidth caps on IPv6 are much higher. This is just a networking issue I'm having with Docker. I plan to fix it soon, and the connection info above (in the `ipfs bootstrap add` and `ipfs swarm peering add` commands) will change when I do.
7. I haven't uploaded the XFL files yet. Those are another 500 GB. I'll need to compress them first, and even then I might need to get another disk for the server to host them. Ditto for the /mlp/ thumbnail images.
8. I have another metadata file for linking together images across datasets, which I haven't uploaded yet. I need to update it based on images from my latest scrape, after which I'll upload it to Google Drive alongside the rest of the metadata.