Accessing IPFS datasets

By Synthbot
Created: 2022-06-24 06:07:16
Updated: 2022-07-02 12:52:29
Expiry: Never

Short version if you only want to see the dataset: browse it through Cloudflare's public IPFS gateway.

The gateway will be extremely slow, and it may often time out. That's the Cloudflare gateway at work; it has no interest in speeding up access to my servers. To navigate, do NOT click on the name of the folder you want to enter. Click on the gray link that starts with "Qm" instead. If you install IPFS and connect directly to my servers, you'll be able to navigate much faster and more normally.

You can find the metadata and /mlp/ comments here: https://drive.google.com/drive/folders/1v-qOV0jUKNKdyxJzunHehMMMeln1d36f?usp=sharing. These are parquet files. You can load them using pandas in Python like this:

import pandas

# read_parquet needs either the pyarrow or fastparquet package installed.
dataframe = pandas.read_parquet('/path/to/parquet/file')
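
Each call loads one file into one dataframe. If you'd rather have everything in a single dataframe, a sketch like this should work; it assumes the files all end in .parquet, sit in one folder, and share a schema:

import pathlib
import pandas

# Load every parquet file in the folder and stack the results into one dataframe.
folder = pathlib.Path('/path/to/parquet/folder')
frames = [pandas.read_parquet(path) for path in sorted(folder.glob('*.parquet'))]
dataframe = pandas.concat(frames, ignore_index=True)
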
Introduction

This paste explains how to access datasets hosted on IPFS. IPFS is like a torrenting protocol, but it supports many features that make it better suited for hosting large amounts of data. As of Jun 25, 2022, the following data is available on IPFS:

  • derpibooru images - 3,520 GB
  • /mlp/ images - 930 GB
  • animation render traces - 141 GB

If you want to download only a few examples from each, you can do that easily by navigating to the relevant directory in IPFS (instructions below). If you're hoping to download specific, large subsets of the data, I'm planning to develop the tools to do so in the near future (1-2 months). If you're interested or want to help out, post in the PPP thread.

Installing IPFS

You'll want to install two things:

  • The IPFS command line utility, which ships as a single executable.
  • The IPFS Companion browser extension.

There's no special installation process for the executable; you can keep it wherever it's convenient. The command line utility is what you'll use to manage IPFS on your machine and connect it to the network. IPFS Companion is what you'll use to browse files on IPFS. You can technically browse files from the command line too, but it's much more tedious.

Once you've downloaded the IPFS command line, open up a terminal or command prompt and run the following command to initialize its datastore:

ipfs init --profile=badgerds

This will set up a .ipfs directory in your home folder. Any data managed by IPFS will be kept there.
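
If you want to double-check where the datastore ended up, it's normally ~/.ipfs, unless the IPFS_PATH environment variable points somewhere else. A quick check in Python:

import os
import pathlib

# The repo lives at $IPFS_PATH if that variable is set, otherwise ~/.ipfs.
repo = pathlib.Path(os.environ.get('IPFS_PATH', pathlib.Path.home() / '.ipfs'))
print(repo, '- exists' if repo.exists() else '- missing')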

Running the IPFS daemon

IPFS requires a daemon to be running in the background. The daemon is the interface to your local IPFS datastore and to the rest of the IPFS network. Most commands require the IPFS daemon. You can start the daemon with the following command:

ipfs daemon

Once you see the "Daemon is ready" message, you'll know it's running. You can kill the process however you normally would on your OS. On Linux, that's Ctrl+C; on Windows, it's probably Ctrl+C or Ctrl+Z as well.
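
If you'd rather check on the daemon from a script than from the terminal, it serves an RPC API on localhost (port 5001 by default). A minimal sketch using Python's requests library; note the API only accepts POST requests:

import requests

# Ask the local daemon for its identity; this only works while the daemon is running.
# Assumes the default API port (5001).
response = requests.post('http://127.0.0.1:5001/api/v0/id')
response.raise_for_status()
print(response.json()['ID'])  # your node's peer id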

If you're only using IPFS to access my server, I recommend running the following OPTIONAL commands to get IPFS to only check my servers for data. This will let IPFS find data faster by telling it not to search the public IPFS network.

ipfs bootstrap rm --all
ipfs bootstrap add /ip4/135.181.60.95/tcp/4001/p2p/12D3KooWLMr455Va1fH5XxX8EJXHJFQaSgMyaU2YzSryzV8ujBaX
ipfs bootstrap add /ip4/176.9.11.137/tcp/4002/p2p/12D3KooWMHxW4x1Dp3rjXf3UxKpH9u7XTgBfu5gzCCdyMWjHkBCg
ipfs bootstrap add /ip4/176.9.11.137/tcp/4003/p2p/12D3KooWNJCmwFGFNZGzeCeWxasrckMmJxfLEqrG6AxzSPJ4NSWd

Restart the daemon for the changes to take effect. You can switch back to the public IPFS network by running the following commands and restarting the daemon again:

ipfs bootstrap rm --all
ipfs bootstrap add --default
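
To confirm which bootstrap list the daemon is actually using, you can ask it over the same RPC API (again assuming the default port 5001):

import requests

# List the bootstrap peers the daemon currently knows about.
response = requests.post('http://127.0.0.1:5001/api/v0/bootstrap/list')
response.raise_for_status()
for peer in response.json().get('Peers') or []:
    print(peer)
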
Connecting to my servers

IPFS will propagate requests throughout its network to try to find content. If you're using the default bootstrap list (i.e., if you did NOT follow the optional step above to modify the bootstrap list), this can be a slow process. You can speed it up by having your IPFS daemon directly connect to my servers temporarily. Since those servers have the datasets, you'll get faster responses when you're looking for content related to these datasets.

You can connect to my (three) servers with these commands:

ipfs swarm peering add /ip4/135.181.60.95/tcp/4001/p2p/12D3KooWLMr455Va1fH5XxX8EJXHJFQaSgMyaU2YzSryzV8ujBaX
ipfs swarm peering add /ip4/176.9.11.137/tcp/4002/p2p/12D3KooWMHxW4x1Dp3rjXf3UxKpH9u7XTgBfu5gzCCdyMWjHkBCg
ipfs swarm peering add /ip4/176.9.11.137/tcp/4003/p2p/12D3KooWNJCmwFGFNZGzeCeWxasrckMmJxfLEqrG6AxzSPJ4NSWd

You should see a success message after each command. If it worked, you should now be able to access my dataset through IPFS Companion by navigating to this URL in your browser: ipfs://QmdMjH7EsHdd4gGgCUnssDWndf54rVXQANvaSZFnhp5Tnw.

If you did not modify the bootstrap list, you'll need to run these ipfs swarm commands every time you restart the IPFS daemon.
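
You can also verify the connections from a script by listing your current peers and checking for the three server peer ids from the commands above (assumes the default API port 5001):

import requests

# Peer ids of the three servers from the swarm peering commands above.
EXPECTED = {
    '12D3KooWLMr455Va1fH5XxX8EJXHJFQaSgMyaU2YzSryzV8ujBaX',
    '12D3KooWMHxW4x1Dp3rjXf3UxKpH9u7XTgBfu5gzCCdyMWjHkBCg',
    '12D3KooWNJCmwFGFNZGzeCeWxasrckMmJxfLEqrG6AxzSPJ4NSWd',
}

response = requests.post('http://127.0.0.1:5001/api/v0/swarm/peers')
response.raise_for_status()
connected = {peer['Peer'] for peer in response.json().get('Peers') or []}
for peer_id in sorted(EXPECTED):
    print(peer_id, '- connected' if peer_id in connected else '- NOT connected')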

Downloading data

Next to every file you see in IPFS Companion, there's a gray link. It will usually start with the letters Qm, but not always. If you copy the link, you'll get something like this:

  • http://localhost:8080/ipfs/QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb?filename=animation

In this link, QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb is the content id. You can use this content id to download the file from the command line:

ipfs pin add QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb
ipfs get QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb

If you only want to download the files, you can skip the pin add command. Pinning is useful if you want to browse the files in IPFS Companion privately (without contacting my server), mess around with the files in IPFS, or seed the data to other anons.
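
Downloading also works from a script through the daemon's local gateway (the localhost:8080 links above). Here's a sketch that streams one file to disk; it assumes the content id points at a single file rather than a directory (for directories, stick with ipfs get):

import requests

# Example content id from above; swap in the file you actually want.
cid = 'QmXrFisyAvEjTUsEFzz1BEPxSS94WxXtgk413NqBjwZzXb'

# Stream through the local daemon's gateway so large files never sit in memory.
with requests.get(f'http://localhost:8080/ipfs/{cid}', stream=True) as response:
    response.raise_for_status()
    with open('downloaded_file', 'wb') as output:
        for chunk in response.iter_content(chunk_size=1 << 20):
            output.write(chunk)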

Known issues
  1. This setup process is a pain in the ass, especially for non-developers. I wrote a downloader that makes this easier, but it's not that good right now.
  2. Cloudflare's gateway performs poorly for people who aren't Cloudflare customers, and I'm not one. I plan to turn my servers into gateways for pony data. Once I do, you won't need to set up IPFS to browse the data at a reasonable speed.
  3. IPFS files are immutable. If I make changes to the dataset, which I will, the Qm hashes may become unavailable. IPNS fixes this issue, but it'll take time to find a good setup.
  4. The dataset organization is a mess. It works very well for my servers, but it's not good for people that want to browse or download the data. I can fix this once I learn to use IPFS's UnixFS, which I haven't done yet.
  5. Some of the derpibooru images are named by image id, and some of them are named by sha512 hash. This will get automatically cleaned up once I fix the dataset organization.
  6. The downloads are a lot slower than what the servers can handle. That's because the servers currently connect over IPv4, whereas the bandwidth caps on IPv6 are much higher. It's just a networking issue I'm having with Docker. I plan to fix this soon, and the bootstrap (swarm peering add) information will change when I do.
  7. I haven't uploaded the XFL files yet. Those are another 500 GB. I'll need to compress them first, and even then I might need another server disk to host them. Ditto for the /mlp/ thumbnail images.
  8. I have another metadata file for linking together images across datasets, which I haven't uploaded yet. I need to update that based on images from my latest scrape, after which I'll upload it to Google Drive alongside the rest of the metadata.
