Initial source commit

Tony Bark 2025-10-03 02:19:59 -04:00
commit f1384c11ee
335 changed files with 52715 additions and 0 deletions

November 8, 2002
Ideas for low-overhead anonymous publishing.
Freenet problem:
In a system where a data retrieval can take up to 10 network hops, redirects
are ridiculous (10 hops to find the pointer file, 10 more hops to find
the actual data). In Freenet, since SSKs are common, redirects are used
for almost every piece of data in the system.
From the start, the network should be designed to handle the most common
usage scenario (fetching the latest version of an anonymous author's content),
not to handle all sorts of other possible scenarios.
The goal of such a system should be usable anonymous publication and
retrieval. We should not focus (as Freenet does) on side issues such as
retrieval of past document versions or elimination of data redundancy.
Simplest solution:
Sign(K,D) signs data D with key K.
Encr(K,D) encrypts data D with key K.
Hash(D) produces a fixed-length hash of data D.
A | B concatenates A and B.
An anonymous author generates a private key R and a public key U.
The author wants to post a subspace file "index.html", with content C and
timestamp T.
The author generates a random symmetric key K.
The author generates the following pointer URL:
K/U/index.html
The author generates the following insertion data:
T | Encr(K,C) | Sign( U, T | Encr(K,C) )
The author generates the following insertion URL:
U/index.html
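The pointer/insertion construction above can be sketched in code. The note leaves Sign, Encr, and Hash abstract, so this toy sketch models Sign with an HMAC and Encr with a SHA-256 keystream XOR; these are stand-in assumptions, not the intended primitives, chosen only to show the byte layout T | Encr(K,C) | Sign( U, T | Encr(K,C) ):

```python
import hashlib
import hmac

# Toy stand-ins for the primitives defined above.  A real system would
# use an asymmetric signature for Sign and a proper cipher for Encr.

def Hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def Encr(key: bytes, data: bytes) -> bytes:
    # XOR against a SHA-256-derived keystream; symmetric, so Encr is
    # its own inverse (Encr(K, Encr(K, C)) == C).
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

def Sign(key: bytes, data: bytes) -> bytes:
    # HMAC stands in for an asymmetric signature here.
    return hmac.new(key, data, hashlib.sha256).digest()

def make_insertion_data(U: bytes, K: bytes, content: bytes, T: int) -> bytes:
    # Build T | Encr(K,C) | Sign( U, T | Encr(K,C) ).
    payload = T.to_bytes(8, "big") + Encr(K, content)
    return payload + Sign(U, payload)
```

A reader holding the K/U/index.html pointer would split off the trailing signature, verify it against U, read the timestamp, and decrypt the remainder with K.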
Problems: U, the author's public key, is visible in the insertion URL. Thus,
nodes can selectively block particular authors. The second problem is
that the "file name" is visible in the insertion URL, allowing nodes to
selectively block certain files.
What if the insertion URL looks like this:
Hash(U) + Hash("index.html")
Problem:
There is no way for storage nodes to verify that the inserted data actually
matches the insertion URL.
We have several reasons for assuming that we have one priv/pub key pair per
author:
1. This allows authors to build an anonymous publishing identity and
reputation, since all URLs that contain a particular U certainly correspond
to data posted by a particular author.
2. Generating priv/pub key pairs is computationally expensive, so we
want to do this as infrequently as possible.
3. Authors only need to manage one priv/pub key pair to publish in the
network.
Thus, given all of the assumptions and issues raised above, we can see why
the Freenet paradigm uses redirects.
However, we should note the following points about this paradigm:
1. Using redirects *doubles* data fetch time in the worst case.
2. Generating key pairs, though expensive, only takes a few seconds on
modern computing hardware.
3. Freenet data fetch times, in practice, take tens of seconds.
4. Each data post operation may be followed by many fetch operations for
that data.
5. Readers, in general, do not compare the public keys in URLs to judge
whether or not two pieces of content were posted by the same author. They
generally look at whether or not the two content "links" were grouped together
on the same "homepage". I.e., they assume that all data grouped together
into a "freesite" was posted by one author. (This is a fair assumption to make,
since only one author posted the main page of the freesite, so it is safe
to assume that the author chose to group the content links together.)
Dropping the assumptions about keys stated earlier, we can devise a publishing
mechanism with better properties:
An anonymous author wants to post a new file "index.html", with content C and
timestamp T.
The author generates a new private key R and a public key U.
The author generates a random symmetric key K.
The author generates the following pointer URL:
K/U
The author generates the following insertion data:
T | Encr(K,C) | Sign( U, T | Encr(K,C) )
The author generates the following insertion URL:
U
Whenever the author wants to post an updated version of this file, s/he uses
the same R/U pair, inserting new content C' with a new timestamp T'. Storage
nodes can check timestamps for key collisions and keep only the latest version
of a content unit.
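The storage node's collision rule can be sketched as follows. The class and method names are assumptions made for illustration; the note only specifies the behavior (on a key collision, keep the unit with the newest timestamp):

```python
# Minimal sketch of a storage node that keeps only the latest version
# of each content unit, keyed by the insertion URL U.

class StorageNode:
    def __init__(self):
        self.store = {}  # maps insertion URL U -> (timestamp, signed payload)

    def insert(self, U, timestamp, payload):
        existing = self.store.get(U)
        if existing is not None and existing[0] >= timestamp:
            return False  # stale or duplicate version; drop it
        self.store[U] = (timestamp, payload)
        return True

    def fetch(self, U):
        # Returns (timestamp, payload) for the latest version, or None.
        return self.store.get(U)
```

In a full implementation, insert would also verify the signature against U (property 1 above) before accepting the new version.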
This scheme has the following properties:
1. Storage nodes can verify that content matches a key by checking the
signature included with the content.
2. Storage nodes can obtain a secure timestamp for each unit of content.
3. Storage nodes do not have access to unencrypted content.
4. Storage nodes cannot tie a piece of content to any particular author.
5. Readers, using the K/U URL form, can fetch the content (using U), verify
its signature and timestamp, and decrypt it (using K).
We might observe that, with this scheme, readers have lost the ability to
associate a collection of content with a particular author, since each
unit of content is signed with a different private key.
However, we propose the following simple solution to this problem:
Authors can insert a "collection" document with links to all of their work.
This is similar to a homepage on the web.
Since an author can securely update and maintain their collection document,
readers can be sure that all of the work pointed to by the collection document
was actually collected by the same author.
Note that with *any* scheme, there is no way to guarantee authorship, so
nothing really is lost here. With Freenet, all we can know for sure is that
the same person *posted* each of a particular series of documents. In our
system, we know for sure only that the same person *linked to* each of a
particular series of documents. Since there is nothing all that sacred about
posting a document, we claim that nothing sacred is lost.
However, for each document posted, a new priv/pub key pair must be generated.
This can be computationally expensive and time-consuming for a poster. To
deal with this issue, each node can build a collection of fresh priv/pub key
pairs using spare computation cycles (or a thread with low priority) so that
fresh keys are ready whenever an author wishes to publish new content.
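The key-pool idea above can be sketched with a background worker and a bounded queue. The generate_key_pair function below is a placeholder (real key generation would run an actual asymmetric algorithm), and all names are assumptions for illustration:

```python
import os
import queue
import threading

def generate_key_pair():
    # Placeholder standing in for expensive priv/pub key generation.
    priv = os.urandom(32)
    pub = priv[::-1]  # not a real public key; illustration only
    return priv, pub

class KeyPool:
    def __init__(self, size=4):
        self.pool = queue.Queue(maxsize=size)
        # Daemon worker fills the pool in the background, approximating
        # the "spare cycles / low-priority thread" idea.
        threading.Thread(target=self._fill, daemon=True).start()

    def _fill(self):
        while True:
            self.pool.put(generate_key_pair())  # blocks when pool is full

    def take(self):
        # A fresh pair, usually ready immediately when the pool is warm.
        return self.pool.get()
```

Whenever an author publishes, take() hands over a pregenerated pair instead of paying the generation cost at publish time.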
We should note the following trade-offs:
We gain:
The ability to publish and retrieve content securely and anonymously without
using redirects, while still allowing for reputation building and anonymous
identity.
We lose:
The ability to factor redundant content out of the network.
However, the only way to factor redundant content out of the network while
still allowing for high-level pointers to content is by using redirects
(a high-level pointer that redirects you to a low-level, content hash key).

Generic peer-to-peer protocol.
White space:
All white space is equivalent (tabs, line breaks, spaces, etc.)
Two types of commands, GET and PUT.
Command form:
GET return_address id_number resource_descriptor
PUT id_number resource
return_address = {xxx.xxx.xxx.xxx | hostname}:port_number
resource_descriptor = resource_type [description]
description = resource-specific description data (may be optional for certain resource types)
resource = resource_type resource_data
A PUT command corresponding to a GET command (in other words, a put command sharing the same ID as the GET command) can be of one of three types:
1. A resource matching the resource descriptor specified by the GET command.
2. A GET_REQUEST_FAILURE resource.
3. A MORE_KNOWLEDGEABLE_HOSTS resource.
Examples:
See resourceSpecifications.txt for information about the resource types used in these examples.
A GET request for a server list:
GET myhost.mydomain.com:5157 1 SERVER_LIST
The corresponding PUT command:
PUT 1 SERVER_LIST 3 server1.domain1.com:5157 server2.domain2.com:5157 server3.domain3.com:5157
A GET request for a search:
GET myhost.mydomain.com:5157 2 SEARCH 1 FILE 3 termA termB termC
A PUT command from a successful search:
PUT 2 SEARCH 2 FILE filehost1.domain1.com:5157 c:/files/termAtermBtermC.txt FILE filehost2.domain2.com:5157 /home/john/myfiles/termAtermBtermC.txt
A PUT command for a failed search:
PUT 2 GET_REQUEST_FAILED SEARCH 1 FILE 3 termA termB termC
A PUT command informing us of a more knowledgeable search host:
PUT 2 MORE_KNOWLEDGEABLE_HOSTS 1 searchhost.domain.com:5157
We can request more knowledgeable hosts explicitly:
GET myhost.mydomain.com:5157 3 MORE_KNOWLEDGEABLE_HOSTS SEARCH 1 FILE 3 termA termB termC
The corresponding PUT command:
PUT 3 MORE_KNOWLEDGEABLE_HOSTS 1 searchhost.domain.com:5157
A GET request for a file:
GET myhost.mydomain.com:5157 4 FILE filehost1.domain1.com:5157 c:/files/termAtermBtermC.txt
The corresponding PUT command:
PUT 4 FILE c:/files/termAtermBtermC.txt 19 ###this is a text file
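Since all white space is equivalent, the command forms above can be parsed as flat token streams. The function and field names below are assumptions chosen to mirror the grammar; type-specific resource parsing is left to per-type handlers:

```python
# Sketch of a tokenizer/parser for the GET and PUT command forms.
# All white space is equivalent, so str.split() handles tabs, line
# breaks, and spaces uniformly.

def parse_command(text):
    tokens = text.split()
    if tokens[0] == "GET":
        return {
            "command": "GET",
            "return_address": tokens[1],
            "id_number": int(tokens[2]),
            "resource_descriptor": tokens[3:],  # resource_type [description]
        }
    if tokens[0] == "PUT":
        return {
            "command": "PUT",
            "id_number": int(tokens[1]),
            "resource": tokens[2:],  # resource_type resource_data
        }
    raise ValueError("unknown command: " + tokens[0])
```

One caveat: FILE resource data begins immediately after a ### marker and may itself contain white space, so a FILE PUT needs length-delimited handling rather than plain tokenizing (see the FILE type in resourceSpecifications.txt).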

Greetings,
I had a bit of correspondence with you about a year ago concerning "konspire", a distributed file sharing system that I was developing. As you may be aware, konspire never caught on. konspire failed to be popular for several reasons, but the major reason was that it was written using Java and had no natively compiled client or server available. Most people don't like to mess with getting a Java application up and running. Another problem was ease of use: konspire didn't work immediately "out of the box", as users had to enter the address of a live server the first time they ran konspire. Finally, there was a problem that never rose to the surface, which was scalability: had konspire become popular, the indexing scheme being used would have limited the network to about 100,000 files.
When I was developing konspire, I was focusing on "complete searching" capabilities. When you got search results back using konspire, you could be certain that you were seeing a list of all files in the system that matched your search. With Gnutella, results trickled back over time, and you were never sure when they would stop arriving. With Napster, you could only search through files on one of their sub-servers.
Of course, both Gnutella and Napster scale very well: in theory, the number of nodes in the network has no limit (though you can only search through a fraction of the available files).
With konspire, each designated "server" node keeps a complete copy of the system-wide file index. The size of the index is bounded by the size of the memory available to the most memory-limited server. Thus, konspire does not scale well at all. Furthermore, even if the limit is not hit, who wants to donate their system's entire memory to the konspire file index? Add the 20MiB+ base memory requirements of Java to this, and konspire servers are major memory hogs.
For a while, plans for konspire v1.2 were in place, and the new protocol would attempt to fix the scaling problems by splitting each copy of the index among a cluster of server nodes. Keeping things organized gets very messy, though. In the meantime, much simpler systems like Gnutella started to work well, once they had a larger, consistent user base and a more connected network. In fact, you can find almost anything these days using a BearShare client... why bother with a smarter protocol?
FastTrack improved the scalability with ideas similar to those of konspire v1.2, but they threw in a nice self-organization concept: self-organization frees novice users from making decisions (many users were endlessly confused about the difference between clients and servers in konspire). FastTrack has a huge problem, though: they are commercial and proprietary. You would think that these companies would learn their lesson from the Napster debacle. Do they think the RIAA will look the other way? Notice that the RIAA has not even touched Gnutella, since they have no one to point a suit at.
The FastTrack protocol (supposedly) is also very specific to the way FastTrack organizes a network. Network organization and peer communication protocol seem like they could be completely separated. The kind of messages sent in a Napster network are pretty much the same as the kind of messages sent in a Gnutella network. The differences lie in how nodes behave and what they do upon receipt of a particular message. Think of a search request message: Gnutella nodes search themselves and pass the request on to other nodes; FastTrack nodes pass the request on to super-nodes, who pass the request among themselves. Why do the messages and protocols for a search request need to be different for Gnutella and FastTrack? In fact, Gnutella nodes and FastTrack nodes could be joined into one common network if they only used the same protocol.
In fact, even a network that differs from Gnutella as much as Freenet could use the same communication protocol. Gnutella has download requests as well as search requests. Freenet in essence only has download requests. When a Gnutella node receives a DL request, it sends back the file, or sends a rejection if it doesn't have the file. A Freenet node differs in that it forwards download (key) requests (much like the way Gnutella forwards search requests). Again, another example of networks that differ in node behavior and not communication protocol. Freenet's communication protocol is a subset of the Gnutella communication protocol.
I'm currently toying with a general purpose peer-to-peer protocol. When I say "general purpose", I mean that the protocol is extraordinarily general and abstract. Ideally, any kind of p2p network could make use of this protocol. However, the protocol itself is very simple, so simple that anyone could understand it. One could imagine p2p network developers "adding on" support for this protocol to existing p2p node software, allowing all networks supporting the protocol to be joined together.
One upshot of such a protocol is that it could be used to design a non-proprietary replacement for FastTrack: nodes would decide to become servers and then simply behave differently than the nodes that are clients.
I'm contacting you because I'm wondering if you've run into similar ideas or initiatives out there. If so, could you point me towards them?
Anyway, I'll keep you posted with news as I work on this more... What will the protocol be called? Is the name "konspire" taken yet? ;)
Thanks,
Jason Rohrer
HC software
--
http://www.jasonrohrer.n3.net

%
% Modification History
%
% 2001-October-8 Jason Rohrer
% Created.
%
% 2001-December-2 Jason Rohrer
% Fixed a few typos and mistakes.
%
\documentclass[12pt]{article}
\usepackage{fullpage}
\begin{document}
\title{Generic peer-to-peer protocol notes}
\date{Created: October 8, 2001; Last modified: December 2, 2001}
\maketitle
We can think of a peer-to-peer network as a group of hosts such that each host has a set of resources made available to the other hosts in the group. Peer $A$ can ask peer $B$ for a resource $r$ by sending $B$ a resource descriptor $D_r$, and peer $B$ can respond in one of \ref{response:count} ways:
\begin{enumerate}
\item Send a null response (equivalent to ``I know nothing about $r$'').
\item Return the requested resource $r$.
\item Return the address of a host that may know more about resource $r$ (referred to as informing $A$ of a more knowledgeable host). \label{response:inform}
\item Forward the request for $r$ on to other hosts in the network and inform $A$ that it is doing so.\label{response:forward}
\label{response:count}
\end{enumerate}
For response option \ref{response:forward} to make sense, we must assume that each resource descriptor $D_r$ contains a return address where the resource should be sent and a request identifier so that $A$ knows that a particular inbound transmission is a response to its request for $r$. To be compatible with request forwarding, all responses should be sent to the return address contained in the request. To make sense in this context, the null response should be embodied by no response at all. Because responses are not returned via the same two-way transmission channel on which the corresponding requests were sent, a responding host $B$ can in fact perform any subset of the response options simultaneously. Thus, the system can operate using only one-way transmissions. In this context, the null response corresponds to the empty response set.
A network that avoids option \ref{response:forward} could still be a very usable network and in fact would require far fewer network transmissions to operate (as well as distributing the work load more heavily upon those hosts actually requesting resources). However, such a network would not take advantage of the work distribution possible with a forwarding network. Note that a non-forwarding network can be just as powerful as a forwarding network by clever use of response option \ref{response:inform} in place of \ref{response:forward}. We will deal only with forwarding networks throughout the rest of this document.
As a simple example, consider the case where $r$ is file data. $A$ sends $D_r$ to $B$, and $B$ might execute a null response or return the requested resource by sending back the file data associated with $D_r$. We might assume that $D_r$ contains information that uniquely describes a file resource being offered by $B$, so neither response option \ref{response:inform} nor response option \ref{response:forward} would apply.
For a more complex example, consider the case where $r$ is a search results set. Note that a search is itself a resource that might be offered by a host, and we might have specially-designated index nodes in the network that offer the resource of searching. Suppose $A$ sends $D_r$ to $B$. $D_r$ might contain information about the resource types to search for, as well as a set of resource descriptors. Suppose that $B$ does not have the capability to perform the requested search, but is aware of a super-node $B'$ that does have searching capabilities. $B$ has two options at this point: return a description of $B'$, or forward $D_r$ to $B'$. Since the method of executing the first option is obvious, consider the second option. $B$ forwards $D_r$ to $B'$, and $A$ waits for a response. $B'$ performs the search, constructing the set $R = \{D_{r'} \mid r'$ matches the search criteria of $r \}$. $B'$ attaches the identifier from $D_r$ to $R$ and then sends it to the return address found in $D_r$. $A$ receives $R$. By examining the identifier, $A$ knows $R$ is a response to $D_r$.
In addition, $B'$ might forward $A$'s search request on to other indexing nodes. Thus, $A$ might receive multiple responses to $D_r$ and could track them all by their common identifier.
As another example, we might think of an indexing node $B$ sending requests to non-indexing nodes to collect data about their resources. In this case, resource $r$ would be a resource list. $B$ might send $D_r$ to $A$, and $A$ might return its resource list as well as forward $D_r$ to the resource hosts that it knows about. In this case, $B$ might receive responses to $D_r$ from many different hosts, but all responses will contain an identifier that $B$ can recognize. Thus, $B$ can collect a substantial index of resources in the network simply by sending out a single request. $B$ might have a limit on its index size, so it might send out $D_r$ to various hosts until it builds a large enough index and then discard further responses. $B$ could update its index in the future by sending out a new $D_r$ with a different identifier.
Because response option \ref{response:inform} and response option \ref{response:forward} are equally powerful, we can ignore option \ref{response:inform} in our further discussions. We choose to ignore \ref{response:inform} because we are primarily interested in the work distribution possible with forwarding networks. However, we should note that using option \ref{response:inform} wisely can result in drastic savings in terms of the number of network transmissions sent. For example, if only forwarding is used, the following situation might arise. Consider the host set $\{A, B, C, D\}$, and suppose that the only indexing node is $D$. Let $K$ be a binary knowledge relation such that $(X,Y)\in K \Rightarrow$ ``$X$ knows about $Y$''. In our system, assume $K=\{ (A,B), (B,C), (C,D) \}$. Suppose $A$ needs a search result set $r$. $A$ can only send $D_r$ to $B$. $B$ forwards the request to $C$, and $C$ forwards to $D$. $D$ fills the request and sends the results back to $A$ via the return address in $D_r$. When $A$ needs to search again, the same transmission process occurs, resulting in 4 transmissions for each search requested by $A$. By using response option \ref{response:inform}, $C$ might inform $A$ about $D$. Thus, $A$'s first search would require 5 transmissions, but each additional search would require only 2 transmissions. Even if response option \ref{response:inform} was used by $B$ during $A$'s first search to inform $A$ about $C$, $A$'s first search would only require 6 transmissions.
For a set of hosts $S$ such that $|S|=n$ and for an arbitrary knowledge relation $K_S$, if $(X,Y)$ is in the transitive closure of $K_S$, then the worst-case number of transmissions for each request fulfilled by $Y$ for $X$ is $O(n)$ if only forwarding is used. If the informing response option is always used along with forwarding, we have $2(n-1)$ transmissions in the worst case for the initial request, and $2$ transmissions for each subsequent request. For $O(n)$ requests, we execute $O(1)$ amortized transmissions per request.
\end{document}

This file contains specifications for resource types. For each type, we give a specification for the resource descriptor and then the resource. Newlines in this specification can be replaced by any kind of white space.
GET_REQUEST_FAILURE
[no description]
GET_REQUEST_FAILURE
resource_descriptor
(Note that a GET_REQUEST_FAILURE resource contains the descriptor from the failed request. The PUT request for a GET_REQUEST_FAILURE resource should use the ID number from the original resource request.)
MORE_KNOWLEDGEABLE_HOSTS
resource_descriptor
MORE_KNOWLEDGEABLE_HOSTS
num_hosts
address_1:port_1
address_2:port_2
...
address_N:port_N
SERVER_LIST
[no description]
SERVER_LIST
num_servers
address_1:port_1
address_2:port_2
...
address_N:port_N
SEARCH
num_allowed_resource_types
allowed_type_1
allowed_type_2
...
allowed_type_N
num_search_terms
search_term_1
search_term_2
...
search_term_N
SEARCH
num_results
result_descriptor_1
result_descriptor_2
...
result_descriptor_N
FILE
host_address:host_port
file_path
FILE
file_path
file_length_bytes
###file_data
(Note that for the FILE type, the ### must occur immediately before the file data; no whitespace may separate ### from the start of the file.)
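The FILE framing rule can be sketched as follows. The function names are assumptions for illustration; the layout follows the spec above, with whitespace-separated header fields, ### immediately before the raw bytes, and file_length_bytes telling the receiver how much data to read:

```python
# Sketch of FILE resource framing.  The header never contains "###",
# so the first occurrence of "###" in the blob marks where raw file
# data begins.

def encode_file_resource(file_path: str, data: bytes) -> bytes:
    header = "FILE %s %d " % (file_path, len(data))
    return header.encode() + b"###" + data

def decode_file_resource(blob: bytes):
    header, _, rest = blob.partition(b"###")
    fields = header.split()
    assert fields[0] == b"FILE"
    path = fields[1].decode()
    length = int(fields[2])
    return path, rest[:length]
```

Length-delimited framing is what lets the file data itself contain any bytes, including white space and further ### sequences, without confusing the parser.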