Speaking HTTP: A File-Uploader Tool
by Oleg Kiselyov
Oleg Kiselyov is a computer scientist with Computer Science Corporation, in Monterey, California. Soon it will be twenty years since he started using computers to solve somebody else's problems.
This article describes a simple HTTP uploading tool. HTTP is commonly viewed as something that happens between a browser and a Web server. However, HTTP is useful in its own right, for example, as a file-distribution protocol with a number of important advantages over ftp. This article gives an example of how to speak HTTP and be understood.
The HTTP uploader is somewhat reminiscent of Microsoft FrontPage's server extensions. It lets you push binary or text content (Web pages, images, binary files) from one computer to another. If the source platform is Winxx/WinNT, you can make a shortcut to a script that will let you upload files just by dragging and dropping them onto an icon. The tool works through Web proxies and gateways: if you can download Web pages, you should be able to upload files as well. The uploader tool works on various versions of UNIX and WinNT/Winxx with different HTTP servers: I tried Apache, Netscape, and IIS.
The tool is made of two Perl scripts, one of them a CGI script. The choice of the implementation language is accidental and irrelevant. What deserves admiration is the HTTP protocol, whose power and simplicity make even far more complex applications possible.
HTTP Protocol
An operation to execute remotely is expressed in HTTP as an application of a request method to a resource. Additional parameters, if needed, are communicated via request headers or a request body. The request body may be an arbitrary octet-stream. The HTTP/1.1 standard defines methods GET, HEAD, POST, PUT, DELETE, OPTIONS, TRACE, and CONNECT. A particular server may accept many others. This extensibility is a rather notable feature of HTTP. The parties can use not only custom methods but custom request and reply headers as well. In addition, a client and a server may exchange meta-information via "name=value" attribute pairs of the standard "Content-Type:" header.
Most of the HTTP transactions performed every day are done behind the scenes by browsers, proxies, robots, and servers. Yet the protocol is so simple that one can easily speak it oneself. The only requirement is a language or tool that is able to manipulate text strings and establish TCP connections. Even a simple telnet application may do in a pinch, which is often useful for debugging. Server-side programming is less demanding: a servlet or a scriptlet does not need to bother with network connectivity, authentication, access restrictions, SSL, and other similar chores. Server modules or FastCGI give a server-side programmer even more tools: load-balancing, persistence, database connectivity, etc. This article demonstrates how to use Perl scripts to speak HTTP and to respond to it directly.
Making an Upload Request: An uptow Script
The uptow script is invoked as
uptow dest-directory local-file-path
It copies the file specified by local-file-path to a remote site. The data is placed into the specified dest-directory on the remote site under the same (base) filename. The server will typically prepend a predefined path to this dest-directory (e.g., /usr/local/htdocs or /w/data) to confine file updates to that part of its file system. The script publishes files synchronously and always reports the result of the transfer. The remote site to which to publish is identified by a number of configuration parameters: $REMOTE_HOST, $REMOTE_PORT, and $TAKER_URI. It is trivial to modify the script to get these parameters from environment variables or to read them from a configuration file.
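When called as uptow mysite/dev /tmp/data.txt, the script establishes a TCP connection to a destination HTTP server ($REMOTE_HOST) and sends it a PUT request: a request line, a set of request headers, an empty line, and the content of the file as the message body.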
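A request of this shape could be assembled roughly as follows. This is a minimal sketch, not the uptow code itself: the function make_put_request, the agent name, the credentials, the 10-byte payload, and the choice of HTTP/1.1 in the request line are all assumptions made for illustration. The optional via_proxy flag produces the full-URL form of the request line that is used when the request is relayed by a Web proxy.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my $CRLF = "\015\012";    # carriage return (13) followed by line feed (10)

# Assemble a PUT request for a small text payload.
# The argument names and values are illustrative assumptions.
sub make_put_request {
    my (%arg) = @_;
    # Through a proxy the request line carries the full URL;
    # otherwise only the path part is sent.
    my $target = $arg{via_proxy}
        ? "http://$arg{host}$arg{taker_uri}/$arg{dest_dir}/"
        : "$arg{taker_uri}/$arg{dest_dir}/";
    # Basic credential: "user:password", BASE64-encoded on a single line.
    my $credential = encode_base64("$arg{user}:$arg{password}", "");
    my @lines = (
        "PUT $target HTTP/1.1",          # protocol version is an assumption
        "Host: $arg{host}",
        "User-Agent: $arg{agent}",
        "Authorization: Basic $credential",
        qq{Content-Type: text/plain; filename="$arg{file_name}"},
        "Content-Length: " . length($arg{body}),   # body is a byte string
    );
    # Request line and headers, an empty line, then the payload.
    return join($CRLF, @lines) . $CRLF . $CRLF . $arg{body};
}

my $request = make_put_request(
    host      => "hostname.org",
    taker_uri => "/cgi-bin/admin/Update-w-Taker.pl",
    dest_dir  => "mysite/dev",
    file_name => "data.txt",
    user      => "Aladdin",             # hypothetical credentials
    password  => "open sesame",
    agent     => "uptow-sketch/0.1",    # hypothetical agent name
    body      => "0123456789",          # a 10-byte payload
);
print $request, "\n";
```

The string returned by make_put_request is what would be written to the socket once the TCP connection is established.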
The first line of the message is a request line. It is followed by request headers, in a "Name: value" format similar to that of RFC822 mail headers. The names in the headers are case-insensitive. The first empty line signifies the end of the headers. If the message includes a body, as it does in our case, the payload data is sent immediately after the empty line. The request line and the header lines are terminated by a carriage return/line feed (CRLF) character sequence: the character with decimal code 13, followed by the character with decimal code 10.
The request line tells what operation to perform, where to put the payload data, and what version of the HTTP protocol we will speak. Some obsolete firewalls and proxies may either refuse PUT requests outright or break the connection without indicating any error. Also, some Web servers may be configured by default to disallow PUT methods (see below for the server configuration). If you encounter that situation, you can change the uptow and Update-w-Taker.pl scripts to use a POST method instead. The latter is more widely accepted. The PUT method nevertheless seems to be the most appropriate for upload.
The location to store the payload data is specified by a Uniform Resource Identifier (URI). It is a character string that looks like an absolute UNIX file path. The meaning may, however, be different, as we will see. The uptow script creates this URI by appending the desired upload location to the $TAKER_URI. The "update taker" section below explains what an HTTP server does with such a string.
It may happen that the server to which we upload data is not directly accessible. For example, the server and the uptow client may be separated by a firewall. All HTTP transactions between computers on the different sides of a firewall must therefore go through a dedicated Web proxy or a gateway. A Web browser or any other HTTP client has to be made aware of such an arrangement.
Specifically, to tell the uptow script to use a proxy you have to set a $PROXY_NAME parameter. The script will then connect to that proxy and have it relay the request to the destination server. The relay request looks just like the direct upload request above. Only the first line is slightly different:
PUT http://hostname.org/cgi-bin/admin/Update-w-Taker.pl/mysite/dev/
That is, instead of a URI naming a resource to create, we send the full URL, including the "http://hostname.org" part. Here hostname.org is the name of a host to which we upload the file. The proxy strips away this "http://hostname.org" part when it sends the request to the destination server.
The HTTP protocol defines a number of headers that should or may be used during an HTTP exchange. Here we will describe the particular subset of headers that is used in file-upload transactions. The Host header identifies the request target server. It is a good idea always to supply this header. Moreover, it is mandatory in version 1.1 of the HTTP protocol. The User-Agent header identifies the client software (the uptow script, in our case). The server usually quotes this information in its logs.
An HTTP server may be configured to demand to know the identity of the user before it will consider a request of the user's agent. The Authorization header should then be present to specify an authentication scheme and the corresponding credential. In the most basic authentication scheme, the one used in our example, a user is identified by a symbolic ID and a password. These two strings, separated by a single colon (:) character and BASE64-encoded, constitute the corresponding credential. Every Web server is guaranteed to support the basic scheme. Yet it is hardly secure, since it transmits passwords in an easily decodable form, almost in plain text. Incidentally, the ftp and telnet protocols suffer from the same problem. HTTP/1.1 defines a more secure Digest scheme[2].
An HTTP client may also attempt a Secure Sockets Layer (SSL) connection to a Web server. SSL is a lower-level (transport) protocol; therefore, the content of an HTTP conversation is unaffected by the fact that it is transmitted over an SSL connection.
When a request such as ours has a body, the type and the size of its data have to be identified by Content-Type and Content-Length headers.
The former should tell the media type of the data: the "MIME type," as it is often called. Content-Length is the size of the body in bytes. If the data being uploaded is ASCII text, the media type may be set to "text/plain," as above. When the payload is intended to be stored without any further processing, the "application/octet-stream" MIME type seems the most appropriate.
Although the request line and the headers are in ASCII, the body of an HTTP message can carry arbitrary data. Unfortunately, some obsolete Web proxies and gateways (notably Raptor 5.0) are not 8-bit transparent: they do not like zero bytes in a request stream. Apparently firewall programmers used the function strncpy() where memcpy() would have been more appropriate. The uptow script tries to check whether a file to send is ASCII or binary. If it is ASCII, the media type of the Content-Type header is set to "text/plain," and the file is sent as it is. Otherwise, the data is encoded into a hexadecimal stream; BASE64 encoding can be used as well. The media type of "application/x-octet-stream-b2a" identifies the encoded content.
A Transfer-Encoding header may seem the most fitting place to specify an encoding. Alas, Apache accepts only one value for this request header: chunked. Any other value in Transfer-Encoding results in a BAD_REQUEST error. In any case, the payload encoding concerns only pushing of data via a particular obsolete proxy, which I happen to be burdened with. If your Web proxy follows the HTTP standard or you connect to a server directly, you can set the media type to "text/plain" or "application/octet-stream" and forget about encoding.
It is not commonly recognized that a Content-Type header may carry parameters in the "name=value" format. The parameters are separated from the media type and from one another by semicolons. The value can be an arbitrary string, possibly quoted if it contains spaces and other special characters.
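The choice between sending a file as it is and sending it hex-encoded, together with the resulting Content-Type value, can be sketched as below. The function encode_payload, its test for "looks like ASCII text", and the exact hex format (lowercase digits, no separators) are assumptions for illustration; the check that the uptow script actually performs may differ.

```perl
use strict;
use warnings;

# Decide how to send a payload: plain text as is, anything else as hex.
# The ASCII test below is an illustrative assumption.
sub encode_payload {
    my ($data, $file_name) = @_;
    my ($media_type, $body);
    if ($data =~ /\A[\t\n\r\x20-\x7e]*\z/) {
        # Printable ASCII plus tab, CR, and LF: send unchanged.
        ($media_type, $body) = ("text/plain", $data);
    } else {
        # Two hexadecimal digits per byte, so no zero bytes remain.
        ($media_type, $body) =
            ("application/x-octet-stream-b2a", unpack("H*", $data));
    }
    # The media type is followed by a semicolon and a name=value parameter.
    my $content_type = qq{$media_type; filename="$file_name"};
    return ($content_type, length($body), $body);
}

my ($type, $length, $body) =
    encode_payload("\x00\x01\xfe\xff", "logo.gif");   # hypothetical file
print "Content-Type: $type\n";
print "Content-Length: $length\n";
print "$body\n";
```

Note that Content-Length must describe the body as it is actually sent, so for the encoded case it is the length of the hex string, twice the size of the file.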
In our example, Content-Type has one parameter: filename. It tells the base name of the file being uploaded. We could just as well have passed this information via a custom request header, for example, X-Filename: data.txt. HTTP is an extensible protocol, which explicitly allows custom headers. A server ignores any headers it does not recognize.
The HTTP protocol has another powerful feature that unfortunately remains relatively obscure: the body of an HTTP message may be composed of several parts. This is similar to multipart/mixed or multipart/digest MIME email messages, which may carry several pieces of information within a single entity. We can therefore upload several files in one transaction by encapsulating them as separate parts of a single request body. We can also upload a tar file and have the server extract its members. In any case, the corresponding modifications to the uptow and Update-w-Taker.pl scripts are trivial.
Strictly speaking, we do not have to use the uptow script to upload a file. For example, we can forgo convenience and employ a spartan tcp-transaction tool, tcp-trans[3]. We can even enter telnet hostname.org 80 on the command line and type in the request, line by line. Pressing the "Return" key is enough to terminate a line. Although it is not the same as sending the CRLF combination, many Web servers are rather forgiving.
If a server accepts the submitted data and successfully stores it in a desired location, it sends an acknowledgment, an HTTP message:
HTTP/1.1 201 Created /w/data/mysite/dev/data.txt CRLF
The first line of a server response is a status line. It tells the protocol version the server speaks, a numerical result code, and a brief description of the success or failure of the request. The numerical code is a three-digit number intended primarily for a nonhuman agent. A code within the 200 range signifies a successful completion of a request. A 3xx result code tells the agent that an additional action is necessary; a 4xx code is returned if the request is invalid or cannot be fulfilled (for example, because a user failed to authenticate itself or does not have sufficient permissions). Result codes within the 500 range indicate a serious problem on the server side; see the HTTP document[1] for more details. The status line in a server response is followed by reply headers and an empty line. The latter signifies the end of the headers. The response body, if sent, follows right after the empty line. In case of the 201 reply, there is no body. If the server rejects an upload request or fails to satisfy it, the server sends a response as well, with an appropriate error code:
HTTP/1.1 403 Forbidden CRLF
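A client can pick such a status line apart with a few lines of Perl. The sketch below is an illustration rather than the code of the uptow script; the function names parse_status_line and classify_code are hypothetical.

```perl
use strict;
use warnings;

# Split a status line into protocol version, numeric code, and description.
# Returns an empty list if the line does not look like a status line.
sub parse_status_line {
    my ($line) = @_;
    $line =~ s/\015?\012\z//;    # drop the line terminator, if any
    my ($version, $code, $reason) =
        $line =~ m{\AHTTP/(\d+\.\d+) +(\d{3})(?: +(.*))?\z}
        or return;
    return ($version, $code, defined $reason ? $reason : "");
}

# Classify a result code by its range, following the description above.
sub classify_code {
    my ($code) = @_;
    return "success"          if $code >= 200 && $code < 300;
    return "further action"   if $code >= 300 && $code < 400;
    return "request rejected" if $code >= 400 && $code < 500;
    return "server error"     if $code >= 500 && $code < 600;
    return "other";
}

my ($version, $code, $reason) =
    parse_status_line("HTTP/1.1 403 Forbidden\015\012");
print "HTTP $version, code $code ($reason): ", classify_code($code), "\n";
```

A nonhuman agent needs only the numeric code; the description that follows it is meant for the person reading the log.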
Update Taker: An Uploading Server
When an HTTP daemon receives the request, the daemon notices that the request URI string starts with /cgi-bin/. This matches a ScriptAlias rewriting template of the server's configuration. Having performed this and possibly other substitutions and alias expansions, the server scans the components of the resulting path. For example: /usr/local/www/cgi-bin/admin/Update-w-Taker.pl/mysite/dev/
The HTTP server notices that /usr, /usr/local, . . . /usr/local/
CONTENT_LENGTH=10
In particular, REQUEST_METHOD tells the method: PUT in our case. All but the well-known request headers are passed as environment variables whose names start with "HTTP_", e.g., HTTP_HOST and HTTP_USER_AGENT. If we submitted a request with the header X-Filename, the CGI script would check for an environment variable HTTP_X_FILENAME. When the request method is PUT, CONTENT_TYPE and CONTENT_LENGTH environment variables must be present to tell the data format and the message size.
When parsing the transformed URI above, the server stopped at /usr/local/www/cgi-bin/admin/Update-w-Taker.pl. But the URI continues with /mysite/dev/. This string, if not empty, becomes the content of the environment variable PATH_INFO. The HTTP daemon treats this information as an opaque string: the server does not make any attempt to check whether this string represents a local file, or even whether the string is a valid path string at all.
The content-type of a submitted file must be either application/x-octet-stream-b2a; filename="data.txt" or text/plain; filename="data.txt". This content is stored in a file with the given base name in a directory specified by the PATH_INFO parameter, after prepending the $Dest_root. Thus it is generally impossible to place the content outside the $Dest_root tree. Alas, symbolic directory links may defeat this safeguard. The target file is created if needed. The script must have permissions to write into this file or to create it. The script responds with a "201 Created" message or with one of the HTTP error codes. All of the Taker's activity is logged.
Note that both the uptow and Update-w-Taker.pl scripts are (deliberately) written using only the most basic facilities: the core Perl and a Socket module. Therefore it is trivial to rewrite the scripts in some other language, such as Python or Tcl.
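How a CGI script might turn these environment variables into a destination file name is sketched below. This is not the Update-w-Taker.pl code: the function destination_path and its particular checks are assumptions for illustration, and its $dest_root argument merely plays the role of $Dest_root. The sketch rejects "." and ".." path components but, like any check on path strings alone, it does nothing about symbolic links.

```perl
use strict;
use warnings;

# Work out where to store an upload, given the CGI environment.
# Returns (path, media type), or an empty list if the request is unacceptable.
sub destination_path {
    my ($dest_root, $env) = @_;
    return unless ($env->{REQUEST_METHOD} // "") eq "PUT";
    return unless ($env->{CONTENT_LENGTH} // "") =~ /\A\d+\z/;

    # One of the two accepted media types, then the filename parameter.
    # The base name may not contain a slash, a backslash, or a quote.
    my ($media_type, $file_name) = ($env->{CONTENT_TYPE} // "") =~
        m{\A\s*(text/plain|application/x-octet-stream-b2a)\s*;\s*filename="([^"/\\]+)"\s*\z}
        or return;
    return if $file_name =~ /\A\.\.?\z/;

    # PATH_INFO is just a string; refuse components that climb upward.
    my @dirs = grep { length } split m{/}, ($env->{PATH_INFO} // "");
    return if grep { $_ eq "." || $_ eq ".." } @dirs;

    return (join("/", $dest_root, @dirs, $file_name), $media_type);
}

# In a real CGI script the second argument would be \%ENV.
my %env = (
    REQUEST_METHOD => "PUT",
    CONTENT_LENGTH => "10",
    CONTENT_TYPE   => 'text/plain; filename="data.txt"',
    PATH_INFO      => "/mysite/dev/",
);
my ($path, $type) = destination_path("/w/data", \%env);
print "store $type content in $path\n";
```

A script built along these lines would then read CONTENT_LENGTH bytes from its standard input, decode them if the media type calls for it, and write the result to the computed path.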
HTTP Versus FTP as a File-uploading Protocol
Both ftp and HTTP can be used to upload files from within a firewall. HTTP, however, is designed to operate transparently through proxies and gateways, while ftp requires special clients (enabled for SOCKS or the like) and possibly a PASV mode. HTTPFS can rely on authentication mechanisms already built into Web servers, in addition to its own access control.
Whenever a file gets uploaded, a receiving HTTP server can synchronously fire up triggers and run arbitrary hooks. This is very difficult to accomplish with ftp. Moreover, if an uploaded file is meant to be fed into an application (e.g., tar, content indexer, META-tag creator, etc.), a receiving HTTP server can launch an application and have it process data while it arrives. There is no need to save incoming data to a file and then pass it to an application. HTTP offers similar advantages over ftp as a file downloading tool.
HTTP and ftp also differ in how tightly they couple a client and a server. When an ftp client uploads a file, it has to perform a cd and possibly chmod, ren, and other operations on a remote server, in addition to the put operation. If an administrator of the remote site wishes to have the content put under a different name in a different location, she cannot do that unless she talks to the user making an upload and gets him to change the cd command. During an ftp session a client exercises control, albeit limited, over a server.
This is not the case with an HTTP upload. A client does not perform any directory navigation or file operations on the server site. The client merely hands over the data and indicates desired file and directory names and similar meta-information. It is up to the receiving server to store, process, or even discard the content as the server sees fit. The client has no idea of or control over the way the server processes the submitted data. That means a server administrator can change handling of the incoming content at will and the client will never know or care.
Advanced Applications of HTTP
A particular HTTP-based network virtual filesystem is described at <http://pobox.com/~oleg/ftp/HTTP-VFS.html>. It allows one to access, create, and modify remote files as if they were on a local filesystem and to handle RFC822 email messages as if they were local read-only directories. Each email header with the message's body constitutes a "file." An advantage of HTTPFS is that it lets one develop XML, etc., "filesystems" quickly, without any need to modify the kernel.
REFERENCES
[2] J. Franks, P. Hallam-Baker, J. Hostetler,
[3] tcp-transactor-- a shell tool.
Appendix: HTTP Uploading Tool
# Configuration parameters
my $buffer; # i/o (socket) buffer...
# Main module
# Check the file to publish...
print STDERR "Sending $file_name of $file_size bytes...\n";
# Establish the connection with a server
# Making the request (first in $buffer)
syswrite(SOCK,$buffer,length($buffer)) or die "Request sending error: $!";
my $to_read = $file_size; my $res;
# Read the status line (the first line of the response)...
# Read the rest of the response and dump it...
if( $response_code == 304 )
if( $response_code >= 300 )
print STDERR "Success\n";
# Print help as how to use the program. Print $1 as the title
open(THIS_SCRIPT,"$0") || die "Can't open this script to print out help, due to $!";
Last changed: 20 Jul. 2000 mc