
Name for the concept of placing the contents of a whole website into a single file and serving it directly from there. By Neil Madden.

He, Neil, already uses something close to this for his personal website, based on Metakit and tDOM: he stores XML and converts it to HTML. This is currently done offline. The moment this is changed to online generation, the StarSite is complete.

AK: Related to this is a notion from BrowseX. This browser allows the retrieval of a website and its storage in a zip archive. This is usually meant for easy transfer of a website, but it also allows display of the archive via BrowseX, without unpacking it (IIRC - AK).

AK: Obvious extensions of the concept.

  • A local mode, i.e. running the StarKit containing the StarSite, or providing access to it [*], in a non-web environment pops up a Tk-based display which allows browsing the site without a web browser.
  • An extension of such a local mode would be to enable the editing of pages in the site.
  • Allow the StarKit to run not only as a CGI-type application, but also as its own web server.

Depending on the exact nature of how web pages are stored in the StarSite, this can have significant overlap with the code base providing the Tcl'ers Wiki, especially as some discussed extensions to it would allow the storage of not only Wiki markup but other types of data as well, such as images, HTML, or XML. The similarity should be clear by now.

AK: [*] I should explain. My first association was that the code providing access to the contents of the StarSite was part of the StarSite Metakit, in a VFS, making the StarSite a StarKit, or StarPack. As the Wiki codebase shows, that doesn't have to be the case. The StarKit containing the code can be distinct from the Metakit database containing the StarSite.

Regarding editing: For HTML this might have to be a free-form editor. For XML we can use StarDOM. Also note that the AlphaTk editor already has a Wiki mode. Extending this to HTML and XML modes might be simple. This implies that the StarSite access StarKit does not need to have the editor embedded in it, although that is an option too. It just has to have a way of invoking editors, into which we can hook our preferred editor.

NEM - Yes, I was thinking along these lines. Some other things I am considering are:

  • Storing data with a mime-type association (text/html, text/xml, image/gif etc). I don't believe that mk4vfs does this presently.
  • Allowing viewing the database as a metakit database, or a filesystem (both are useful at times).
  • Some sort of authentication/access-control built in. Wiki type applications with universal access are useful for some things, but often, you want more security. This needs to be designed in from the start, to be effective.
  • Versioning/Archiving (just like the wiki, but maybe more fine-grained?)
  • Ability to run as a standalone HTTP server or as CGI, with a consistent scripting API in both environments (i.e. a script shouldn't care).
  • Some mechanism for plugging in XML/XSLT transformations.
  • Ability to query database using XPath???
  • Ability to group items together (for instance, grouping identical pictures in different formats: a .gif/.jpg for web grouped with a bitmap for WAP).

Lots to think about. I think that getting the authentication/access-control stuff right will be the toughest bit. Looking at Zope, everything seems to be an "object" (including users, scripts, static content), which can have access permissions granted to it. I know adding security features might seem like overkill, but it needs to be there from the start if anyone wants to use it (which I will). Adding it later would be a bit hackish.

NEM - Excellent summary. This is exactly what I was planning. My main interest was in XML/XSLT generation of content, but really anything should be possible. The StarSite would sit on the server and intercept requests using the PATH_TRANSLATED variable. So, for instance, on my website currently, the script xml.cgi can be invoked like http://www.tallniel.co.uk/cgi-bin/xml.cgi/home.xml which grabs the home.xml file and applies the necessary stylesheets to it. Likewise, images could also be requested and returned from the database. The fact that MetaKit is the backend allows for sophisticated searching and user interface options (session management, personalization etc). Mirroring a site would be a case of copying one highly-compressed MetaKit datafile. I find this concept quite exciting.
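
For concreteness, here is a minimal sketch of that kind of CGI dispatch, assuming a Metakit file with a pages view holding name, mime and content fields; the file name, view layout and field names are illustrative and not the actual layout of xml.cgi:

 #!/usr/bin/env tclsh
 # Hypothetical front-end: map PATH_INFO onto a page stored in a
 # Metakit datafile and return it with its stored mime-type.
 package require Mk4tcl

 # e.g. "/home.xml" when invoked as .../starsite.cgi/home.xml
 set path [expr {[info exists env(PATH_INFO)] ? $env(PATH_INFO) : "/"}]

 mk::file open db site.mk -readonly
 set rows [mk::select db.pages name $path]

 if {[llength $rows] == 0} {
     puts "Status: 404 Not Found\nContent-Type: text/plain\n"
     puts "No such page: $path"
 } else {
     set row [lindex $rows 0]
     puts "Content-Type: [mk::get db.pages!$row mime]\n"
     puts [mk::get db.pages!$row content]
 }
 mk::file close db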

Note that the Wikit is a case of putting the contents of a website into a file. I see above that Starsite would include a web server and, rather than using a markup style and conversion like the wikit does right now, would use xml as the markup and tdom or tclxml as the conversion software. Another difference appears to be that wikit is about content management, in a sense, in that visitors to the web site have the ability to update the pages. What other differences are envisioned?

Well - as far as I am aware, wikit only allows the inclusion of the textual content in one file. The StarSite concept takes this a bit further, by allowing images, media etc. to be stored in the same file, as well as other information (e.g. a user database). The idea of a StarSite, as I (NEM) envision it, is that it should be able to do whatever a normal website can do, but with the added advantage of having everything in one file. So, you could, theoretically, put a wikit inside a StarSite. That is how I see it developing.

At the moment it is nothing but this collection of ideas. When things start to reach a more coherent state, I (and any others who wish to join me) will sit down and start making it. The ability to update a StarSite (or parts thereof) over the web is a feature I would like to include. The XML references are just there as that is what I like to create my site in. However, I feel StarSite should be broader than that. It should be a means of encapsulating a whole web site, with various common functionality available to make things easier (collaborative editing, authentication, session management, data storage etc).

In the simplest case, a person would fire it up at home and use the Tk GUI to add static content (HTML, pictures etc). When finished, they would simply ftp the file to the webspace they use (in a cgi directory), and it would just work (just like starkits - no hassle installation). Alternatively, it could run as its own webserver, for intranets and the like. StarKits solve installation problems for regular applications. StarSites would solve it for web applications.

AK:

  • Look at Ideas for Wikit enhancements and Christophe Muller to see the overlaps.
  • Using mime-type association for the content: Exactly as proposed for the wiki. Note that the wiki stores its pages directly in Metakit tables. It does not use the mk4VFS for its contents.
  • mime-types / mk4vfs: Interesting idea. Generalized: User-defined attributes for files. I am not sure, but I believe there are even native filesystems which might support this. Needs research.
  • Authentication/Security: Agree with building this in from the start.
  • Authentication/Security: Has to allow deactivation. Example: Wiki
  • Versioning/Archiving: The wiki codebase itself remembers the times of any change, and also saves out any change to a directory, if so configured. It only does not remember the exact changes/diffs in the internal database. The history of the [Tcler's Wiki] itself is a daily CVS import of the current state, making this more coarse-grained than the wiki codebase is able to support.
  • Regarding plugins: Ties to mime-types in my view. Based on the mime-type of a content page, and the chosen output medium we can choose which renderer to use, which editor to use, etc. The wiki already has several Wiki Markup renderers chosen automatically upon 'format' flag and medium (Tk vs. Web).

NEM 30Nov2002: Latest brainstorming on this (flow of control of a request coming into a starsite):

  • The whole system sits on a special virtual filesystem, with some differences:
      • Files have a mime-type associated with them.
      • As well as directories, there is the concept of sections. These are mounted onto directory points, and control access to all files from that point down (until a new section starts).
      • These sections are essentially directories, but with some procedures associated with them - namely a handle-request procedure and a handle-error procedure (possibly others).
      • Sections have an access-control list associated with them. This consists of a list of groups and a set of permissions. Initially, I think the following permissions:
          • page - create, delete, read, edit.
          • subsection - create, delete, read, edit.
      • Groups are like they are on UNIX. Users are people viewing/editing the site; users belong to groups. There are two special users: anonymous is a non-logged-in user, webmaster is the super-user. There are likewise two such-named groups which contain these users. The webmaster (or admin, or root, ...) group has complete access to everything, while the anonymous group typically would only have read-only access (notice, though, how a section can override this in its access control list, so a wiki could work). A user who has edit permissions on a section can alter the permissions (?? - maybe).
  • Access to this VFS is through a special API (probably not the standard Tcl VFS API, due to the need for mime-type associations).
  • Right, now onto how this all works:
  • A request comes into the starsite (either through the built in webserver, or CGI or...). The first stage is to authenticate the user. A separate (replaceable) module handles this. It simply does all it has to do to determine who the user is. It removes any trace of its mechanism from the input (so, if it used a cookie, it would remove the cookie from the list passed in). It returns the username of the person making the request, or anonymous if they are not logged in. This module could work in any way, and so will be replaceable. It only works out the user name, it does not do access-control.
  • Next step, the starsite runtime looks at the requested URL, and figures out which section it falls in (as sections are mapped onto directories, this will be by just finding the most specific directory which is a section map point).
  • The starsite works out the format that the client wants the result in. It will use a (customizable) algorithm based on the specific request (e.g. if .html was requested then return HTML), accept-type headers, and finally, as a last resort, user-agent strings.
  • Access-control: The star-site then looks up the access control list of the section in question, and compares it against the groups which this user belongs to. If the user has access to this section, then we call the request handler for this section, passing in the requested URL, the requested mime-type, and the arguments passed (from ?blah=foo&a=b stuff, and from POSTed data etc. PATH_INFO/PATH_TRANSLATED stuff will not be passed in here - it will be used to figure out the requested URL).
  • The request handler retrieves the file, and performs whatever processing it needs to do (e.g. dynamically generating the file, applying style-sheets, etc), and returns the file contents. The mime-type etc will already have been set. There may be an API for adding extra headers etc.
  • If an error occurs at any time, or if the user doesn't have the correct permissions, then the section's error-handler proc will be called with a mime-type and the error message. It should format a nice error message and return it.
  • How to enforce access-control within a section handler? Well, here I thought the best way would be to only allow access to the VFS through a special API. When a request comes in, the request handler is called in a new interp (or one from a pool). This interp is a safe interp with access to the VFS API set up through aliases. These aliases incorporate the username into them, so that they can check access control without the content-handler having to pass through the name of the user (that would be open to attack). (A small sketch of this idea follows this list.)
  • All access to the VFS and StarSite internals would be through these safe interps with checked access control. This keeps the starsite secure (at least, I think so, but I'm not a security expert - comments appreciated).
  • Versioning/history could be activated on a per-section basis, by adding more information to the interpreter aliases for the API - if an argument is flagged in the call then a versioning routine is called. In fact, the sections could each have an update-handler which handles edits of files, and can store away the old version.
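
A minimal sketch of the safe-interpreter idea above, assuming hypothetical acl::allowed and page::fetch procedures standing in for the real access-control and storage code:

 namespace eval ss {}

 # The master-side command that the safe interp is allowed to reach.
 # The username is supplied by the alias, never by the handler code.
 proc ss::read {user path} {
     if {![acl::allowed $user $path read]} {
         error "access denied: $user may not read $path"
     }
     return [page::fetch $path]
 }

 # One safe interpreter per session, with the user baked into the alias.
 proc ss::makeSession {user} {
     set slave [interp create -safe]
     interp alias $slave readpage {} ss::read $user
     return $slave
 }

 # A section handler would then run inside the safe interp, e.g.:
 #   set session [ss::makeSession "neil"]
 #   $session eval {set body [readpage /starsite/home.xml]}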

This is just some brainstorming, and hasn't been thought through to the bitter end. I quite like the design, but I'm willing to take criticism to perfect this in the design stage. Consider this, a request for comments! Neil Madden.

30nov02 jcw - Interesting ideas... can you elaborate on the usage scenarios? Is this for deployment, i.e. creating a complete site and shipping it? Or is it more to keep things manageable and self-contained?

Note that authorization per dir/subdir is supported in Apache through ".htaccess" - if tclhttpd has similar capabilities, that might be a very quick path to add such features to StarSite, since tclhttpd can work with (as well as *in*) VFS.

Currently, it is not easy to extend VFS with extra info such as a mime type, even though Metakit could easily deal with it. The reason is that the VFS layer opens with a certain layout, which would lose any fields added. Hm, having said that - it's probably possible to open, and immediately reset the layout to include those fields again - data would not be lost. But this leads to another problem: how to make VFS aware of fields such as a mime type. My hunch is that you're best off maintaining a separate data structure for mime types. If stored as a Metakit view, it could still be in the same VFS file (i.e. starkit), with some tricky hacking. If you'd like an example of how to store other views in a starkit, next to the VFS file system tree, let me know.
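
Not the promised example, but a rough sketch of the general idea, assuming the Mk4tcl commands are applied to a datafile that is not mounted at the time; the mimetypes view name is made up for illustration:

 # Hypothetical: add a side view to a Metakit datafile that also holds a
 # VFS tree, without touching the VFS's own views.
 package require Mk4tcl

 mk::file open db mysite.mk            ;# not currently mounted via mk4vfs
 mk::view layout db.mimetypes {path mime}
 mk::row append db.mimetypes path /images/logo.gif mime image/gif
 mk::file commit db
 mk::file close db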

30nov02 NEM - To answer your questions in order: I see starsite as being able to create self-contained sites and then deploy them as complete items, but also to allow editing after they have been deployed. I envision creating a general web interface which allows creating new sections etc. This could be enabled or disabled, even on a per-section basis. Also, section handlers running under appropriate permissions (the permissions of whoever is accessing them, not whoever created them) could update content, and add new content, to that section.

Authorization: I intend to make this fully customizable. Someone could write an authorization routine which uses .htaccess, for instance (although I don't know how this would work in a VFS). Other authorization methods could be used as well. For instance, for my own personal site, I would probably use a custom login procedure, as I do not like .htaccess and I only have CGI (with no SSL). I was thinking about including tclhttpd into the basic starsite so that it can run standalone. It would also be able to run in a cgi environment.

VFS: Yes, currently it would not be easy to add mime-types etc to a VFS. My usage of the term VFS was perhaps confusing, as I was thinking more in general terms of accessing a sort-of filesystem through an API, rather than particularly using Tcl's VFS layer. It would be nice to use it, but I'm not sure how useful it would be (the API commands would probably be quite different to open, read, close etc).

Examples would be nice. I think I could figure it out, but probably best to find out the best way of doing it.

30nov02 jcw - Ah, ok, hence the reference to Zope - a site, ready to be filled in by content providers, comes to mind. One more thought: maybe WebDAV makes sense in this context? I'll try to come up with an example for storing data alongside VFS in the same file, one such use would be to have wikit store its pages in the same file.

1dec02 NEM - WebDAV is certainly very interesting. I'll read through the RFC and see which bits make sense here (probably a large proportion of it). If I can use an existing standard, then so much the better.

30 Nov 2002 escargo - I don't know if it is practical, but perhaps your permissions could be more general. In a Multics system, one of the permissions was append. That meant that you could not modify existing content, but you could add to the end. (This applied to directories, but the notion should be transferable to other domains.) This would be more of a notes file capability than a wiki capability, but still might be worth considering.

1dec02 NEM - Good idea. The permissions were off the top of my head. Append is a good one. Can we come up with a complete list? Maybe it would be possible to allow a site maintainer to define their own set of permissions? Hmm.. I think a reasonable list would probably be best. Time to think of use-cases, I guess.

25jan03 NEM - A complete list of permissions for the initial version will be:

  • page:
  1. read - allow reading only
  2. append - allow appending to the end of a page
  3. edit - full editing of a page
  4. create - create new pages
  5. delete - delete a page
  • section:
  1. create - can create new sub-sections
  2. delete - can delete sub-sections
  3. admin - can alter permissions on this section. Also general access to alter the scripts associated with the section.

This would be specified as a single byte of flags:
    page       section
    r a e c d  c d a

In general, anonymous users would have just page read access (permission 10000000), whereas a webmaster would have 11111111. People could be designated as editors of a section with permission 11110000 - i.e. they have the ability to read, append to, edit and create pages, but cannot delete pages or change permissions.
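
A tiny sketch of how such a permission string might be tested; purely illustrative (the allowed proc and the dotted permission names are made up):

 # Map each permission name onto its position in the 8-bit string.
 array set permIndex {
     page.read 0  page.append 1  page.edit 2  page.create 3  page.delete 4
     section.create 5  section.delete 6  section.admin 7
 }

 proc allowed {perms what} {
     global permIndex
     return [expr {[string index $perms $permIndex($what)] eq "1"}]
 }

 # allowed 11110000 page.edit      -> 1
 # allowed 10000000 section.admin  -> 0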

Another item for the implementation will be to associate a lock with each section, so that updating of the database can be done safely, and with a per-section lock granularity. This could be upped to a per-page lock, if deemed necessary (I think per-section will be acceptable for most sites). This locking will be done automatically in the VFS layer, so section handlers need not worry about it. To start with, locking will be implemented for a single-threaded tclhttpd implementation. Later work will expand this to work with threads, and CGI. CGI is the most difficult, as without marshalling all accesses through a single process it is difficult to perform effective locking. Lock files would have to be used (for CGI), but these are nasty. A possible implementation would have all updates written as separate files into a directory, and then a separate process would lock the whole database and apply all the changes at some point in time (for instance, when the web-master logs in and runs a command).
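
A rough sketch of that CGI spooling idea, under the assumption that pages live in a Metakit pages view with name and content fields; spoolUpdate and applyUpdates are hypothetical names:

 package require Mk4tcl

 # Each CGI request drops its change into a pending directory...
 proc spoolUpdate {dir page content} {
     set fname [file join $dir "[clock clicks]-[pid].upd"]
     set f [open $fname w]
     puts $f [list $page $content]
     close $f
 }

 # ...and a single maintenance run later applies them all while nothing
 # else is writing to the database.
 proc applyUpdates {dir dbfile} {
     mk::file open db $dbfile
     foreach fname [lsort [glob -nocomplain -directory $dir *.upd]] {
         set f [open $fname r]
         foreach {page content} [read $f] break
         close $f
         set rows [mk::select db.pages name $page]
         if {[llength $rows]} {
             mk::set db.pages![lindex $rows 0] content $content
         } else {
             mk::row append db.pages name $page content $content
         }
         file delete $fname
     }
     mk::file commit db
     mk::file close db
 }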

Time to get coding...

26jan03 NEM - Well, starting coding has brought me round to a new implementation idea:

Generalize Starsite into a Persistent, Authenticated Object System

After spending some time yesterday contemplating design issues, I have hit upon a design which I think could be useful. Instead of writing starsite as an application with a secure database API, why not write a secure, persistent web application framework and then implement StarSite in that system?

The details:

  • Applications are written in terms of objects and classes. This is the standard OO bit, but there are some differences.
  • All member data is stored directly in a metakit database.
  • Access to read and write the member data is performed through an API.
  • The API is authenticated. By this, I mean that all object instantiations are performed in the context of a safe interpreter, and with the permissions of a particular user (more details below).
  • Object data can be anything, but it is always stored with a mime-type tag. This allows the same object to reference different data depending on the mime-type that is being requested. Wildcard mime-types would be allowed. The main benefit of this is that you can group content together but have alternative versions for different output interfaces - e.g. image/gif for web browsers, image/bmp (or whatever the mime-type is) for WAP devices. This greatly simplifies the task of the web application, as it can behave consistently with little regard for what client is using it. The object system would determine the correct mime-type (or most specific match) and load the appropriate data. Of course, there will be a mechanism for accessing the other mime-type data if necessary. (A small sketch of one possible matching rule follows the class example below.)
  • The basic look of a class definition would be:
 class Foo {
     field title
     field body

     method foo {args} {
         # Accessing a member field:
         $this get title
         # Accessing a specific mimetype
         $this get title -mimetype text/html
         # Setting a field
         $this set body "<h1>Hello, World!</h1>"
         # Setting for particular mimetype
         $this set body "<header>Hello, World!</header>" -mimetype text/xml
         # What is the mimetype requested?:
         $this mimetype
         # Change the mimetype (changes the HTTP output headers as well)
         $this mimetype "text/xml"
         # Append to a field
         $this append title "<h2>This is a comment</h2>"
         # etc.
     }

     method foo image/gif {args} {
         # Override the general foo method for image/gif mimetype requests
     }
 }
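
As an aside, here is one possible reading of the wildcard matching mentioned in the list above, with stored entries allowed to be glob-style patterns such as image/*; bestMime is just an illustrative name:

 # Pick the stored variant that best satisfies the requested mime-type.
 proc bestMime {requested stored} {
     # Exact match wins...
     if {[lsearch -exact $stored $requested] >= 0} {
         return $requested
     }
     # ...otherwise let stored wildcard entries match the request.
     foreach type $stored {
         if {[string match $type $requested]} {
             return $type
         }
     }
     return ""   ;# no suitable variant
 }

 # bestMime text/html {text/html image/*}  -> text/html
 # bestMime image/gif {text/html image/*}  -> image/*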

Objects can be instantiated in a hierarchy. There is always one main object for the site, which is the starsite object (or a different object if a different web application is being created). This is similar to the Tk widget/object structure, with "starsite" or whatever being the root object which always exists. You can then do:
 Foo /starsite/myfoo

and this will create an object named "myfoo" of class Foo under the starsite object.

Calls to the objects are handled by processing incoming HTTP requests. For instance, if a request came in to:
 http://my.server.com/cgi-bin/starsite/myfoo/foo?arg1=hello&arg2=world

This would cause a lookup to see what the most specific object being requested is. In this example it is "myfoo" (otherwise it would be "starsite" - the root). So, a call is made to the "foo" method of the object "myfoo", like so:
 $myfoo foo {arg1 hello} {arg2 world}

There are a couple of details here too:

  1. If a call was made to /cgi-bin/starsite/myfoo?arg1=hello&arg2=world the system would try to look for a method "myfoo" on the object "starsite". If no such method exists, then the call is redirected to /cgi-bin/starsite/myfoo/?arg1=hello&arg2=world. This is done with a standard HTTP redirect message.
  2. If a call comes in to /cgi-bin/starsite/myfoo/?arg1=hello&arg2=world, then the call is sent to the "default" method of the object. This method must always be implemented.
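
A hedged sketch of that lookup, where invoke, hasMethod and httpRedirect stand in for the real object system and HTTP layer:

 proc dispatch {url queryArgs} {
     if {[string index $url end] eq "/"} {
         # Trailing slash: call the object's "default" method (detail 2).
         return [invoke [string trimright $url /] default $queryArgs]
     }
     set obj    [file dirname $url]
     set method [file tail $url]
     if {[hasMethod $obj $method]} {
         # e.g. /starsite/myfoo/foo -> method "foo" on object /starsite/myfoo
         return [invoke $obj $method $queryArgs]
     }
     # No such method: assume the last element names an object and
     # redirect to its default method (detail 1).
     return [httpRedirect $url/ $queryArgs]
 }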

It will be possible to set up objects as redirections:
 starsite::redirect /starsite/newfoo -> /starsite/myfoo

would cause all method invocations on /starsite/newfoo to result in an HTTP redirect to /starsite/myfoo.

A mapping is maintained between hierarchy positions (URLs) and objects. This is a strictly one-to-one mapping.

Each object has the following properties associated with it:

  • An owner - the userid of the person who created, or otherwise owns this object. This person has full control over the object.
  • An access-control list. Specifies which groups of users have what permissions for accessing this object.
  • A flag saying whether changes to this object should be archived. This is a hook for version-control systems.
  • A database lock. This lock must be acquired before changes to the object can be made. I may implement two locks - a read lock (shared) and a write lock (exclusive). I may even implement this as synchronized blocks like Java, but I'm not sure about that. Needs thought as to what the best mutual-exclusion policy is.
  • A collection of member data, organised by mimetype.
  • A collection of methods. It will be possible to flag whether methods are public (accessible through direct HTTP calls), or private (accessible only to other methods within the same object). All public methods will automatically cope with SOAP requests (and probably WebDAV) too.

The initial method of authentication will be based on usernames and passwords. All the object system cares about is having a user context in which to run objects. To this end, the method by which the username and password are sent to the object system is left up to the specific application. This could be simple HTTP Basic Authentication, or it could be a secure connection over SSL. When a user authenticates, a session will be created for that user and returned. This session consists of a constructed safe interpreter with all the necessary API aliases created in it. The application can then set up some sort of session key (again, how this is done is an application implementation detail - it could be a cookie, or some other, more secure session key). All requests for that session are processed within the context of that interpreter. This allows per-session data to be held as well (in global variables). The system will be configurable to destroy sessions after a given amount of time of inactivity.
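
A small sketch of that session handling, reusing the hypothetical ss::makeSession from the earlier access-control sketch; a real implementation would use a cryptographically strong key and reset the timer on each request:

 array set session {}

 proc newSession {user {timeout 1800000}} {
     # Expire after 30 minutes by default (timeout is in milliseconds).
     set key [format %08x [expr {int(rand() * 1000000000)}]]
     set ::session($key) [ss::makeSession $user]
     after $timeout [list expireSession $key]
     return $key
 }

 proc expireSession {key} {
     if {[info exists ::session($key)]} {
         interp delete $::session($key)
         unset ::session($key)
     }
 }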

Phew! This is another brainstorming session. It is quite possible that I am just creating another method of creating web applications, when I should be sticking to standards like WebDAV. To this end, I'm going to read up on WebDAV and SOAP a lot more before implementing this. I quite like the idea though of designing web applications purely around the logic of the site, providing content in whatever forms you want, and then letting the system worry about which content is suitable for which client. An extension to this would be to write content in XML and the system deals with selecting and applying XSLT stylesheets. However, the method proposed above is more general and can deal with selecting static content such as images, which are not easily expressed in XML (although there is SVG...). As an XML parser will have to be present to handle SOAP and WebDAV requests, providing XML and XSLT capabilities would be pretty easy.

Well, once again - comments welcome! Cheers, Neil.

NEM 22May2003 - Haven't looked at this in a while. Now my final exams are coming to a close, I'd better start thinking about implementing all these ideas. In the meantime, I'm going to play around with as many open source/free projects of a similar nature (both Tcl things like Apache Rivet and OpenACS/AOLServer, and also non-Tcl stuff like [Zope]), to get a feel for what is useful in such a beast. Please read the above and comment, although this page is getting pretty big now. Please note, also, that I'm not at all sure if the above design is good anymore. Some tell-it-like-it-is criticism would be welcome!

escargo - I had been asking about e-mail notification of changes to wiki pages for this wiki, but part of the problem is that it realistically requires authentication before people can sign up (since you want to prevent people from signing up anyone but themselves for notifications). You are already planning on doing authentication. You could extend your design (or allow hooks that could be used to extend it) so that if a wiki page changed (or one of your more generalized objects), then an action could be triggered (in my case, an e-mail notification could be generated). This would require that user data include an e-mail address; objects would have to have change listeners; somewhere there would be per-object metadata (for the listener list and the application data needed to keep the mail notification list). This is just "bells and whistles" as far as your desired features are concerned, but I thought this might give you an idea of how somebody might want to build on what you are doing.

NEM - Good idea. This can probably be added in by the application author, if I design things right, and wouldn't have to be present in the base system. Nevertheless, I'll bear this in mind. Right now, I'm thinking performance. The problem with my "everything's an object" system above is that it adds unnecessary overhead to things which are just static content (images, static HTML etc). For these, I'm thinking of object wrappers which are automatically created when a feature is needed, but not before. This way if a piece of data is just returned without any processing, it has no overhead. Ahh... much more thinking is needed on this.

PT 23Jun2003: I've recently been looking at Twiki, which does support e-mail notification. There the user's home page contains the e-mail address to use, and each 'web' (a Twiki term which basically means a sub-tree of the entire wiki) can be configured to enable e-mail notification on a per-user basis by adding usernames to the relevant page. Authentication is provided by having the user registration script automatically append a suitably constructed line to the site's .htpasswd file. In that wiki, changes tend to require authentication, so it's always possible to know who made what change.

I'm not certain that we need such overhead on this site, but Twiki has been designed to be useful in a corporate intranet environment, where traceability is important. As a side note, Twiki is reasonably easy to install under Unix and Windows.

DAG: See also CHMvfs.