Current development on JAMWiki is primarily focused on maintenance rather than new features due to a lack of developer availability. If you are interested in working on JAMWiki please join the jamwiki-devel mailing list.

Tech:Zipped topic content

This page (and all pages in the Tech: namespace) is a developer discussion about a feature that is either proposed for inclusion in JAMWiki or one that has already been implemented. This page is NOT documentation of JAMWiki functionality - for a list of documentation, see Category:JAMWiki.
Status of this feature: NOT IMPLEMENTED.

Description

To conserve disk space, provide an option to store topic content in zipped format.

Author(s)

Status

Some code committed to /branches/ncsaba

Comments

Store less text for topics

Moved from Roadmap#Store less text for topics

One more possibility is to zip the current content and the content of all versions. That will save space even if there is only one version, but it will add processing time and memory overhead for zipping and unzipping. I experimented a bit with this, and it looks like zipping reduces the space used by a factor of roughly 2 to 5. If the string is small, zipping actually results in a slightly bigger size: using the java.util.zip code and forcing it to use no headers, the overhead is around 2 bytes. For random data zipping does not help (it makes the data bigger; I experimented with this too, and it adds ~10 bytes), but topic content is far from random. I actually started to implement this: I added a PROP_TOPIC_COMPRESSION_ON property to Environment, a new field in admin.jsp to set it, the corresponding admin.caption.usecompression translation resource, and isolated the places where zipping/unzipping could be done transparently if so configured: DefaultQueryHandler#insertTopic/updateTopic/insertTopicVersion (zipping) and DataBaseHandler#initTopic/initTopicVersion (unzipping). I have relatively simple code in place to do the zipping/unzipping. Now the big stumbling block: the result of zipping is a byte array, and that won't work well when saved into a Unicode text database field. Even if I sort out all the possible encoding problems (which I can do), it will still be suboptimal, because storing the zipped bytes as text reinflates them by a factor of ~1.5: UTF-8 uses 2 bytes for half of the byte values (128-255). So the optimal solution for zipped storage would be a binary field, which poses the problem that the field type would differ based on a setup property... which I really don't like. Not that I have any other solution which I like... do you have any better ideas?
I will have to implement some kind of zipping if I want to import the en wikipedia; I don't have enough disk space on the machine I'm doing this on to hold it unzipped ;-) TIA, Csaba 16-Nov-2006 09:22 PST
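For reference, the header-less zipping described above can be sketched with java.util.zip roughly as follows. This is only an illustration under assumptions, not the code in /branches/ncsaba: the class name TopicCompression and its method names are made up for this example. The utf8SizeIfStoredAsText helper shows where the ~1.5 factor comes from: if every byte value is stored as one character, the 128 values below 128 take 1 byte each in UTF-8 and the other 128 take 2 bytes each.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical helper for illustration; not part of JAMWiki. */
class TopicCompression {

    /** Deflates the UTF-8 bytes of the content; nowrap=true omits the zlib header and checksum. */
    static byte[] compress(String content) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION, true);
        try {
            deflater.setInput(content.getBytes(StandardCharsets.UTF_8));
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    /** Reverses compress(); throws IllegalArgumentException on corrupt or truncated input. */
    static String decompress(byte[] zipped) {
        Inflater inflater = new Inflater(true);
        try {
            // The Inflater Javadoc asks for one extra "dummy" input byte when nowrap is used.
            inflater.setInput(Arrays.copyOf(zipped, zipped.length + 1));
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("Truncated or invalid compressed data");
                }
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("Invalid compressed data", e);
        } finally {
            inflater.end();
        }
    }

    /** Bytes needed if each raw byte is stored as one character (U+0000..U+00FF) in a UTF-8 text field. */
    static int utf8SizeIfStoredAsText(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String text = "Some '''wiki''' markup, repeated. Some '''wiki''' markup, repeated.";
        byte[] zipped = compress(text);
        System.out.println("plain bytes:      " + text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("zipped bytes:     " + zipped.length);
        System.out.println("zipped as text:   " + utf8SizeIfStoredAsText(zipped));
        System.out.println("round trip equal: " + text.equals(decompress(zipped)));
    }
}
```

The actual savings depend heavily on the content, so the numbers printed by main should be treated as a quick check rather than a benchmark.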

I'm thinking as I type here, but to address the byte array issue what about adding a topic_content_zip column to the jam_topic_version table that stores a relative filename, and then saving the zip content to the filesystem? For sites using zipped content the topic_content column would contain null values, and the topic_content_zip column would be used. It's not the cleanest solution, but that's the first idea that came to mind. -- Ryan 16-Nov-2006 10:11 PST
OK, this is getting a bit better. I was actually thinking about something similar while commuting on the train :-) So how about having another column, but making it a binary field rather than a pointer to a file? DBs are better at handling millions of rows than the file system is at handling millions of files... BTW, some DBs (notably postgres) will zip text fields by default if they exceed a certain size (see TOAST tables in postgres). So the zip optimization is only needed for DBs which don't do that, and I think h2 is not one of them... Anyway, if I add a topic_content_zipped field to the table, the reading code can be ignorant of the "zipping_on" setting and take whichever field is set. This way you can change the setting on a running system, and it will take effect from then on... this was another dilemma, what to do when somebody changes the setting on an already running system, but this way it would just work. So I'll go on and implement it this way... Cheers, Csaba 17-Nov-2006 01:25 PST
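The "take whichever field is set" idea can be sketched as below. This is a simplified illustration, not the branch code: TopicRowCodec and its nested Row class are hypothetical stand-ins for a database row with a text column (topic_content) and a binary column (topic_content_zipped), and no JDBC is involved. Only the write path looks at the compression setting; the read path does not need it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical sketch; the Row class stands in for one database row. */
class TopicRowCodec {

    static final class Row {
        final String topicContent;       // text column, null when the content is zipped
        final byte[] topicContentZipped; // binary column, null when the content is plain

        Row(String topicContent, byte[] topicContentZipped) {
            this.topicContent = topicContent;
            this.topicContentZipped = topicContentZipped;
        }
    }

    /** Write path: the only place that consults the compression setting. */
    static Row toRow(String content, boolean compressionOn) {
        if (!compressionOn) {
            return new Row(content, null);
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(bytes)) {
            out.write(content.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Row(null, bytes.toByteArray());
    }

    /** Read path: ignores the setting and uses whichever column is populated. */
    static String fromRow(Row row) {
        if (row.topicContentZipped == null) {
            return row.topicContent;
        }
        try (InflaterInputStream in =
                new InflaterInputStream(new ByteArrayInputStream(row.topicContentZipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Row before = toRow("written before the switch", false);
        Row after = toRow("written after the switch", true);
        System.out.println(fromRow(before));
        System.out.println(fromRow(after));
    }
}
```

Because fromRow never consults the setting, rows written before and after the property is switched can coexist in the same table.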
I'll be interested in seeing how that solution works - let me know if you have any questions or need any help with the coding! I'd like to give some thought to how this feature could potentially be implemented as an extension since its usage will be fairly specialized, but since JAMWiki doesn't have an extension architecture yet there's a lot of planning that will be needed first. -- Ryan 17-Nov-2006 07:36 PST
(De-indenting a bit) Ryan, I have not had much time to work on this, hence the slow progress. But I did have some, and studied the different databases a bit to see how well a separate binary field would work. The conclusion, sadly, is that most databases have proprietary extensions for handling binary content, and most of them have more than one possible solution... some overlap, some don't. So there's no chance of making a default implementation which works with most of them. So then I started to think about the extension idea, and I think it could actually work well. Instead of having 2 fields, I could put a special syntax in the content field, say "ZIPID:12345", and then register some kind of content filter as a plugin, which will intercept the content and look it up in the special zipped_content table under id "12345". Much like a redirect... which could be made a plugin too ;-) Or a spell checker, highlighter, or any other thing which filters content. In fact I think most of the functionality could be made into separate plugins, most of them shipped with the core but separated as code. If you go and do this exercise (rip core functionality out into separate plugins), you will end up with an extensible framework :-) (replacing smileys with icons is the next plugin I can think of ;-) ) If you think the extensibility thing is important enough, I can postpone the zip stuff and help with pluginizing. I have no experience doing this, but I think there are plenty of good examples out there, Eclipse being one which is completely pluginized (all its functionality comes packaged in plugins). Studying Eclipse's plugin system will help, I guess. From what I remember about it, plugins attach to defined "extension points" and provide new "capabilities". The plugin manager handles dependencies (based on capabilities needed/provided), and plugins can define their own extension points where other plugins can hook in...
I think the user lookup would be such a capability, and you could have alternative plugins implementing it like "DB user lookup", "LDAP user lookup", etc... of course it is not that easy to identify all these capabilities and extension points, but doing things step by step would make it easier. For example the different databases could be different plugins providing the same capability (DB storage). Things get a bit complicated when a plugin needs special code depending on which flavor of a capability it uses: e.g. the content zipping plugin would have different code for different databases... I have no good idea for solving such problems. It seems that the 2 plugins cannot be cleanly separated, so it might be necessary to add a "flavor" thing to the capabilities, so that plugins which depend on a capability do different things based on the "flavor" of the currently selected capability. OK, I have to stop now. I hope I didn't get too abstract on this... Cheers, Csaba 21-Nov-2006 02:15 PST
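To make the "ZIPID:12345" idea a bit more concrete, here is a rough sketch of such a content filter. Everything in it is an assumption made for illustration: the class name ZipIdFilter, the onSave/onLoad method names, and the in-memory map standing in for the zipped_content table (which would really hold compressed bytes in a database).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical content filter; the map stands in for the zipped_content table. */
class ZipIdFilter {

    static final String MARKER = "ZIPID:";

    private final Map<Long, String> zippedContentTable = new HashMap<>();
    private long nextId = 1;

    /** Stores the real content out of line and returns the marker to put in the content field. */
    String onSave(String content) {
        long id = nextId++;
        zippedContentTable.put(id, content);
        return MARKER + id;
    }

    /** Resolves a marker back to the real content; anything else passes through untouched. */
    String onLoad(String stored) {
        if (stored == null || !stored.startsWith(MARKER)) {
            return stored;
        }
        long id = Long.parseLong(stored.substring(MARKER.length()));
        String content = zippedContentTable.get(id);
        if (content == null) {
            throw new IllegalStateException("No zipped content stored under id " + id);
        }
        return content;
    }

    public static void main(String[] args) {
        ZipIdFilter filter = new ZipIdFilter();
        String stored = filter.onSave("The real topic text");
        System.out.println(stored);                // marker kept in the content field
        System.out.println(filter.onLoad(stored)); // real text after interception
    }
}
```

One open question with this approach is a topic whose genuine text happens to start with "ZIPID:"; a real implementation would need an escaping rule or a less collision-prone marker.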
JAMWiki 0.5.0 will be the first JAMWiki release to begin a plugin/extension framework. The 0.5.0 UserHandler interface can be used to implement different user authentication mechanisms (LDAP, JAMWiki database, etc), and the new jamwiki-configuration.xml file provides a way of making the system aware of what extensions are available. The parser will also be an extension that can be configured via jamwiki-configuration.xml, and the DatabaseHandler (and all of the associated QueryHandlers) may follow a similar path, potentially allowing someone to implement their own data-storage and retrieval method. Parser tags are more difficult to convert to plugins due to the JFlex architecture, so I'll likely have to choose a different lexer before smiley tags become plugins.
Creating a plugin such as the one you are suggesting would (in the currently planned framework) be fairly complex - as you've pointed out. For now I'd prefer to keep things simpler, and add finer-grained plugin/extension abilities as time goes on and we figure out what the issues are going to be. In the meantime I suspect that if an interface is created for a generic "data handler", you would be able to implement the functionality for your DVD project by creating a custom data handler that stores data in your own preferred format. Development of 0.5.0 is going slowly, but what I would envision is something like the following (this is just an example, and may not be how things are actually implemented):
  • interface DataHandler
    • abstract class DatabaseHandler implements DataHandler
      • class PostgresDatabaseHandler extends DatabaseHandler
      • class HSQLDatabaseHandler extends DatabaseHandler
    • class MyCustomDataHandler implements DataHandler (just an example)
...etc, etc. Your class could either extend DatabaseHandler, HSQLDatabaseHandler, or implement DataHandler on its own, depending on what your needs are. At least, that's my current thinking. It will be a few more weeks before I have working code available, but let me know if that sounds like something that might work for you. -- Ryan 21-Nov-2006 12:27 PST
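In Java terms, the hierarchy listed above might look roughly like the skeleton below. This is only a sketch of the shape, not the 0.5.0 API: the writeTopic/lookupTopic methods and the databaseType hook are invented for illustration, the JDBC code is left out, and everything is wrapped in a holder class so that it compiles as a single file.

```java
import java.util.HashMap;
import java.util.Map;

/** Holder class so the sketch compiles as one file; all method names are hypothetical. */
class DataHandlerSketch {

    /** Generic storage contract. */
    interface DataHandler {
        void writeTopic(String topicName, String content);
        String lookupTopic(String topicName);
    }

    /** Shared database logic would live here; the actual SQL is omitted. */
    abstract static class DatabaseHandler implements DataHandler {
        /** Hypothetical hook for database-specific behaviour. */
        abstract String databaseType();

        @Override
        public void writeTopic(String topicName, String content) {
            throw new UnsupportedOperationException("JDBC code omitted for " + databaseType());
        }

        @Override
        public String lookupTopic(String topicName) {
            throw new UnsupportedOperationException("JDBC code omitted for " + databaseType());
        }
    }

    static class PostgresDatabaseHandler extends DatabaseHandler {
        @Override
        String databaseType() { return "postgres"; }
    }

    static class HSQLDatabaseHandler extends DatabaseHandler {
        @Override
        String databaseType() { return "hsql"; }
    }

    /** Custom handler that implements DataHandler directly; a map stands in for real storage. */
    static class MyCustomDataHandler implements DataHandler {
        private final Map<String, String> topics = new HashMap<>();

        @Override
        public void writeTopic(String topicName, String content) {
            topics.put(topicName, content);
        }

        @Override
        public String lookupTopic(String topicName) {
            return topics.get(topicName);
        }
    }

    public static void main(String[] args) {
        DataHandler handler = new MyCustomDataHandler();
        handler.writeTopic("StartingPoints", "Welcome");
        System.out.println(handler.lookupTopic("StartingPoints"));
    }
}
```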
The problem I see with any design is that some plugins just can't be separated cleanly. The most obvious example is data storage and just about any other plugin which needs data storage... either you define a generic data storage interface and plugins use just that (with the disadvantage that there's no clear separation of concerns in the data schema, and performance cannot be heavily optimized), or you define specific tables for specific tasks, and then all plugins need to define their own tables, which means the plugins need to provide database code in addition to the different data storage plugins which might exist. For exotic data storage needs like storing binary data, the plugin's database code could easily depend on the flavor of the database itself... my problem is that I can't really see a data storage interface that can be used by all future plugins without needing to be extended. And extending it in the new plugin is problematic (as you propose above with MyCustomDataHandler), as you can only extend the existing DB flavors (I wouldn't want to force people to use plain files if they want zipping when they have a preferred DB), and new flavors then need to be extended for all plugins which extend data storage, and the separation of concerns is gone... extending the data storage plugin interface itself to accommodate new plugin needs is also problematic, as that would make it unnecessarily big (it would contain all code for all plugins, regardless of whether they are deployed), and a plugin writer who adds a simple plugin could easily be forced to extend 10+ data storage implementations to make it work...
So I guess the only reasonable way to do it is to have specific data structures for core features and provide generic data storage options for external plugins... and probably some similar philosophy would work with other things too, by smoothing the way for core things and provide generic interfaces for other plugins.
Other concerns: plugins will need more interaction with the core than just 1 interface... they could need GUI space in different places (admin GUI, check boxes in the edit page to enable/disable, you name it), and could want to hook into more than one part of the system. I doubt that can be achieved by simply implementing only one interface... the plugin itself should register itself in the different places where it needs to take action, and each of those can have its own interface.
Now I guess the smiley plugin needs deeper interaction with the parsed content, but things like redirection, zipping, and I think even revision history can be implemented as content storage "interceptors". I will spend some time later to see how feasible it would be to implement it that way. I think it would make sense to have an svn branch where I can check in my trials so you can easily review them... I'm used to writing lots of throw-away code, just for trying things :-) My user name on sourceforge is ncsaba too. Cheers, Csaba 22-Nov-2006 02:23 PST
Update: I have zipping functioning. I did it using the "second field" approach discussed above... unfortunately, for this I would need to modify all the existing database properties files, and possibly some of the classes too... and I don't have all the databases handy to test. This is one of the difficulties in writing an extension which needs data storage. In any case, it works fine as designed: the original pages are unzipped, and the ones I added after switching the property are zipped. Now I can start loading it ;-)
On another note: I had to modify a few classes to get jamwiki working on my setup... for some reason the cookies are not set (maybe because the address is localhost, or because it's a non-standard port, or because tomcat is serving the pages directly? no idea), and the session id which tomcat adds to compensate is not added to the links. So I modified LinkUtil#buildInternalLinkUrl to add the session id to the URL, and only then could I save the admin preferences (including the zipping option)... another thing was that I set pageInfo.setSpecial(true); in AdminServlet#view(...), otherwise it threw an exception because the topicName was not set.
I'm not sure how I can contribute the changes... I would prefer the above mentioned way, a branch in svn I can check in to. Cheers, Csaba 22-Nov-2006 09:39 PST
I've added your account, so please feel free to create a branch in Subversion. I would ask that for now you please not make commits to the trunk unless the change is for something obvious, like a clear bugfix. If you want to you can use a tool like svnmerge (see How to Help#Programmers) to keep your branch in sync with the trunk. I'll be out of town for the next few days so unfortunately I won't be around to answer questions, but I should be back by Monday or Tuesday. Let me know if you need anything else! -- Ryan 22-Nov-2006 11:16 PST

Importing from mediawiki dump

Hi all,

I want to set up an off-line wikipedia DVD.

To achieve this I plan to use a small Java program which starts up an embedded tomcat, with JamWiki deployed with the wikipedia pages imported, and an embedded browser component using JDIC. The user should then browse the wiki, and when the browser is closed, tomcat is shut down too. The DVD could contain the JRE for different OS-es (Windows & Linuxes for starters, the rest I can't test), and autorun setup for the different OS-es.

I already managed to get a working JamWiki + h2db + tomcat + web browser (on linux, later I'll make sure it works on windows too, if it doesn't out of the box)...

Now the challenging part: loading a mediawiki dump into JamWiki. I've got the mwdumper sources from the mediawiki svn, and ran it on huwiki (Hungarian is my native language). The problem is that the DB structure of mediawiki is significantly different from the JamWiki DB structure, and before I delve into solving the impedance mismatch I would like to know if anybody has tried something similar...

BTW, h2db works out of the box using the HsqlDB db type (I actually created H2QueryHandler/sql.h2.properties as copies of HSQLQueryHandler/sql.hsql.properties, but no changes were needed). Now that I think of it, I do get some errors on the admin page, which I haven't hunted down yet, but the core functionality seems to work fine.

Just a side note: would it make sense to zip the topic text if it exceeds a certain size? It would be a relatively simple and transparent addition to the DB code, and it would make for huge savings on big wikis (like the enwiki wikipedia).

Another side note: all the code I write for the offline wikipedia project will be made public under some open-source license once I have a base version running, along with a description of how to build the DVD. Distributing the ready DVD would pose some legal challenges, I'm afraid...

Cheers,

Csaba.

The differences between the Mediawiki database schema and the JAMWiki schema are due to the licenses - since Mediawiki is licensed under the GPL and JAMWiki is licensed under the LGPL I've specifically avoided looking at any Mediawiki source code to avoid any possible licensing issues. As a result, the best option for importing from Mediawiki into JAMWiki might be to use XML. That approach will be much, much slower, but it avoids the need to have the same database format. Mediawiki already has a Special:Export page, and some work was started to implement this functionality in JAMWiki, but there are some technical hurdles to overcome so it's very incomplete - see Special:Import and Special:Export and the supporting code. I would eventually like to be able to import content from Mediawiki, but due to issues such as how to handle author information (authors on one system probably won't exist on another) it keeps getting pushed back. If you have ideas for fully supporting this feature I'd be interested.
As to zipping the topic text, take a look at Roadmap#Store less text for topics, which has a similar goal. I'm still not sure what the best way to implement this sort of functionality would be, but suggestions and discussion are welcome.
Also, if you get your H2 database instance fully working, please feel free to submit the code! Several people are using that database, and it would be great to have full support for it in JAMWiki. Thanks for the feedback. -- Ryan 08-Nov-2006 11:17 PST
Regarding licensing issues: the mwdumper code is Java, and licensed under an MIT-style license, so I guess it's OK to use just THAT code, without peeking at the rest of the mediawiki code. I guess the XML format of the mediawiki dumps is not encumbered by licensing issues either, so that could be used for import too. In any case, a direct import feature from the mediawiki dump would not necessarily be slow if done properly... I will study the jamwiki import/export code together with the mwdumper code and try to implement it.
Regarding zipping: zipping is orthogonal to the diff issue, they can be done separately. What I mean is: there must be at least one full version of the article, and that can be zipped. The patches in turn can be zipped or not depending on size. One possible improvement could be to zip all the patches together, that would likely yield better zipping ratio, but somewhat slower access time... but if the full article stored is always the last one, then access time for old versions is usually not a concern.
Regarding h2 and code: I will for sure contribute all my working code, once it IS working and I tested it ;-) -- Cheers, Csaba 09-Nov-2006 03:01 PST
Feel free to add your comments about zipping topic data to Roadmap#Unscheduled if you'd like to so that we have that on the future roadmap. I don't think it would necessarily be a good thing to make zipping topic content a default, since most sites will be more concerned with performance than disk usage, but it would be a good thing to offer as an option. Also, you're right that an XML schema isn't something that would be covered by the GPL, so using Mediawiki's format should be fine and (if I remember correctly) is what the existing import/export code was trying to do. I haven't looked at that code in several months so it is probably no longer working, but it shouldn't be too difficult to implement - problem areas included trying to come up with some way to ensure that author history is maintained, since a user account on one wiki probably won't exist on another, and issues with ensuring ID values were unique.
I'll be looking forward to having H2 fully supported! Let me know if you need any help or have any questions! -- Ryan 09-Nov-2006 14:45 PST
I've implemented the DataHandler interface as discussed on Roadmap#Store less text for topics, which may give you some additional flexibility with your plan to store zipped topic content - I believe it should now be fairly easy to implement a custom DataHandler that overrides any methods you need. The code is in Subversion, please let me know if you have any comments. Please also note that there will be further changes, so don't treat this as a frozen API! -- Ryan 28-Nov-2006 01:12 PST
I've had a look at the new DataHandler interface, and I think it is not exactly heading in the right direction. The problem is that it does not solve the problem... or I don't properly understand how it should work. The thing is that the QueryHandler interface still has to have a way to save the zipped field instead of a text field, so only extending the DataHandler does not help... and the DataHandler still comes in x flavors, one per DB. That's not useful: I can't extend all DB flavors, and I still want my zip thing to work with all DBs... ideally there would be just 1 generic DataBaseHandler, and it would get a QueryHandler which executes all the stuff which is DB dependent. Anyway, I will check in my stuff after I manage to merge my changes with the new DB stuff, and then you can take a look at what I have and get a better idea of what changes I would have needed...
A better setup would be one which allows placing a chain of filter-type things on saving a topic, so we can attach more than one plugin to the chain without ever touching the data storage implementation. That would also work for retrieval/parsing: I could imagine a progressive parsing process where different parsers (implementing different plugins) chain in to produce the final result.
Cheers, Csaba 28-Nov-2006 06:40 PST
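A filter chain of this kind could look roughly like the sketch below. All names in it (ContentFilterChain, ContentFilter, beforeSave, afterLoad) are made up for this example, and Base64 is used purely as a stand-in for a real transformation such as zipping, since it is reversible and easy to check by eye. Filters run in registration order on save and in reverse order on load, so the storage layer only ever sees the final result.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

/** Hypothetical filter chain; not part of JAMWiki. */
class ContentFilterChain {

    /** One plugin's hook into saving and loading topic content. */
    interface ContentFilter {
        String beforeSave(String content);
        String afterLoad(String stored);
    }

    /** Example filter: strips leading/trailing whitespace on save, does nothing on load. */
    static class TrimFilter implements ContentFilter {
        @Override public String beforeSave(String content) { return content.trim(); }
        @Override public String afterLoad(String stored) { return stored; }
    }

    /** Example filter: Base64 as a stand-in for a reversible transformation such as zipping. */
    static class Base64Filter implements ContentFilter {
        @Override public String beforeSave(String content) {
            return Base64.getEncoder().encodeToString(content.getBytes(StandardCharsets.UTF_8));
        }
        @Override public String afterLoad(String stored) {
            return new String(Base64.getDecoder().decode(stored), StandardCharsets.UTF_8);
        }
    }

    private final List<ContentFilter> filters = new ArrayList<>();

    ContentFilterChain add(ContentFilter filter) {
        filters.add(filter);
        return this;
    }

    /** Applies the filters in registration order; the result is what gets stored. */
    String save(String content) {
        for (ContentFilter filter : filters) {
            content = filter.beforeSave(content);
        }
        return content;
    }

    /** Applies the filters in reverse order to undo what save() did. */
    String load(String stored) {
        for (int i = filters.size() - 1; i >= 0; i--) {
            stored = filters.get(i).afterLoad(stored);
        }
        return stored;
    }

    public static void main(String[] args) {
        ContentFilterChain chain = new ContentFilterChain()
                .add(new TrimFilter())
                .add(new Base64Filter());
        String stored = chain.save("  hello  ");
        System.out.println(stored);
        System.out.println(chain.load(stored));
    }
}
```

Adding another plugin is then a matter of one more add(...) call, with no change to the data storage code.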
I agree that the DataHandler won't solve the problem of creating a universal solution to allow usage of different database schemas with different databases, but as you've pointed out that's a very difficult problem to address. My hope with having a more generic DataHandler is that it adds flexibility for individuals wanting to modify JAMWiki behavior - in the case of using zipped content for data storage, it should hopefully be a bit easier to create a single database handler (for example, H2) that uses zipped content. -- Ryan 28-Nov-2006 11:06 PST