
Tech:Performance

This page (and all pages in the Tech: namespace) is a developer discussion about a feature that is either proposed for inclusion in JAMWiki or one that has already been implemented. This page is NOT documentation of JAMWiki functionality - for a list of documentation, see Category:JAMWiki.
Status of this feature: IMPLEMENTED. Performance updates are part of most JAMWiki releases and will continue to be part of future releases, but the improvements outlined below were primarily implemented for releases up to and including JAMWiki 1.0.

Description

Author(s)

Status

Primary key auto-increment

Code to allow Postgres and MySQL to auto-generate primary keys has been committed to trunk. A brief review of documentation online indicates the following:

HSQL
Does not support the getGeneratedKeys method, although support is being added for version 1.9 (currently in the release candidate stage).
MS SQL
Supports IDENTITY column types, but I don't see an easy way to support upgrading existing installations - comments online indicate "You cannot change the identity property on an existing column. You can do it on new columns only."
Oracle
Appears to support auto incrementing only through the use of triggers.

At the moment support for this feature will be provided for Postgres and MySQL only. Anyone interested in adding support for other databases is welcome to do so provided the limitations pointed out above can be overcome. -- Ryan • (comments) • 08-Aug-2009 17:11 PDT
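
For reference, the JDBC mechanism behind this is Statement.RETURN_GENERATED_KEYS together with getGeneratedKeys(). A minimal sketch follows (simplified column list and a hypothetical helper class, not the actual trunk code):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class GeneratedKeySketch {

        /**
         * Insert a row and return the primary key generated by the database. Requires
         * an auto-generated key column (SERIAL on Postgres, AUTO_INCREMENT on MySQL)
         * and a JDBC driver that supports Statement.RETURN_GENERATED_KEYS.
         */
        static int insertTopic(Connection conn, int virtualWikiId, String topicName) throws SQLException {
            PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO jam_topic (virtual_wiki_id, topic_name) VALUES (?, ?)", // simplified column list
                    Statement.RETURN_GENERATED_KEYS);
            try {
                stmt.setInt(1, virtualWikiId);
                stmt.setString(2, topicName);
                stmt.executeUpdate();
                ResultSet rs = stmt.getGeneratedKeys();
                if (!rs.next()) {
                    throw new SQLException("Database did not return a generated key");
                }
                return rs.getInt(1);
            } finally {
                stmt.close();
            }
        }
    }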

Other changes

revision 2665 adds caching of data from the jam_wiki_user table - previously each page load queried this table twice. -- Ryan • (comments) • 09-Aug-2009 10:51 PDT

Comments

Comments from dfisla

  1. My modifications to jamwiki are mostly at the DataHandler and DataQuery level to handle large data sets. The WikiResultSet implementation will not work.
  2. I had to modify some of the SQL to work with auto-generated IDs; using max aggregates (or any aggregates) for determining next sequence numbers does not scale well past 10,000 topics.
  3. I had to implement content compression and decompression at the data API level to boost MySQL read performance, mostly to shrink row size so that I could fit as many rows as possible into memory (a rough sketch follows below).
  4. MySQL does not support full Unicode with utf8_general_ci collation for TEXT/CLOB fields, so I switched the DDL to use utf8_bin collation; this is a well-known bug/limitation of MySQL.
  5. For performance reasons I am using InnoDB on MySQL 5.4 compiled for a 64-bit platform. I implemented a bulk-loader with partitioning to allow for concurrent rebuilding of jamwiki without table locks (I went back and forth on MyISAM and InnoDB many times). Excluding wiki redirects, I can parse the wikipedia dump (22 gigs) in under 3 hours and load it into jamwiki with proper topic, topic_version, and recent change entities. I am working on category building as the existing approach is really slow.
  6. I am only focusing on MySQL 5.4 because none of the other databases meet my speed requirements. The ANSI SQL changes may not be compatible with other databases, though I never verified this. I know Oracle will need triggers for auto-generated IDs.
  7. I switched to log4j. I don't mind going back to the commons logging API, but log4j is something I use in all of my projects so I felt more at home - let me know what you think.
Some challenges:
  1. The JFlex-based parser is great, except that it is much slower than the bliki parser, which makes sense as it does more work parsing and validating the content.
  2. The JFlex-based parser just does too much work on the data: when parsing an article such as "Anarchism" it makes close to 400+ DB API calls. Even with caching and my machine capable of 5-6,000 qps, parsing 5 million+ topics just takes too long.
  3. The JFlex parser's lex syntax is too strict and wikipedia content is a bit loose. The parser often goes into infinite loops and performs very poorly. Wikipedia content is really messy; it reminds me of HTML 3.x or 4.x, where HTML parsers have to support broken standards just to render pages.
  4. I am currently looking into using the latest bliki parser to update the addons module.
  5. I am currently looking at replacing ehcache with a distributed hash table/data grid like Hazelcast. I spent some time looking at Hypertable and HBase, and a few other caches and distributed bigtable systems; they all suffer from a loss of performance under frequent writes/updates. Awesome for reading, not for rebuilding 15+ gigs of data.
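
To make item 3 of the first list above concrete, here is a rough sketch of content compression/decompression at the data API level using java.util.zip. This is an illustration only, not the code from the branch:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    public class TopicContentCompressor {

        /** Compress topic content before writing it to a binary content column. */
        public static byte[] compress(String topicContent) throws IOException {
            byte[] input = topicContent.getBytes("UTF-8");
            Deflater deflater = new Deflater(Deflater.BEST_SPEED);
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream(Math.max(32, input.length / 2));
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            deflater.end();
            return out.toByteArray();
        }

        /** Decompress content read from the database back into wiki text. */
        public static String decompress(byte[] compressed) throws IOException, DataFormatException {
            Inflater inflater = new Inflater();
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream(compressed.length * 4);
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                out.write(buffer, 0, inflater.inflate(buffer));
            }
            inflater.end();
            return new String(out.toByteArray(), "UTF-8");
        }
    }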

Initial Response

Some thoughts:

  • Regarding points #2 and #3 I'd be very interested if you could point out any MediaWiki articles that are particularly problematic - I'd like to make sure any such issues get fixed in the default parser.
  • There definitely hasn't been any optimization done for very large data sets, so any insights and code that help would be great. The only concern would be making sure that any changes are either in database-specific code (such as MySqlDataHandler) or safe to implement across databases (and thus safe for the generic code such as AnsiDataHandler).
  • ehcache was chosen a while back based on a few reviews, but if there's a better solution and it can be easily swapped in then it's definitely worth investigating.

Thanks for your investigations - hopefully these changes can make their way into the next release! -- Ryan • (comments) • 28-Jul-2009 20:44 PDT

I've finally gotten a bit of time to review the changes, and here are some early comments:
  • Could you merge the latest trunk code? It looks like we're out of sync by about a week. Alternatively, if there's no plan to use Subversion's merge tools then being out of sync probably doesn't matter much.
  • Very minor point, but again assuming merging would be done with Subversion's tools it would be great if the new code could use tabs for indentation to match existing style.
  • In the past I've gotten bug reports due to people having different versions of log4j already in their paths, which is why the SDK logging is used. I wouldn't be against building some sort of configurable logger, but I'd prefer not to switch back to log4j completely to avoid any support issues.
  • I really like the RETURN_GENERATED_KEYS approach, although I've got some concerns about how well that will be supported across databases. I assume that for MySQL you had to change the primary key column types? I know originally I was using sequences in Postgres for this same purpose, but as the user base grew it was too tough to make that work across databases. Do you have any suggestions for how best to handle this issue? Would it be sufficient to simply keep a map of next available IDs that could be used instead of repeatedly querying the database? That would break if someone manually updated the database, but that shortcoming could be partially overcome by reloading the map anytime a SQL error occurs. (A rough sketch of this idea follows below.)
  • The logging changes in WikiResultSet and WikiPreparedStatement look good and are nice cleanups.
That's all I got through tonight, but I'll spend more time looking at your code over the weekend. -- Ryan • (comments) • 30-Jul-2009 23:03 PDT
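
To make the "map of next available IDs" idea above concrete, here is a rough sketch (hypothetical class and method names, not actual JAMWiki code). The cached counter is dropped and re-seeded whenever a SQL error suggests it may be stale:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class NextIdCacheSketch {

        // one counter per table, seeded lazily from the database
        private static final ConcurrentMap<String, AtomicInteger> NEXT_IDS = new ConcurrentHashMap<String, AtomicInteger>();

        /** Return the next available primary key value for the given table. */
        static int nextId(Connection conn, String tableName, String idColumn) throws SQLException {
            AtomicInteger counter = NEXT_IDS.get(tableName);
            if (counter == null) {
                counter = new AtomicInteger(queryMaxId(conn, tableName, idColumn));
                AtomicInteger existing = NEXT_IDS.putIfAbsent(tableName, counter);
                if (existing != null) {
                    counter = existing;
                }
            }
            return counter.incrementAndGet();
        }

        /** Per the suggestion above: drop the cached counter after a SQL error so it gets re-seeded. */
        static void resetAfterSqlError(String tableName) {
            NEXT_IDS.remove(tableName);
        }

        private static int queryMaxId(Connection conn, String tableName, String idColumn) throws SQLException {
            // tableName/idColumn come from code, never from user input
            PreparedStatement stmt = conn.prepareStatement("SELECT MAX(" + idColumn + ") FROM " + tableName);
            try {
                ResultSet rs = stmt.executeQuery();
                return rs.next() ? rs.getInt(1) : 0;
            } finally {
                stmt.close();
            }
        }
    }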

Follow-up Response [dfisla]

  • Regarding parsing issues I will re-run my tests and will get back to you.
  • The changes to the AnsiDataHandler and AnsiQueryHandler were done in such a way that would allow me to use a single code base to test both the new and old methods without having to set up two separate environments. I agree these two classes need to be cleaned up more and the MySQL-specific pieces should move to the MySqlDataHandler and MySqlQueryHandler classes.
  • On second thought, using ehcache is fine as its primary purpose is to be used by the web tier with infrequent updates. For topic parsing/processing purposes it makes sense to avoid ehcache and look at something else. Again, I don't think this applies to the jamwiki project as it's not aimed at data mining wikipedia content, so ehcache is great for its current purpose.
  • With respect to being in sync with the trunk: I have to maintain my own internal svn repository and use your trunk as a branch. I have to manually inspect all changes, apply them, and perform regression testing. After that, I take my changes and do the same when applying them back to my development branch in jamwiki, where I test again before committing. As you can see this takes some time and effort, so I don't expect to be in sync with the trunk.
  • With log4j, the configurable logger interface would be nice. I'll look into this; if it's not possible I will revert back.
  • With respect to the schema changes, DDL, and queries, the way things were just would not scale. As far as I know RETURN_GENERATED_KEYS is part of the JDBC interface (JDBC 4, I think), so any native JDBC 4 driver implementing the spec should work - of course there may be exceptions. The thing is, Oracle, DB2, Postgres, MySQL and Derby all support auto increment values/ids. I think it would make sense to focus only on compliant DB vendors; you already do this indirectly by sticking to ANSI SQL. In any case, implementing ID generation at the virtual machine/client level is doable (Hibernate and other OR-mappers do this). The issue with this approach is that it does not work across VMs, and when it comes to loading and parsing massive volumes of data, distributed VMs/servers are a must. FYI, the mediawiki schema uses auto increment ID values as well. :-)
  • In conclusion, I would still consider the code as 'alpha' and it was designed with the objective of minimal changes to the existing code base and approach. I needed to get things off the ground and be able to parse the 8M+ topics in English wikipedia in minimum time.
  • The good news is I am able to load 4M+ topics (I drop redirects) in 2.5 hours, and rebuild all categories in 16 hours (this includes deletes, with truncated tables it's even faster). Again, these are not exact benchmarks just rough numbers.
Hi Daniel - if you aren't concerned about keeping your branch in sync with trunk then most of the merging issues above can be ignored (including any log4j issues) and we can look into manually merging some of your changes. I'd like to start with a few of the less-intrusive changes first, such as the changes to log additional info in some of the database classes, and will try to get these merged at some point today or tomorrow.
With respect to RETURN_GENERATED_KEYS, since it was added with the 1.4 JDK it should be supported by most JDBC drivers. My concern was more about what changes need to be made to the table structures to support auto-incrementing primary keys; I'll do a bit more reading on this today. Also - awesome that you're able to load that number of topics. I would eventually like to do more systematic performance testing with JAMWiki, but time constraints have unfortunately prevented setting that up. -- Ryan • (comments) • 01-Aug-2009 13:44 PDT
Based on a quick bit of reading today it doesn't appear that auto incrementing is supported in a standardized way, so revision 2654 moves the "next id" logic out of the data handler and into the query handler. It should thus now be easy to write database-specific QueryHandler methods that take advantage of auto-incrementing, but the default AnsiQueryHandler will still maintain ANSI compliance. I'll take a stab at putting an example together using Postgres as soon as possible. -- Ryan • (comments) • 01-Aug-2009 19:02 PDT
Independent of the parser used (JFlex, bliki), I agree that an additional test framework is needed. I plan to develop a test/bench module that would test a few things like loading, rebuilding, and parsing, and produce some reports. The only thing is we will need 100K+ topics for it to be relevant. These could be table dumps, which should compress nicely. This way all developers and users can provide some stats on different DBMS platforms.
I also noticed that the special page that lists all categories uses a DISTINCT select clause, which pretty much kills the server on 5M+ topics. —The preceding comment was added by dfisla (comments • contribs).
Some sort of automated or semi-automated performance testing would be awesome - thanks for investigating. I am nearly positive that you're the first person to investigate JAMWiki with hundreds of thousands of topics, so I suspect you may find a number of bottlenecks such as the category page. Overall the entire category implementation may need significant work to scale to those kinds of loads.
With respect to the code in your branch, I've got auto-incrementing halfway implemented with Postgres using the SERIAL data type and will check this into trunk as soon as possible - probably in the next few days, depending on what the day job workload is this week. It more-or-less follows the example you've put together for MySQL, so hopefully it will meet your performance requirements and we can then also merge a version for use with MySQL and other databases. -- Ryan • (comments) • 02-Aug-2009 22:37 PDT
revision 2662 adds auto-incrementing support for Postgres, and also converts the insert methods to use native java.sql.* types instead of the org.jamwiki.db.* equivalents. The approach taken is:
  1. Add a new QueryHandler.autoIncrementPrimaryKeys() method. AnsiQueryHandler returns false for this method, but database-specific QueryHandlers can override that behavior, in which case the various insert methods will assume that no primary key is specified and that the database will auto-generate one (a simplified sketch of this dispatch follows below).
  2. Update the database-specific SQL property file to add appropriate CREATE_TABLE statements as well as UPGRADE statements so that the primary key types will be of the appropriate auto-increment type.
  3. Update the DatabaseUpgrades class and the UPGRADE.txt file to support upgrading existing installations.
Postgres seems to work well with this change, and in an unscientific test I did see slightly improved performance. Time permitting I'll take a stab at adding support for MySQL and potentially HSQL later today. Feedback is appreciated. -- Ryan • (comments) • 08-Aug-2009 09:26 PDT
revision 2663 adds support for MySQL AUTO_INCREMENT. I didn't test extensively, but upgrade and install from scratch both seem to work. -- Ryan • (comments) • 08-Aug-2009 14:25 PDT
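
For illustration, here is a minimal sketch of how an insert method can branch on the new QueryHandler.autoIncrementPrimaryKeys() method (hypothetical method body and simplified column list, not the actual trunk code; org.jamwiki.db.QueryHandler is assumed on the classpath):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Statement;
    import org.jamwiki.db.QueryHandler;

    public class TopicInsertSketch {

        /**
         * Prepare an insert for jam_topic (simplified column list). When the handler
         * supports auto-increment the primary key is omitted and read back later via
         * getGeneratedKeys(); otherwise the caller supplies the next id itself.
         */
        static PreparedStatement prepareTopicInsert(Connection conn, QueryHandler handler, int suppliedTopicId, int virtualWikiId, String topicName) throws SQLException {
            PreparedStatement stmt;
            if (handler.autoIncrementPrimaryKeys()) {
                // database-specific handler (e.g. Postgres SERIAL, MySQL AUTO_INCREMENT)
                stmt = conn.prepareStatement(
                        "INSERT INTO jam_topic (virtual_wiki_id, topic_name) VALUES (?, ?)",
                        Statement.RETURN_GENERATED_KEYS);
                stmt.setInt(1, virtualWikiId);
                stmt.setString(2, topicName);
            } else {
                // AnsiQueryHandler: an explicit primary key value is supplied by the caller
                stmt = conn.prepareStatement(
                        "INSERT INTO jam_topic (topic_id, virtual_wiki_id, topic_name) VALUES (?, ?, ?)");
                stmt.setInt(1, suppliedTopicId);
                stmt.setInt(2, virtualWikiId);
                stmt.setString(3, topicName);
            }
            return stmt;
        }
    }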

WikiCache

Copied from an email:

Hi Ryan,

I took a look at WikiCache class again, including ehcache, and everything looks good.

I noticed the CACHE_PARSED_TOPIC_CONTENT cache is used to cache parsed topic content, which is great except that it is currently only used to cache a very few topics such as the stylesheet, the left nav menu, etc...

I noticed that for complex topics/articles the parser has to look up all referenced topics, and the number of queries can be really large. I know the topic lookups are also cached using another ehcache cache, however I would like to use the PARSED_TOPIC_CONTENT cache to store at least 1 million of the most complex topics (a combination of topic content size and the highest number of outgoing links).

For some reason, I cannot get ehcache to persist across VM/tomcat restarts. Not sure why, but the CacheManager always deletes the old caches even when I specify that the cache is disk persistent and eternal. Code below:

            // Imports assumed: net.sf.ehcache.CacheManager plus the net.sf.ehcache.config
            // classes (Configuration, CacheConfiguration, DiskStoreConfiguration);
            // "directory" is the configured cache directory (a java.io.File).
            Configuration configuration = new Configuration();
            // default cache settings: overflow to a persistent disk store and never expire elements
            CacheConfiguration defaultCacheConfiguration = new CacheConfiguration();
            defaultCacheConfiguration.setDiskPersistent(true);
            defaultCacheConfiguration.setEternal(true);
            defaultCacheConfiguration.setOverflowToDisk(true);
            defaultCacheConfiguration.setMaxElementsInMemory(Environment.getIntValue(Environment.PROP_CACHE_TOTAL_SIZE));
            defaultCacheConfiguration.setName("defaultCache");
            logger.info(String.format("CACHE-CONFIG => Name: %s DiskPersistent: %s Eternal: %s OverflowToDisk: %s",
                    defaultCacheConfiguration.getName(),
                    defaultCacheConfiguration.isDiskPersistent(),
                    defaultCacheConfiguration.isEternal(),
                    defaultCacheConfiguration.isOverflowToDisk()));
            configuration.addDefaultCache(defaultCacheConfiguration);
            // point the ehcache disk store at the configured cache directory
            DiskStoreConfiguration diskStoreConfiguration = new DiskStoreConfiguration();
            diskStoreConfiguration.setPath(directory.getPath());
            configuration.addDiskStore(diskStoreConfiguration);
            WikiCache.cacheManager = new CacheManager(configuration);

What I really need is to pre-build those 1 million top topics as an ehcache data file which I can then use for already-parsed pages. I know I could do this by setting up an Apache proxy with the mod_jk connector, however I need the cache to remove and/or update any topics being changed by the users.

Another option I was thinking of is using the Lucene search index to store the already-parsed HTML topic content; this index is already being maintained to stay in sync with all the edits and already stores all topic content anyway.

Anyway, I guess I would like to figure out how to make ehcache persist data files across VM restarts.

—The preceding comment was added by dfisla (comments • contribs).

Hi Ryan, I got this working. I found some old ehcache code I was playing with that had cache persistence enabled.
New WikiCache code and config files attached. :-) It makes sense to switch to the ehcache.xml config file instead of re-creating all of the options through the properties file. BTW, this also works for maven builds/unit tests.
—The preceding comment was added by dfisla (comments • contribs).
Hi Daniel - I'm still looking through the ehcache.xml change. An initial question I have is how using a separate EhCache file would affect ease of setup and configuration. Currently nearly everything is configurable via the Special:Admin interface, and a user can set up a JAMWiki instance with just a few steps - there is no need for a default install to configure multiple files. Is there any way to implement persistent EhCache caches without the need for a separate file? Alternatively, can the information for this file be written by JAMWiki without the need for generating an EhCache-specific XML file? If we can't control this configuration via a web interface, an alternative would be to have a switch that uses the jamwiki.properties settings as a default but allows advanced users to specify that they would rather use an ehcache config file. Let me know your thoughts. -- Ryan • (comments) • 29-Aug-2009 12:25 PDT
One more thing, and this may be a dumb question, but simply removing the line:
   WikiCache.cacheManager.removalAll();
from WikiCache.initialize() isn't sufficient to cause the cache to persist across app server restarts? I haven't tried that yet, but intuitively it seems like a disk cache should persist automatically unless it is cleared. -- Ryan • (comments) • 29-Aug-2009 12:33 PDT
—The preceding comment was added by dfisla (comments • contribs).
Yep, that did it. I found it right after I sent my email. :-) You're right, I had to remove the cache configuration from the Admin manage interface. In theory, yes, you can generate the XML file; ehcache also references an XML schema so you can validate the generated config. I see your point about the skill set required to install, configure, and maintain JAMWiki. The ehcache.xml file can be shipped as-is, and I doubt most people would need to modify the default config. Perhaps, as part of the install/setup process, the install code can update the disk path to the folder where JAMWiki gets installed and where the cache files will be stored.
I personally would go with a default ehcache.xml file and maybe just put a static link on the admin page to where to find the file. Another approach would be, as you suggest, for the initialization method in WikiCache to check whether ehcache.xml exists on disk and load that file, and if not, work with the configuration properties in the jamwiki.properties file. The tricky part is updating the admin interface to detect the ehcache.xml file on disk so that a message stating the config properties are ignored could be displayed; otherwise it could be really confusing to end users when updating caching config params and the cache behavior does not change. :-)
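
A minimal sketch of that fallback idea (hypothetical method and parameter names; the code eventually merged to trunk may differ):

    import java.io.File;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.config.Configuration;

    public class WikiCacheConfigSketch {

        /**
         * Prefer an ehcache.xml file in the cache directory if one exists; otherwise
         * fall back to a Configuration built from the jamwiki.properties settings.
         */
        static CacheManager initializeCacheManager(File cacheDirectory, Configuration propertiesBasedConfig) {
            File ehcacheXml = new File(cacheDirectory, "ehcache.xml");
            if (ehcacheXml.exists()) {
                // advanced users: full control via the standard ehcache XML configuration
                return new CacheManager(ehcacheXml.getAbsolutePath());
            }
            // default: the configuration generated from Special:Admin / jamwiki.properties
            return new CacheManager(propertiesBasedConfig);
        }
    }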

Update from dfisla

Copied from an email:

Hi Ryan,

just wanted to give you an update on a few things. After taking 2 weeks off, I went back to jamwiki and started working on performance again. As it turns out, my biggest challenge is always maximizing CPU utilization (especially on multi-core systems - btw, I absolutely love the Core i7 platform) and minimizing I/O. Previously I had implemented a multi-threaded bulk-loader, and I had to implement a multi-threaded Lucene indexer that can produce N separate indexes which are then merged into a single optimized index. Just to give you an idea, before it would take about 5 days to index 5M+ topics, which I was able to do in 13 hours using the concurrent approach. The resultant Lucene index is about 22 gigs, so I was really excited about the results and the speed.

But I had to make some significant changes to the QueryHandler interface. First, I had to remove WikiResultSet and WikiPreparedStatement completely. I understand the desire to have a detached result set, however the WikiResultSet implementation in the trunk cannot even hold a simple list of 5M Integers in a 3 gig heap space. Also, the copying of results from the JDBC ResultSet to WikiResultSet in the constructor was a performance hit as well. I had to refactor the internal workings of the AnsiDataHandler implementation, but I was able to keep the DataHandler interface unchanged.

I plan to sync my changes into the dfisla branch in the jamwiki repository; I will need a few days to clean up the code a bit.

Finally, I plan to host a sandbox running my version of jamwiki with all of the wikipedia content, including search, etc... This is something I have spent the last 2 months working towards. I still have some performance issues when loading uncached articles. The database is really large (20+ gigs), and when the data cannot be cached in RAM, MySQL does too many seeks even when scanning indexes.

BTW, my current code performs better than MediaWiki's code (PHP or Java), especially when it comes to loading, re-building categories, and indexing.

—The preceding comment was added by dfisla (comments • contribs).

Thanks for the continued investigations - having never tried running JAMWiki code on large installations I'm finding it really interesting to see your progress. With respect to a few of the specific issues that you've raised:
  1. Investigating removal of WikiResultSet and WikiPreparedStatement may make sense at this point. Looking through the logs, those two classes were introduced in revision 103 and revision 243 (summer 2006) when functionality was significantly simpler. At the time the goal was to eliminate the need to deal with connection handling and avoid the possibility of accidentally leaving a connection open, but at this point there is a hodgepodge of methods that take Connection objects as arguments and those that don't, so it could be better to just standardize on the native java.sql.* functionality. I've already converted a few methods to use java.sql.* methods based on your auto_increment suggestion above, and patches to update additional methods (provided they do not add significant code complexity) would be welcomed. Please note that I'd like to start stabilizing for the 0.8.0 release, so any significant changes may need to wait for the 0.9.x development cycle.
  2. I'll look forward to your next round of branch commits. I tried to cherry pick through the last batch and implemented the auto_increment feature you suggested (with a few tweaks) and would appreciate any feedback - the implementation in trunk is slightly different from yours but works across the databases I've tested. If there are additional changes from your branch that you feel are important please call them out so that we can figure out how to get those into trunk.
  3. I'm definitely looking forward to seeing your sandbox - if it's something you're willing to share publicly then I'd be more than happy to post links from jamwiki.org to it so that it could be used as an example of what's possible, and if it will be permanently available then adding a link to it from the JAMWiki article on Wikipedia might also make sense.
Glad to hear your work is going well, and I'm definitely interested in integrating it back into trunk where appropriate. My major concerns are simply making sure that changes work across different databases and app servers, that the code remains readable, and that we don't sacrifice configurability and ease-of-use in a rush to implement performance updates - at this point the majority of the JAMWiki user base seems to be smaller installations looking for an easy-to-setup-and-configure wiki, and while I would definitely like to make sure things scale to large systems I want to make sure it's done in a way that doesn't affect those users. That approach may mean that things move a bit slower, but hopefully it will work better in the long run. -- Ryan • (comments) • 29-Aug-2009 12:12 PDT
Quick update - revision 2684 removes a few uses of WikiPreparedStatement in cases where the method was being passed a Connection object. Time permitting I'll see if I can track down a few more. -- Ryan • (comments) • 29-Aug-2009 16:45 PDT
revision 2687 converts another batch of org.jamwiki.db.WikiPreparedStatement uses to instead utilize java.sql.PreparedStatement. -- Ryan • (comments) • 30-Aug-2009 16:10 PDT

Updates for JAMWiki 0.9.0

  • revision 2782 significantly improves the performance of AnsiDataHandler.writeTopic() on my local setup by cleaning up some of the code used to update the search engine index. There is room for further improvement, but this change offers approximately an 8x speedup (from 0.4s to 0.05s) on my local box when updating the search index during a topic update. -- Ryan • (comments) • 07-Dec-2009 07:44 PST
  • revision 2783 changes the LuceneSearchEngine class so that it caches the IndexSearcher object (per the Lucene guidelines) rather than creating a new instance every time a search is performed. -- Ryan • (comments) • 07-Dec-2009 23:08 PST
  • revision 2784 improves the performance of AnsiDataHandler.writeTopic() by another 5-20%. Previously the code would either add or update the topic record, then add a new topic version record, and then perform a second update on the topic record to set the current version id. This change eliminates the first update, performing a topic update only after all required information is available. -- Ryan • (comments) • 07-Dec-2009 23:08 PST
  • revision 2787 further improves the performance of AnsiDataHandler.writeTopic() by avoiding an unnecessary query for the topic ID for each category associated with the topic. I didn't benchmark this one, but for topics with multiple categories I'd suspect a performance boost of perhaps 5-10% (numbers pulled magically out of the air - could be significantly less or significantly more). -- Ryan • (comments) • 09-Dec-2009 22:08 PST
  • In order to avoid the need to deal with details of database connections JAMWiki originally used two wrapper classes, WikiResultSet and WikiPreparedStatement. However, since there is some performance overhead involved in wrapping this functionality, and because these are both used very, very frequently I've removed them in favor of the native ResultSet and PreparedStatement classes. -- Ryan • (comments) • 22-Dec-2009 08:49 PST
  • revision 2824 removes an unnecessary parsing run, improving topic parsing speed 10-15% in my local benchmarks. Previously the code was doing an extra parser pass to determine if a topic was a redirect, but the necessary information was available from previous parser passes. -- Ryan • (comments) • 02-Jan-2010 22:28 PST
  • revision 2825, revision 2827, revision 2829 and revision 2833 significantly speeds up XML topic imports. Changes include removing some regular expressions, changing the interface to allow parsing very large XML data, removing recent change record creation, and adding two methods to allow creating a topic version without updating the topic record, and then batch updating topic versions to correctly set the previous_topic_version_id value. -- Ryan • (comments) • 17-Jan-2010 09:36 PST
  • revision 2839 updates the code to use PreparedStatement.executeBatch() when adding categories to a topic, which should provide a marginal speed boost when adding/updating topics with multiple categories (see the sketch after this list). -- Ryan • (comments) • 17-Jan-2010 09:26 PST
  • revision 2840, revision 2841 and revision 2842 update the search engine to remove a synchronized method, cache the IndexWriter instances, and reduce the number of search documents created for each topic from two to one. Numbers varied greatly, but in my local benchmarks this change improved search engine update times by 5-20%. Further testing is probably needed. -- Ryan • (comments) • 17-Jan-2010 09:26 PST
  • revision 2976 improves topic history retrieval by a factor of 100 (from 1.24 seconds to 0.014 seconds) on my test instance (containing 1.4 million topic version records). The only change was to add a database index for topic_id on the jam_topic_version table. -- Ryan • (comments) • 27-Mar-2010 10:18 PDT
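
For reference, the PreparedStatement.executeBatch() pattern mentioned in the category change above looks roughly like this (table and column names are illustrative, not necessarily those used by revision 2839):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class CategoryBatchSketch {

        /** Insert all category links for a topic with a single batched statement. */
        static void addCategories(Connection conn, int childTopicId, List<String> categoryNames, List<String> sortKeys) throws SQLException {
            // categoryNames and sortKeys are assumed to be parallel lists
            PreparedStatement stmt = conn.prepareStatement(
                    "INSERT INTO jam_category (child_topic_id, category_name, sort_key) VALUES (?, ?, ?)");
            try {
                for (int i = 0; i < categoryNames.size(); i++) {
                    stmt.setInt(1, childTopicId);
                    stmt.setString(2, categoryNames.get(i));
                    stmt.setString(3, sortKeys.get(i));
                    stmt.addBatch();
                }
                // one round trip instead of one executeUpdate() per category
                stmt.executeBatch();
            } finally {
                stmt.close();
            }
        }
    }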

Proposed Optimization

Currently a significant pain point in parsing comes up when parsing links of the form [[Topic Name]], particularly it seems in the User: and Template: namespaces. In each of these cases the query must check if lower(topic_name) = the link name, something that isn't indexed. I've been debating changing the way topic names are stored in the database to allow more flexibility with namespaces, and I'm proposing the following change to the structure of jam_topic:

virtual_wiki_id   namespace_id    topic_name    page_name    page_name_lower    delete_date

This approach allows easier management of language-specific namespaces, and a database index can be created for both page_name and page_name_lower to optimize performance (since not all databases allow an index such as lower(page_name) the additional column is a hackish workaround). It might also make sense to move deleted topics into their own table to simplify management, but that's probably another discussion. Thoughts? -- Ryan • (comments) • 07-Feb-2010 22:44 PST
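
To make the intended benefit concrete, here is a sketch of the lookup that the proposed columns would allow, assuming an index on (virtual_wiki_id, namespace_id, page_name_lower); the query is hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class TopicLookupSketch {

        /** Case-insensitive topic lookup that can use a plain index on page_name_lower. */
        static Integer lookupTopicId(Connection conn, int virtualWikiId, int namespaceId, String pageName) throws SQLException {
            PreparedStatement stmt = conn.prepareStatement(
                    "SELECT topic_id FROM jam_topic WHERE virtual_wiki_id = ? AND namespace_id = ? AND page_name_lower = ?");
            try {
                stmt.setInt(1, virtualWikiId);
                stmt.setInt(2, namespaceId);
                // lower-case in Java so the database never needs lower(page_name) in the WHERE clause
                stmt.setString(3, pageName.toLowerCase());
                ResultSet rs = stmt.executeQuery();
                return rs.next() ? Integer.valueOf(rs.getInt("topic_id")) : null;
            } finally {
                stmt.close();
            }
        }
    }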

This change has mostly been implemented on my personal branch in Subversion. There have been extensive changes to make namespaces easier to work with, and with another couple of days of work I should have some benchmarks available on performance, which I expect will be noticeably improved. Pending any major disasters this work will be merged to trunk very soon, so if anyone has comments or objections please air them now. -- Ryan • (comments) • 16-Mar-2010 16:00 PDT
This change has now been merged to trunk. -- Ryan • (comments) • 22-Mar-2010 23:19 PDT

Updates for JAMWiki Performance branch [dfisla] 0.9.0

I have merged the [dfisla] performance branch changes with the trunk [jamwiki 0.9.0], in preparation for the 0.9.0 release. The performance changes in the performance branch can be summarized as follows:

  • Topic Logs (log history) are not supported - the existing behavior consists of comments. The Item Log required another table to join against when fetching topic/topic version information and would add additional performance overhead.
  • The data model has most constraints disabled. The constraints are enforced in the application code instead.
  • The data model has additional indexes to make sure InnoDB never does table scans.
  • Most SQL queries have their sort order removed.
  • The LONGTEXT data fields store compressed data. With 5M+ topics, controlling the record length determines how many records the InnoDB engine can store in memory/cache; it is better to spend CPU on compression/decompression than on I/O.
  • The data provider for MySQL has been enhanced with additional interface methods that allow CRUD operations on topics and topic versions independently. Fetching of identifiers, and fetching other data columns without the topic content, have also been implemented; fetching all topic/version columns can degrade performance.
  • When a web page (topic) is parsed and sent for display to the web client, the fully parsed HTML output is cached. The older implementation relied on ehcache; the latest version uses the MySQL table jam_topic_cache to store parsed data (a read-through sketch follows this list). It is a simple exercise to dump data from the table into ehcache index files if desired.
  • The ehcache storing the parsed HTML would grow really large as well. If for any reason the app server needs to be killed/restarted, the large ehcache files become corrupt since they were not cleanly closed/shut down. Upon the next restart ehcache automatically deletes such files, and essentially the cache is lost. Given the size of the ehcache files needed to support 5M+ topics, restoring them is time consuming. As a result, a future version of the performance branch will interface with ehcache using a standalone cache server (or an external cluster of ehcache servers) configuration (GlassFish v3 + ehcache). This way, should the front-end web servers need to be killed/restarted, the ehcache server remains operational and preserves the integrity of the cache files.
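
As an illustration of the jam_topic_cache approach mentioned in the list above, here is a read-through sketch (the column names and parser call are hypothetical; the actual branch code may differ):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class TopicCacheSketch {

        /** Return cached parsed HTML for a topic, parsing and caching it on a miss. */
        static String getParsedHtml(Connection conn, int topicId, String rawWikiText) throws SQLException {
            PreparedStatement select = conn.prepareStatement(
                    "SELECT cached_content FROM jam_topic_cache WHERE topic_id = ?");  // column name is illustrative
            try {
                select.setInt(1, topicId);
                ResultSet rs = select.executeQuery();
                if (rs.next()) {
                    return rs.getString("cached_content");
                }
            } finally {
                select.close();
            }
            String html = parseToHtml(rawWikiText);  // stand-in for the real JAMWiki parser call
            PreparedStatement insert = conn.prepareStatement(
                    "INSERT INTO jam_topic_cache (topic_id, cached_content) VALUES (?, ?)");
            try {
                insert.setInt(1, topicId);
                insert.setString(2, html);
                insert.executeUpdate();
            } finally {
                insert.close();
            }
            return html;
        }

        private static String parseToHtml(String rawWikiText) {
            return rawWikiText; // placeholder for the real parser call
        }
    }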

uniBlogger - Jamwiki performance sandbox

Merge Status (04-February-2010)

Merging the dfisla branch is underway, although it's slow going. The status to this point is the following:

  • Trunk is mostly merged to the dfisla branch, although as new changes are added additional merges will need to be done.
  • I'd like to start by trying to integrate the cache changes from Daniel's branch - the dfisla branch changes can't be merged directly since they change JAMWiki behavior, but it should be possible to implement a config setting to allow use of ehcache.xml configuration, as is done in his branch.
  • The dfisla branch eliminates all support for Special:Logs, so this is either an area that can't be merged or one which will need to be updated on the dfisla branch. I suspect it's something that will remain different between the two branches.
  • There are other minor schema changes (example: jam_recent_change uses change_date on trunk but edit_date on dfisla) that may be able to be updated on the dfisla branch.
  • I need to look more closely at the compression work done on the dfisla branch to see if it's something that should be merged.

In general I'd like to merge code that does not affect existing JAMWiki functionality and is useful for databases other than just MySQL. Any suggestions and help in this effort are appreciated. -- Ryan • (comments) • 04-Feb-2010 22:16 PST

revision 2873 should merge all functionality from the dfisla branch for WikiCache. Some comments while merging the code:
  • Is WikiCache.flushCache() needed? I didn't see it referenced anywhere in the code, but if it's needed for some other reason it can be added. For now I've left it out.
  • WikiCache will now look for an ehcache.xml file, and if one exists it will be used by default; otherwise the default cache configuration (as configured from Special:Admin) will be used.
  • I've added an /WEB-INF/classes/ehcache-sample.xml file. This mostly follows Daniel's example but is updated with the latest version from http://ehcache.org/ehcache.xml.
Feedback is appreciated, but hopefully this is sufficient for the required cache functionality. Suggestions about how to merge additional code from the dfisla branch would also be appreciated. -- Ryan • (comments) • 07-Feb-2010 16:10 PST


Database Note

This may be outdated - I am using quite an old version of Jamwiki (at least 18 months old) with a lot of custom features. I had one incredibly significant boost in efficiency and memory usage with a simple change - I stripped out all the built-in connection pooling code and moved it to a simpler utilisation of Apache Commons' DBCP package. No more 'select 1' queries, and such a significant improvement that I thought I would recommend you try it out here. I run a fairly large site and get about 150K uniques on the wiki each month; sustained memory usage dropped from close to 900MB for my entire app (including lots of non-wiki stuff) to 400MB. Apologies if this code has already been removed in later versions, but if not, try it...

To clarify (since it does already use DBCP), the code that works better is the datasource/pool initialisation:

    // Imports assumed (Commons DBCP 1.x / Commons Pool 1.x): javax.sql.DataSource,
    // java.sql.Connection, java.sql.SQLException, org.apache.commons.pool.impl.GenericObjectPool,
    // and org.apache.commons.dbcp.{ConnectionFactory, DriverManagerConnectionFactory,
    // PoolableConnectionFactory, PoolingDataSource}.
    // driverName, databaseURL, user, password, minpoolsize, maxpoolsize and maxWait
    // are configuration fields of this class (not shown in this excerpt).
    public static DataSource ds = null;
    private static GenericObjectPool _pool = null;

    public Connection getConnection() throws SQLException {
        return ds.getConnection();
    }

    private void connectToDB() {
        try {
            java.lang.Class.forName(driverName).newInstance();
        }
        catch(Exception e) {
            LOG.error("Error when attempting to obtain DB Driver: " + driverName + " at " + new java.util.Date().toString(), e);
        }

        try {
            StandardConnection.ds = setupDataSource(
                    databaseURL,
                    user,
                    password,
                    minpoolsize,
                    maxpoolsize);
        }
        catch(Exception e) {
            LOG.error("Error when attempting to connect to DB ", e);
        }
    }

   public static DataSource setupDataSource(String connectURI, String username, String password, int minIdle, int maxActive) throws Exception {
        //
        // First, we'll need a ObjectPool that serves as the
        // actual pool of connections.
        //
        // We'll use a GenericObjectPool instance, although
        // any ObjectPool implementation will suffice.
        //
        GenericObjectPool connectionPool = new GenericObjectPool(null);

        connectionPool.setMinIdle( minIdle );
        connectionPool.setMaxActive( maxActive );
        connectionPool.setMaxWait( maxWait );

        StandardConnection._pool = connectionPool; 
        // we keep it for two reasons
        // #1 We need it for statistics/debugging
        // #2 PoolingDataSource does not have getPool()
        // method, for some obscure, weird reason.

        //
        // Next, we'll create a ConnectionFactory that the
        // pool will use to create Connections.
        // We'll use the DriverManagerConnectionFactory,
        // using the connect string from configuration
        //
        ConnectionFactory connectionFactory = new DriverManagerConnectionFactory(connectURI,username, password);

        //
        // Now we'll create the PoolableConnectionFactory, which wraps
        // the "real" Connections created by the ConnectionFactory with
        // the classes that implement the pooling functionality. Its constructor
        // registers the factory with connectionPool, which is why the local
        // variable is never referenced again.
        //
        PoolableConnectionFactory poolableConnectionFactory = new PoolableConnectionFactory(connectionFactory, connectionPool, null, null, false, true);

        PoolingDataSource dataSource = new PoolingDataSource(connectionPool);

        return dataSource;
    }
Thanks for the code! Early versions of JAMWiki actually used something similar to what's above, but there were some bug reports with transactions on some databases so it was converted to use Spring to make transaction handling easier. Since it seems there may be some performance issues it sounds like it may need revisiting. Do you perchance remember approximately what the performance improvement was for the simpler approach? I'd be very curious about any benchmarks, although I suspect those might need to be generated when new code is written :) -- Ryan • (comments) • 02-Mar-2010 20:41 PST
I'm afraid that the benchmarks mentioned above are about as good as I can give. We were getting as many as 1.1 million 'select 1' queries performed every 24 hours, so obviously with those no longer happening things are much faster. Can't really offer anything concrete except that CPU usage dropped from 80% to 65% across our whole webapp and the number of DB queries dropped significantly. Worth investigating when you get the time though!