MarpX Search-and-Filter Specifications

Overview

MarpX is a technology of Marpex Inc., Steubenville OH, that opens access to information within bodies of text. MarpX combines search and proximity-based filtering within word collections of any size. The user is given simple yet direct control to filter out instances in which the desired terms are too far apart. That control, combined with attention to headings, helps deliver the results most likely to be meaningful to the user.

Content owners who use this technology have the option of rendering results in two stages. The first stage identifies the documents (books, articles, legislation, patents, etc.) or collections that contain the user-specified search terms close together. The user may then examine the hits in particular documents, acquire a copy of a document, or research the hits most likely to be meaningful across the entire collection.

Search AND Filter

Human languages express meaning by placing words in relationship with one another. When words are far apart, they are usually not related. It is meaningless to deliver search results in which the desired terms are paragraphs apart. (Exception: Words in headings do help to convey meaning.) MarpX takes the nature of language into account by removing from the list of search results all hits in which the words are further apart than the user specifies.
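
The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration for two search terms, not the MarpX implementation: the function names, the `max_gap` parameter, and the whitespace tokenization are all assumptions made for the example.

```python
def gaps_between(words, term_a, term_b):
    """Return the number of intervening words for every pairing of the two terms."""
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    return [abs(i - j) - 1 for i in pos_a for j in pos_b]


def proximity_filter(words, term_a, term_b, max_gap):
    """Keep only the hits whose terms are no farther apart than max_gap words."""
    return [g for g in gaps_between(words, term_a, term_b) if g <= max_gap]


# Hypothetical sample text: "patent" occurs near "index" once and far from it once.
text = ("the patent describes a compressed index and later a long digression "
        "about servers mentions the patent again")
words = text.lower().split()
close_hits = proximity_filter(words, "patent", "index", max_gap=5)
```

With a limit of five intervening words, only the first pairing survives; the second occurrence of "patent" is dropped as too far from "index" to be related.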

Filtering out meaningless hits requires more work by the computer. But it saves the user the bother of sorting through piles of meaningless hits. Let the computer do the extra work, not the person.

MarpX's search-and-filter combination tool delivers fewer hits than conventional technology. But who needs millions of hits when a few dozen good hits can be found through precision search and filtering?

User Experience

User specifies a search: Terms may be entered into four boxes: the standard search box, or the three boxes for advanced search (all of these words, any of these words, none of these words). Each box may take up to 15 words. There should be no punctuation between the words, with one exception: any two or more words in any box may be enclosed in double quotation marks, in which case the system will look for that exact phrase. Multiple phrases may be requested in each box. Across the three boxes of advanced search, as many as 31 words in total may be used. There must be at least one word in a box other than the "none of these words" box.
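
The input rules for the advanced search boxes can be expressed as a small validator. The sketch below is illustrative only: the function names are hypothetical, words inside a quoted phrase are assumed to count individually toward the limits, and "no punctuation" is approximated by requiring alphanumeric words.

```python
MAX_WORDS_PER_BOX = 15   # per-box limit stated in the text
MAX_WORDS_ADVANCED = 31  # combined limit for the three advanced boxes


def parse_box(text):
    """Split one box into single words (str) and quoted phrases (tuple of str)."""
    if text.count('"') % 2:
        raise ValueError("unbalanced double quotes")
    items = []
    # Splitting on the quote mark leaves the quoted pieces at odd indexes.
    for i, piece in enumerate(text.split('"')):
        words = piece.split()
        if not all(w.isalnum() for w in words):
            raise ValueError("no punctuation other than double quotes")
        if i % 2:
            if len(words) < 2:
                raise ValueError("a phrase needs two or more words")
            items.append(tuple(words))
        else:
            items.extend(words)
    return items


def word_count(items):
    """Count words, treating each word of a phrase separately (an assumption)."""
    return sum(len(item) if isinstance(item, tuple) else 1 for item in items)


def parse_advanced(all_box, any_box, none_box):
    """Parse the three advanced boxes and enforce the stated limits."""
    query = {"all": parse_box(all_box),
             "any": parse_box(any_box),
             "none": parse_box(none_box)}
    counts = {name: word_count(items) for name, items in query.items()}
    if max(counts.values()) > MAX_WORDS_PER_BOX:
        raise ValueError("a box holds more than 15 words")
    if sum(counts.values()) > MAX_WORDS_ADVANCED:
        raise ValueError("more than 31 words in total")
    if counts["all"] + counts["any"] == 0:
        raise ValueError("need a word outside the 'none of these words' box")
    return query
```

For example, `parse_advanced('compressed "word index"', 'patent', '')` yields one plain word and one phrase in the "all" box and one plain word in the "any" box.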

User views summary results: In many applications, the searchable content is made up of distinct documents -- ebooks or books for publishers, articles for archivists, patents, laws, or bills for legal and legislative professionals, etc. Each document has a distinct name. For such collections, search and filtering lead to a list of the documents, ranked by the number of hits found within each. The document with the most hits is at the top, followed by documents with progressively fewer hits. If more than one hundred documents contain hits, typically only the first one hundred are shown.
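
The summary ranking amounts to counting hits per document and keeping the top one hundred. A minimal sketch, assuming the filter has already produced one document name per surviving hit (the `summarize` function and its input shape are hypothetical):

```python
from collections import Counter


def summarize(hit_documents, limit=100):
    """Rank documents by hit count, most hits first, keeping at most `limit`."""
    return Counter(hit_documents).most_common(limit)


# Hypothetical input: the document name attached to each filtered hit.
hits = ["Patent A", "Patent B", "Patent B", "Patent C", "Patent B", "Patent A"]
summary = summarize(hits)
```

Here `summary` lists "Patent B" first with three hits, then "Patent A" with two, then "Patent C" with one.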

The summary results page may be skipped for other collections, such as super-sized web sites. A web site of under about 3.2 million words (the Level One limit) could be treated either way: as a single document, or as a miscellaneous component in a massive collection.

User views detail results for one document: One option is to drill down from the summary results and see all of the hits, or up to the first 500, within a document of interest. This filtered list of paragraphs is ranked starting with the instances in which the words are closest together. The desired words, highlighted in colors, become progressively farther apart as one proceeds down the list. Arranged in this order, the most meaningful hits tend to cluster toward the top of the list.

Each paragraph is preceded by the hierarchy of headings and possibly a page number that locate the hit within the document. Location of each hit within a document improves the usefulness of the information to the user. Each paragraph is followed by a count of the words that intervene between the terms requested by the user. If there are multiple hits within a paragraph, the best one is counted. In some collections, each paragraph is followed also by an image or button that links to a web page from which the full document may be obtained.
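
The detail ranking can be illustrated as follows. This sketch scores each paragraph by its best hit and sorts the closest hits to the top, capped at 500. The source does not define the count for more than two terms, so the code assumes it is the size of the smallest window containing every term, minus the number of terms; the function names and tokenization are also assumptions.

```python
def best_gap(words, terms):
    """Fewest intervening words in any window containing every term, or None."""
    needed = set(terms)
    last_seen = {}  # most recent position of each term seen so far
    best = None
    for i, word in enumerate(words):
        if word in needed:
            last_seen[word] = i
            if len(last_seen) == len(needed):
                # The tightest window ending here starts at the oldest "last seen".
                span = i - min(last_seen.values()) + 1
                gap = span - len(needed)
                if best is None or gap < best:
                    best = gap
    return best


def rank_paragraphs(paragraphs, terms, limit=500):
    """Return (gap, paragraph) pairs, closest hits first, at most `limit`."""
    scored = []
    for paragraph in paragraphs:
        gap = best_gap(paragraph.lower().split(), terms)
        if gap is not None:  # paragraphs missing a term are not hits
            scored.append((gap, paragraph))
    return sorted(scored, key=lambda pair: pair[0])[:limit]
```

A paragraph in which the terms are adjacent scores zero and sorts first; a paragraph missing any term is left out entirely.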

User views the best of the best detail results in all the documents: A button on the summary results page enables the user to drill down through all of the documents, in order to research a topic across all of the documents in the collection. Response time may be slightly longer for this option, since the search and filter system is reviewing every possibility in every document that contains hits.

User views document list: At the content owner's discretion, the user may be enabled to click on the collection title and be shown an instant list of up to 4,096 documents within the current collection.

The 4k Pyramid

MarpX search-and-filter technology tracks distance between words in successively larger collections of text.

Level One collections may each contain up to 3,276,800 (about 3.2 million) words. Often these collections are smaller when the intent is to draw user attention to a specific document -- a book, legislative action, patent, web site, and so on. Level One collections are described in Marpex Inc.'s U.S. Patent Number 7,433,893. Level One collections are present on the server, no matter what the ultimate size of collections of collections; all exposure of content to users is derived from these Level One files.

Level Two collections consist of one to 4,096 Level One collections. The theoretical size limit for Level Two is 4,096 times 3,276,800 = 13,421,772,800, or about 13.4 billion words. Level Two and all higher levels are based on identical techniques of content preparation, search, filtering, relevance ranking, etc.

Levels Three and higher are each aggregations of up to 4,096 units of the previous level. Theoretical word count limits spiral upward -- about 55 trillion words at Level Three, and so on. Computer equipment needs begin to mount at Level Three; we recommend a separate server for each underlying Level Two collection. The RAM capacity and computing power needed for preparation also increase with aggregate collection size.
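
The theoretical limits follow directly from the two figures given above: 3,276,800 words per Level One collection and a fan-out of 4,096 per level. A short worked calculation (the constant and function names are chosen for the example):

```python
LEVEL_ONE_WORDS = 3_276_800  # Level One word limit used in the Level Two calculation
FAN_OUT = 4_096              # units of the previous level per collection


def capacity(level):
    """Theoretical word limit of a collection at the given level (1, 2, 3, ...)."""
    return LEVEL_ONE_WORDS * FAN_OUT ** (level - 1)


for level in (1, 2, 3):
    print(f"Level {level}: {capacity(level):,} words")
```

This prints 3,276,800 for Level One, 13,421,772,800 for Level Two, and 54,975,581,388,800 for Level Three, matching the rounded figures of 13.4 billion and 55 trillion.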

Server Requirements

Windows PC versions exist for Levels One through Three. In theory, a Level Three collection (for example, of all U.S. patents) could be delivered on a portable hard disk for use with a PC. But the primary focus is on server-based search and filtering.

To date, MarpX has used Windows servers.

Depending on traffic, a single server may be adequate for a Level Two collection. Going higher, we recommend one server for each Level Two collection. Typical RAM requirements are under a megabyte per instance, but may peak at about one and a half times the size of the group's largest Level One collection.

Content

MarpX search and filtering is designed for pattern recognition among words. MarpX can handle word descriptions of images, but cannot search and filter the images themselves. So too, scripts of audio and video content can be processed, but not the audio and video itself. Of course, any content that can be included on a web page (images, audio, video, etc.) can be presented as a part of search and filtering results. In that sense, the user may look for, find, and immediately view and hear content that accompanies the words.

Most usage of MarpX has been with English text. European languages are supported by standard HTML ampersand codes for accented characters. While a rewrite to accommodate sixteen-bit characters (Japanese, Chinese, Korean) is theoretically possible, no move in that direction is anticipated.

The recursive design of MarpX suggests that it could be applied to full Internet search. That would require significant investment by any company thinking of an Internet-wide implementation. Would precision filtering be attractive in an Internet search? More than likely, yes, especially to end users seeking to carry out research rather than casual search. Underlying question: In today's environment, are there niche markets that would provide the base to monetize Internet-scope implementations?

Preparation of content amounts to reducing incoming text to printable ASCII with sufficient HTML tags to distinguish headings and paragraph boundaries. Optionally, tags may be added to support tables, lists, images, videos, etc. This preparation has been fully automated for ebooks. Marpex Inc. has developed a variety of text extraction tools over the years, with varying quality of results. C++ source code will be put into the public domain on request for any particular extractor or preparation tool. Ideally, when the content format is consistent, as it typically is for most newspapers and magazines, for the Congressional Record, for patents, etc., a custom extractor is built from these tools.
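
One piece of this preparation, the reduction to printable ASCII with ampersand codes for accented characters, can be sketched with Python's standard library. This is an illustration only and not one of the Marpex C++ tools; the function name and the fallback to numeric character references are assumptions.

```python
from html.entities import codepoint2name


def to_printable_ascii(text):
    """Replace every non-ASCII character with an HTML ampersand code."""
    out = []
    for ch in text:
        code = ord(ch)
        if 32 <= code < 127 or ch in "\n\t":
            out.append(ch)  # printable ASCII (including tag characters) passes through
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")  # named code, e.g. &eacute;
        else:
            out.append(f"&#{code};")  # assumed fallback: numeric reference
    return "".join(out)


sample = to_printable_ascii("<h1>Résumé</h1>")
```

Existing HTML tags are left untouched, so heading and paragraph markup survives the conversion while the accented letters become ampersand codes.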

Five Distinctives of MarpX

Simplicity:

  • The look and feel fits in with users' expectations. The only difference is that zero, one, or many phrases may be specified, by enclosing any two or more words in double quotes.
  • An eight-year-old can understand the scoring for relevance; just count the words in between the highlighted terms.
  • There is no mountain of false hits to dig through; the good results tend to be nearest the top of the list.

Meaning: Conventional search engines use large targets for their search -- a web page, an entire PDF file, a patent, or whatever. Within such targets, some of the words are far apart. The result is lots of false hits, that is, combinations of the words that you want that are too far apart to be related in meaning. MarpX uses a small target -- a 100-word block, plus a few words in case of overlap into the next block, plus the headings related to that bit of text. If any of the words requested by the user are in headings rather than the text, that's reported as a hit, but at the bottom of the list. The most meaningful hits are those with the search terms really close together; those appear at the top of the list.
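
The small-target idea can be pictured as cutting text into 100-word blocks, each carrying a little overlap and its headings. The sketch below is an assumption-laden illustration: the text says only "a few words" of overlap, so the value of five is invented, as are the function name and the dictionary layout.

```python
def make_blocks(words, headings, block_size=100, overlap=5):
    """Cut a word list into blocks that run a few words into the next block."""
    blocks = []
    for start in range(0, len(words), block_size):
        blocks.append({
            "headings": list(headings),  # headings that locate this bit of text
            "start": start,              # position of the block's first word
            "words": words[start:start + block_size + overlap],
        })
    return blocks


# Hypothetical 250-word section under a single heading.
words = [f"w{i}" for i in range(250)]
blocks = make_blocks(words, ["Chapter 1"])
```

The overlap means a pair of terms that straddles a block boundary still falls inside one search target.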

Respect:

  • MarpX keeps tabs on what people ask for, but makes no record about the user. Tracking users is a polite way of saying what most conventional engines do. Marpex Inc. as a matter of principle shows respect by choosing not to spy on customers.
  • MarpX gives customers what they ask for, not what some search engine designer thinks they should be given.

Transparency: As pointed out above, a child can figure out why one hit is placed before or after another in a list of results. It's a matter of counting the words in between the highlighted words that were requested. There is no way to game the system. There will be no Search Engine Optimization industry growing up around MarpX. The hit with the desired words closest together will appear at the top of the results. End of discussion.

Power: On hearing a description of MarpX, one old gentleman put it crudely, but accurately: "Oh, you filter out the crap." Yes. That's the magic of a combination search-and-filter engine. And MarpX tracks word separation distances even in massive quantities of text. That's powerful! That it does so much more work than conventional engines, yet does it so fast, that's the power of efficient design at every point within MarpX technology.

Current R&D

Something old: Marpex Inc. invented the original FindIt CD-ROM search engine for Reteaco Inc. in 1984. Its interface included a display of every search term along with its frequency. We are building that same feature into MarpX interfaces. That will let mobile touch devices make it easy for users to select terms. Apps can be written to handle vocabulary lists, then send the search specification to a server to do the heavy lifting online.
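
A vocabulary list of this kind, every term with its frequency, is simple to produce. The sketch below is a generic illustration and not the FindIt or MarpX code; the alphabetical ordering and whitespace tokenization are assumptions.

```python
from collections import Counter


def term_frequencies(text):
    """Return (term, frequency) pairs in alphabetical order."""
    return sorted(Counter(text.lower().split()).items())


vocabulary = term_frequencies("Patent index patent search index patent")
```

An app could present such a list for tapping, then send the chosen terms to the server as the search specification.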

"Something old" will include fielded search, standard in the original FindIt. Effect: The user can limit search and filtering to the full text associated with specific authors, titles, publication dates, or other document metadata.
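
Fielded search can be pictured as a metadata filter applied before search and filtering run on the full text. A minimal sketch, in which the field names, the document records, and the `restrict_to_fields` function are all hypothetical:

```python
def restrict_to_fields(documents, **wanted):
    """Keep only documents whose metadata matches every requested field."""
    return [doc for doc in documents
            if all(doc.get(field) == value for field, value in wanted.items())]


# Hypothetical collection metadata.
library = [
    {"title": "Report One", "author": "Smith", "year": 1999},
    {"title": "Report Two", "author": "Jones", "year": 1999},
    {"title": "Report Three", "author": "Smith", "year": 2004},
]
by_smith = restrict_to_fields(library, author="Smith")
```

Only the documents that pass this step would then be searched and filtered.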

Something new: MarpX will use the skills and knowledge it has developed in back-of-the-book indexing and its 32 years in search to introduce an entirely new form of search-and-filter operations. The working name: beyond concept search. That's all we've got to say about that (for now).

Legal

Again, something old: The recursive technology that MarpX uses to search and filter progressively greater quantities of text relies on a variation of the index structure invented by Marpex Inc. in 1984 for Reteaco Inc. Simple legal point: Nobody can legally challenge a work that has been in the marketplace since mid-1985.

Something patented: A strange thing about MarpX technology -- there is no text on the server. Text is routinely set aside after an index is built; it is never sent to the server. All text shown in a search-and-filter result is reconstituted on the fly from the index, held in a "PX1" (MarpX Level One) file. These super-compressed indexes are described in U.S. Patent Number 7,433,893. Enough said (we hope).
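
The general idea of rebuilding text from an index, rather than storing the text itself, can be shown with a toy positional index. This is emphatically not the patented PX1 format, which is not described here; it is a generic, uncompressed illustration with hypothetical function names.

```python
def build_index(text):
    """Map each word to the list of positions at which it occurs."""
    index = {}
    for position, word in enumerate(text.split()):
        index.setdefault(word, []).append(position)
    return index


def reconstitute(index, start, end):
    """Rebuild the words at positions start..end-1 using only the index."""
    by_position = {p: word
                   for word, positions in index.items()
                   for p in positions
                   if start <= p < end}
    return " ".join(by_position[p] for p in sorted(by_position))


original = "text is rebuilt from the index not from the text"
index = build_index(original)
excerpt = reconstitute(index, 2, 6)
```

Once the index exists, the original string can be discarded; any span of it can still be recovered from the word positions alone.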
