
Zend_Search (Java Lucene)

From: Natalie Kather

We have integrated Zend_Search into our content management system “Click and Change” and we would like to share our experiences with the developer community. For this purpose, we decided to publish the complete source code and its documentation.

Please feel free to use the attached code and modify it to your requirements.

OVERVIEW

  • Interesting Problems / Special Features
  • Documentation
  • To do list
  • Your Feedback / Questions
  • Download

INTERESTING PROBLEMS / SPECIAL FEATURES

1. Special Characters
At the moment, Zend_Search supports only characters from the ASCII charset. Since we needed to search for German umlauts (ä, ö, ü), we had to encode them as ASCII characters. At first, we wanted to save them as HTML entities (&auml; for ‘ä’), but you cannot search for ‘&’ either.

This is why I decided to encode, for instance, ‘ä’ as ‘xxxaexxx’, ‘ö’ as ‘xxxoexxx’, and so on. As Zend_Search doesn’t permit searching for numbers and some other special characters, I encoded them as well (‘1’ as ‘xxxonexxx’, ‘=’ as ‘xxxequalxxx’, etc.).

The attached code only includes the most common special characters of the German language, so feel free to modify it to your requirements.

See search_index::simplify($string) and search_index::unsimplify($string) for details.

Furthermore, the engine doesn’t accept capital letters in the query, so I convert them to lower case. As a result, the search is always case-insensitive.
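The encoding described above can be sketched roughly as follows. This is a minimal illustration and not the attached implementation: the functions are free-standing rather than methods of search_index, the map is deliberately short, and the entry for ‘ü’ is an assumption derived from the “and so on” pattern.

```php
<?php
// Hypothetical, shortened character map; the real tables in search_index may differ.
const CHAR_MAP = [
    'ä' => 'xxxaexxx',
    'ö' => 'xxxoexxx',
    'ü' => 'xxxuexxx', // assumed by analogy with ä and ö
    '1' => 'xxxonexxx',
    '=' => 'xxxequalxxx',
];

function simplify(string $string): string
{
    // Lower-case the ASCII letters, then the capital umlauts,
    // which strtolower() leaves untouched under the default locale.
    $lower = strtr(strtolower($string), ['Ä' => 'ä', 'Ö' => 'ö', 'Ü' => 'ü']);

    // Replace every special character with its ASCII placeholder.
    return strtr($lower, CHAR_MAP);
}

function unsimplify(string $string): string
{
    // Reverse the map to turn placeholders back into the original characters.
    return strtr($string, array_flip(CHAR_MAP));
}

echo simplify('Müller'), "\n";          // mxxxuexxxller
echo unsimplify('mxxxuexxxller'), "\n"; // müller
```

Because the text is lower-cased before it is encoded, unsimplify() cannot restore the original capitalisation; for a case-insensitive search this does not matter.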

2. Search Options

By default, Zend_Search does not search in every field of the document, so you need to name each field explicitly in the query.

$hits = $index->find('headline:' . $query . ' contents:' . $query . ' link:' . $query);
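If the list of fields changes often, the query string can be assembled by a small helper. The function name build_field_query() is hypothetical and not part of the attached code; the field names are the ones used in the call above.

```php
<?php
/**
 * Build a query string that looks for $query in each of the given fields.
 * Hypothetical helper, shown only to illustrate the string that is passed to find().
 */
function build_field_query(string $query, array $fields = ['headline', 'contents', 'link']): string
{
    $parts = [];
    foreach ($fields as $field) {
        $parts[] = $field . ':' . $query;
    }

    // The parts are separated by spaces, exactly as in the hand-written query.
    return implode(' ', $parts);
}

echo build_field_query('zend'), "\n"; // headline:zend contents:zend link:zend
```

The result would then be handed to the index in the usual way, for example $index->find(build_field_query($query)).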

3. Large Indices

Dealing with lists containing more than a thousand addresses, we had the best results using a segment size of one hundred entries (websites with content) per segment.

int search_index::segment_size
manages your segment size
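One way to picture this is to split the address list into batches of that size before indexing. The sketch below only shows the batching; the constant SEGMENT_SIZE and the function name are assumptions, and how each batch is written to the index depends on the attached search_index class.

```php
<?php
// Hypothetical constant mirroring search_index::segment_size.
const SEGMENT_SIZE = 100;

/**
 * Split a list of addresses into batches of at most $size entries.
 */
function split_into_segments(array $urls, int $size = SEGMENT_SIZE): array
{
    return array_chunk($urls, $size);
}

// 250 placeholder addresses end up in three batches: 100, 100 and 50 entries.
$urls = [];
for ($i = 1; $i <= 250; $i++) {
    $urls[] = 'http://mysite.com/entry.php?ID=' . $i;
}
$segments = split_into_segments($urls);
echo count($segments), "\n"; // 3
```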

4. List complete Index

Note the following little trick: if you want to list all your websites, simply search for ‘http’, since it appears in every indexed link ;)

DOCUMENTATION

Class diagram

Files (click on a file name to view the source code):

include.php

Configure constants such as the paths to store your indices in and the sources of your link lists.
Note that, for now, it only contains dummy links that do not work!

index.php
This is the search engine. Enter your query and choose the index you want to search.

main_modify.php
A neat configuration tool to manage your indices in your browser via a web front end. Choose the source (an XML sitemap delivered by Google or a link list [only links within the tags marked in the HTML code - your navigation links are ignored]) and the target (a specific index) of your links, and decide what you want to do (overwrite/create a new index or append to an existing index).

Read more about Google’s Sitemap Generator: http://www.google.com/webmasters/sitemaps/docs/en/sitemap-generator.html

Use multiple indices to search, for example, either your whole website or only the news archive.

main_search.php
This file receives its data via method=post from the form in index.php

modify.php
This file receives its data via method=post from the form in main_modify.php

page.php
class page; class page_found extends page; class page_add extends page
These classes manage your result pages and the pages you want to add.
string page_found::cut(string $string) cuts your content after 100 characters, but always at the end of a word.
void page_found::highlight() is quite a neat function that highlights your search terms in the results case-insensitively. Before I wrote it myself, I had searched the web for such a tool, but found nothing that did what I needed.
void page_add::pick_content() extracts the content area between the tags and from a file and removes html characters.
void page_add::pick_title() extracts the title of an HTML file from between the <title> tags.
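
The ideas behind cut() and highlight() can be sketched as two free-standing functions. This is not the code from page.php: the signatures, the <b> markup and the byte-based length counting are assumptions made for the example.

```php
<?php
/**
 * Shorten $string to at most $limit bytes without splitting a word.
 * Note: strlen()/substr() count bytes, so multi-byte characters count more than once.
 */
function cut(string $string, int $limit = 100): string
{
    if (strlen($string) <= $limit) {
        return $string;
    }

    // Look one byte past the limit so a word ending exactly at the limit is kept.
    $head = substr($string, 0, $limit + 1);
    $lastSpace = strrpos($head, ' ');

    // A single very long word: fall back to a hard cut.
    if ($lastSpace === false) {
        return substr($string, 0, $limit);
    }

    return substr($head, 0, $lastSpace);
}

/**
 * Wrap every case-insensitive occurrence of $term in <b> tags,
 * keeping the spelling found in the text.
 */
function highlight(string $text, string $term): string
{
    if ($term === '') {
        return $text;
    }

    // preg_quote() keeps characters such as '.' or '?' in the term from acting as regex syntax.
    $pattern = '/' . preg_quote($term, '/') . '/i';

    return preg_replace($pattern, '<b>$0</b>', $text);
}

echo cut('hello world foo', 8), "\n";                 // hello
echo highlight('Zend Search and zend', 'zend'), "\n"; // <b>Zend</b> Search and <b>zend</b>
```

The /i flag only folds the case of ASCII letters here; matching ‘Ä’ against ‘ä’ would need additional handling.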

search_index.php
class search_index
Manages everything concerning your index, such as appending a list of pages (checking that a website exists before creating a new page_add object), creating a new index, or searching the index.

Added 2006-10-12: When searching the index, only the best MAX_RESULTS (you can change this constant in include.php) results are returned. This is because:

1. Browsing your results on several single pages is not supported at this time. You could work around it by caching the results in a database etc., but we don’t want users’ websites to access a database.

2. Users hardly ever look at results beyond the first, let’s say, 30; so a limit of about 50 results makes sense.
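
A minimal sketch of such a limit, assuming each hit is available as an array with a 'url' and a 'score' entry (a hypothetical structure, not the one used by the attached code) and that MAX_RESULTS is set to 50:

```php
<?php
// Assumed value; in the attached code the constant is defined in include.php.
const MAX_RESULTS = 50;

/**
 * Return the $max best hits, highest score first.
 * Each hit is assumed to be an array like ['url' => ..., 'score' => ...].
 */
function best_results(array $hits, int $max = MAX_RESULTS): array
{
    usort($hits, fn(array $a, array $b): int => $b['score'] <=> $a['score']);

    return array_slice($hits, 0, $max);
}

$hits = [
    ['url' => 'http://mysite.com/a.html', 'score' => 0.2],
    ['url' => 'http://mysite.com/b.html', 'score' => 0.9],
    ['url' => 'http://mysite.com/c.html', 'score' => 0.5],
];
$top = best_results($hits, 2);
echo $top[0]['url'], "\n"; // http://mysite.com/b.html
```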

string search_index::simplify(string $string) formats strings to the index’s format, i.e. it converts special characters and encodes from UTF-8 to ASCII (see also string search_index::unsimplify(string $string) for the other way round).

void search_index::append(page_add $list[]) saves time by only adding pages that are not yet in the index.
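
The time saving in append() comes from filtering the list before any page is fetched. A rough sketch of that filter, working on plain arrays of URLs (the function name and the array-based interface are assumptions, not the attached implementation):

```php
<?php
/**
 * Return the URLs from $candidateUrls that are not yet in $indexedUrls.
 * Duplicates within the candidate list are dropped as well.
 */
function pages_to_add(array $candidateUrls, array $indexedUrls): array
{
    // array_diff() preserves keys, so array_values() re-indexes the result.
    return array_values(array_diff(array_unique($candidateUrls), $indexedUrls));
}

$new = pages_to_add(
    ['http://mysite.com/a.html', 'http://mysite.com/b.html', 'http://mysite.com/a.html'],
    ['http://mysite.com/b.html']
);
echo implode(', ', $new), "\n"; // http://mysite.com/a.html
```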

sites_list.php
class sites_list
manages everything concerning the list of addresses you want to add to the index
grabs links from a Google XML sitemap or a simple HTML file and formats them afterwards (creates absolute links, removes session IDs, filters mailto links, etc.)
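
The formatting step can be sketched as follows. This is a simplified illustration, not the code from sites_list.php: the function name is hypothetical, the session parameter is assumed to be called PHPSESSID, and relative links are simply appended to the base URL without resolving “../”.

```php
<?php
/**
 * Clean up a list of raw links: drop mailto links and anchors, make relative
 * links absolute, strip the session ID and remove duplicates.
 */
function clean_links(array $links, string $baseUrl): array
{
    $clean = [];
    foreach ($links as $link) {
        $link = trim($link);

        // Skip empty entries, mailto links and in-page anchors.
        if ($link === '' || stripos($link, 'mailto:') === 0 || $link[0] === '#') {
            continue;
        }

        // Make relative links absolute (simplified: no "../" resolution).
        if (!preg_match('#^https?://#i', $link)) {
            $link = rtrim($baseUrl, '/') . '/' . ltrim($link, '/');
        }

        // Remove the session parameter (assumed name: PHPSESSID) and any dangling '?' or '&'.
        $link = preg_replace('/([?&])PHPSESSID=[^&]*(&|$)/i', '$1', $link);
        $link = rtrim($link, '?&');

        // Using the link as array key drops duplicates.
        $clean[$link] = true;
    }

    return array_keys($clean);
}

$links = clean_links(
    ['/about_me.html', 'mailto:me@mysite.com', 'http://mysite.com/entry.php?ID=12345&PHPSESSID=abc', '/about_me.html'],
    'http://mysite.com/'
);
echo implode("\n", $links), "\n";
```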

TO DO LIST

This tool does not (yet) support these functions:

  • At the moment, the tool permits searching either for one word or for several words combined by the OR operator.
    The script modifies phrases such as “word1 word2” (mind the quotation marks) to word1 OR word2.
    A query for +word1 +word2 usually returns no results. Implementing an “AND” search might be difficult, since a query like “at least one of the fields title, contents, links must contain word1 and at least one of the fields title, contents, links must contain word2” is not supported yet. You could only work around it by searching only in the field ‘contents’. The same goes for a term query: the engine doesn’t seem to support a term query in multiple fields. You might work around it by executing two or more queries (one in the content, one in the title, etc.) and merging the results.
  • You cannot search for URLs such as http://mysite.com/about_me.html since we do not encode slashes, colons and full stops. However, if you have websites such as http://mysite.com/entry.php?ID=12345, a request for ID=12345 returns the correct result.
  • List your results on several pages. For now, all your results are shown on a single page.
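
The workaround mentioned in the first item (executing several queries and merging the results) could look roughly like this for an “AND” search: one query per word across all fields, keeping only the pages that were found every time. The sketch is untested against Zend_Search itself; $search stands for any callable that takes a query string and returns a list of URLs, and the function name is hypothetical.

```php
<?php
/**
 * Return the URLs that match every word of $query in at least one field.
 * $search is a callable: fn(string $queryString): array of URLs.
 */
function search_all_words(string $query, callable $search, array $fields = ['headline', 'contents', 'link']): array
{
    $words = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);
    $result = null;

    foreach ($words as $word) {
        // One query per word, covering every field.
        $parts = [];
        foreach ($fields as $field) {
            $parts[] = $field . ':' . $word;
        }
        $urls = $search(implode(' ', $parts));

        // Keep only the URLs that have matched every word so far.
        $result = $result === null ? $urls : array_intersect($result, $urls);
    }

    return array_values($result ?? []);
}
```

In practice $search would wrap the call to the index and map each hit to its link. Note that this runs one query per word, so long queries cost more time.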

YOUR FEEDBACK / QUESTIONS

Please post your feedback as a comment on this blog or write an e-mail to Natalie. Please understand that we cannot offer support for problems that do not result from our script.

DOWNLOAD

>> Download Release 1.1 (2006-10-12) as a ZIP-File (10 KB)

27 Comments on “Zend_Search (Java Lucene)”

  2. siliman Says:

    Hi;

    Very well done search with Zend. I also had the same problems, but never thought of changing the string representation for the Slovenian language.

    Making:

    static protected $char_old;
    static protected $char_new;

    static function simplify(){}
    static function unsimplify(){}

    Might also make sense if you do not want to create the object.

    Nice Job.
