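Introduction: Lemur is a system for language modeling and information retrieval research, developed jointly by CMU and UMass. It supports ad hoc and distributed retrieval based on language models, the traditional vector space model, and Okapi, and it also offers structured queries, cross-language retrieval, filtering, clustering, and more. When this introduction was written the latest version was 3.0, and CMU and UMass planned to release a new version, Indri, in September. Indri was to add support for terabyte-scale collections (1 TB = 1000 GB) and structured document queries (for example, parsing HTML documents into different document representations that use the structural information in HTML, such as tags, title, and meta).

What is needed to run Lemur? Lemur runs on Windows or Unix, so it can be used directly on Windows. However, Lemur ships shell scripts that demonstrate the complete retrieval process, and running them on Windows requires installing Cygwin to emulate a Unix environment. Lemur also provides a GUI program and a CGI for user interaction. The GUI includes Java programs that display retrieval results directly, so a Java virtual machine must be installed, and the CGI programs need a Perl interpreter.

Download site: open the lemur folder there to see the releases from 4.3 up to the latest version.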
Download lemur-4.9-install.exe and install it.

Directory overview: ..\Lemur 4.9\
bin\ Lemur Toolkit applications: executables that can be called directly from the command line; see windoc\lemur-applications.html
include\ The Lemur include files
lib\ The Lemur library
windoc\ Overview of the Lemur Toolkit, with these contents:
    Overview of the Lemur Toolkit: Installed Applications; Using the Lemur Toolkit API
    Indexing: Indexing Overview; Document Formats
    Retrieval: Batch Retrieval Methods; The Indri Query Language for Retrieval; The InQuery Query Language for Retrieval
src_vs_2005\ The complete Lemur Toolkit source code for the Microsoft platform
javadoc\ Java API documentation
GUI\ RetUI.jar provides a basic document retrieval GUI for interactive queries, using the Indri API. IndexUI.jar provides a basic collection indexing GUI for building an Indri repository. LemurRet.jar provides a basic document retrieval GUI for interactive queries, using the Lemur API. LemurIndex.jar provides a basic collection indexing GUI for building Lemur indexes. lemur.jar and indri.jar are for the Lemur and Indri APIs.
doc\ Lemur Toolkit Documentation, for example: Namespace List | Class Hierarchy | Alphabetical List | Class List | Directories | File List | Namespace Members | Class Members | File Members | Related Pages
CSharp\ The C# wrapper classes assembly is in LemurCsharp.dll. This assembly should be referenced by your C# program.

Usage:
(1) Use the Lemur programs as they are, that is, the executables under bin\.
(2) Building applications using Visual Studio .NET, that is, calling the Lemur library directly from your own project. After installing the Lemur Toolkit, you can use the library by adding the include subfolder of the target directory to the "C/C++ / General / Additional Include Directories" property of your project. Next, add the lib subfolder of the target directory to the "Linker / General / Additional Library Directories" property of your project. Next, add lemur.lib and wsock32.lib to the "Linker / Input / Additional Dependencies" property of your project. Also, if your project is configured as "Debug", you should choose the "Multi-threaded Debug DLL (/MDd)" runtime library.
If your project is configured as "Release", you should choose the "Multi-threaded DLL (/MD)" runtime library. The installable Lemur library and applications were built in Release / Multi-Threaded mode. Finally, you should have "C/C++ / Language / Enable Run-Time Type Info" set to yes.
(3) Compiling the Lemur Toolkit with Visual Studio .NET, that is, modifying Lemur to fit your own requirements, then recompiling it and calling the result. The installer can optionally install the full Lemur Toolkit source tree, placing it in the "src_vs_2003" subfolder and/or the "src_vs_2005" subfolder of the target directory, depending on which version(s) of Visual Studio you have installed. That folder contains the Visual Studio solution file "Lemur.sln". There is a separate project file for each library and for each application in Lemur. By default the project configurations are built in "Debug" mode. To compile with fewer warnings and run at higher efficiency, open the "Build" menu, choose "Configuration Manager", and in the menu for "Active Solution Configuration" choose "Release". When built from source, there is a separate library for each of the sub-libraries that are compiled into "lemur.lib". The combined library, "lemur.lib", is built in the lemur subfolder, with output in either Release or Debug, depending on the configuration.

Important notes:
1. Before compiling the toolkit from source, you must set the proper include path for the Java library. To set it, in the Solution Explorer view, right-click the "lemur_jni" project and choose "Properties". Set the "Configuration" drop-down box (at the top of the dialog box) to "All Configurations". Next, in the "Additional Include Directories" field, enter the paths to your Java JDK installation's include directory and include/win32 directory. Press the "OK" button when finished, and rebuild.
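[If the file 'jni.h' still cannot be found, add the JDK's include and win32 directories to Additional Include Directories as separate entries.]
2. To avoid errors such as error PRJ0008 : could not delete file "e:\lemur 4.8\src_vs_2005\app\obj\vc80.pdb", or the file failing to open, change one setting. This is a parallel project builds problem, so set "maximum number of parallel project builds" to 1. (Possibly an issue on CPUs with two or more cores?)
3. Lemur includes support for Arabic, which can cause character-encoding problems on a Chinese system, so the modules that handle Arabic need to be disabled. Find the file Arabic_Stemmer.cpp in the parsing module and blank out the contents of its functions. For functions that return void, comment out the entire function body; for functions with a return type, comment out the whole function. Note that the module itself must not be deleted, because other modules call its interfaces, and removing those interfaces would stop the program from compiling.

Reference documentation: Lemur Toolkit and Indri Search Engine Documentation. Main contents:
Where to Begin...: Overview; Compiling and Installing; Technical Details
Using the Toolkit: Toolkit Usage Overview; Building Indexes; Retrieval Tasks; Lemur Toolkit Utilities; The Indri Query Language; The Lemur CGI
Application Programming with the Toolkit: Using the Lemur Toolkit with C/C++; Using the Lemur Toolkit with C Sharp; Using the Lemur Toolkit with Java; Extending the Toolkit Libraries
Lemur and Indri for Multilingual Tasks: Multilingual Overview; Lemur/Indri and Chinese Text; Lemur/Indri and Arabic Text
Reference: Table of Contents; The Lemur Toolkit API documentation; Site Index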
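
The project settings listed under usage mode (2), the /MDd or /MD runtime library and Enable Run-Time Type Info, are easy to get wrong in one configuration. The following is a minimal sketch for checking them from inside your own project. It relies only on predefined macros of the Microsoft compiler (_DLL, _DEBUG, _CPPRTTI) and does not call anything from Lemur. The function name describeBuildSettings is a hypothetical helper made up for this illustration.

```cpp
#include <string>

// Reports which of the settings named in the text this file was compiled
// with, based on predefined macros of the Microsoft compiler:
//   _DLL     is defined for /MD and /MDd (DLL runtime)
//   _DEBUG   is defined for debug runtimes such as /MDd
//   _CPPRTTI is defined when Enable Run-Time Type Info (/GR) is on
// describeBuildSettings is a hypothetical helper, not part of Lemur.
std::string describeBuildSettings() {
#if defined(_MSC_VER)
    std::string out = "runtime=";
  #if defined(_DLL) && defined(_DEBUG)
    out += "/MDd";
  #elif defined(_DLL)
    out += "/MD";
  #elif defined(_DEBUG)
    out += "static debug";
  #else
    out += "static release";
  #endif
  #if defined(_CPPRTTI)
    out += " rtti=on";
  #else
    out += " rtti=off";
  #endif
    return out;
#else
    // Other compilers do not define these macros, so nothing is checked.
    return "not MSVC: settings not checked";
#endif
}
```

Printing the result from your program's main in both the Debug and Release configurations shows whether each one matches the settings described above. With a compiler other than Visual Studio the function only reports that nothing was checked.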