Learn Solr 5 with Yida: Indexing All Files in a Folder - Alibaba Cloud Developer Community

2016-05-16

Introduction:

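In the previous post we learned how to extract text from a PDF file and index it. Today we'll look at how to index every text file in a folder. Without further ado, here is the relevant configuration: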

First, as before, register the dataimport request handler in solrconfig.xml and tell it where to load data-config.xml from:

<requestHandler name="/dataimport" class="solr.DataImportHandler">  
    <lst name="defaults">  
      <str name="config">data-config.xml</str>  
    </lst>  
</requestHandler>  

Specify the path from which the dependency jars are loaded:

<dataDir>C:\solr_home\core1\data</dataDir>  
<lib dir="./lib" regex=".*\.jar"/>  
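
If the jars are not copied into the core's lib folder, one hedged alternative is to point a `<lib>` directive at the Solr distribution instead. The sketch below follows the pattern used in the example configurations shipped with Solr; the `solr.install.dir` property and its relative-path fallback are assumptions about your directory layout and need adjusting to match it.

```xml
<!-- Assumption: unless the solr.install.dir system property is set, Solr's dist/
     folder sits four levels above the core's instance directory. Adjust to your layout. -->
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
```

The regex also picks up the `solr-dataimporthandler-extras` jar, since its name starts with the same prefix.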

The required jars are shown in the screenshot:

The key part is then the data-config.xml itself, which looks like this:

<dataConfig>  
    <dataSource name="fileDataSource" type="FileDataSource" />  
    <!--  
    <document>  
        <entity name="tika-test" processor="TikaEntityProcessor"  
                url="C:/docs/solr-word.pdf" format="text">  
                <field column="Author" name="author" meta="true"/>  
                <field column="title" name="title" meta="true"/>  
                <field column="text" name="text"/>  
        </entity>  
    </document>  
    -->  
    <dataSource name="urlDataSource" type="BinURLDataSource" />  
    <document>  
        <!-- Outer entity: lists the files under baseDir. The extension alternatives are grouped
             and anchored so that only the file extension is tested. -->
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/docs" fileName=".*\.(doc|docx|pdf|txt)$"
                onError="skip"
                recursive="true">
            <field column="fileAbsolutePath" name="filePath" />
            <field column="fileSize" name="size" />
            <field column="fileLastModified" name="lastModified" />

            <!-- Inner entity: reads each listed file as plain text; the content arrives in the plainText column -->
            <entity processor="PlainTextEntityProcessor" name="txtfile" url="${files.fileAbsolutePath}" dataSource="fileDataSource">
                <field column="plainText" name="text"/>
            </entity>
        </entity>
    </document>   
</dataConfig>  

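baseDir is the folder whose files are collected. fileName takes a regular expression that selects which files under baseDir get indexed, so you can filter out the ones you don't want. processor names the entity processor that produces the rows for an entity, and each kind of processor emits its own default set of fields. FileListEntityProcessor produces one entity per file found under the given folder, and each of those entities includes the fields fileAbsolutePath, fileSize, fileLastModified, and fileName. recursive controls whether files in subdirectories are picked up as well. onError decides what happens when an exception occurs; with skip, the document that triggered it is skipped and the import carries on.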
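
To make the fileName filter concrete, here is a minimal sketch of a file-listing configuration with the extension alternatives grouped and anchored, so the pattern tests only the extension. It assumes the patterns are treated as Java regular expressions applied to the bare file name; the `draft-` prefix filtered by `excludes` is hypothetical, added purely for illustration.

```xml
<dataConfig>
    <dataSource name="fileDataSource" type="FileDataSource" />
    <document>
        <!-- Lists files under c:/docs (and its subfolders) whose names end in one of the
             grouped extensions; names starting with the hypothetical "draft-" prefix are left out. -->
        <entity name="files" dataSource="null" rootEntity="false"
                processor="FileListEntityProcessor"
                baseDir="c:/docs"
                fileName=".*\.(doc|docx|pdf|txt)$"
                excludes="^draft-.*"
                recursive="true"
                onError="skip">
            <field column="fileAbsolutePath" name="filePath" />
            <!-- Reads the content of each listed file as plain text -->
            <entity name="txtfile" processor="PlainTextEntityProcessor"
                    url="${files.fileAbsolutePath}" dataSource="fileDataSource">
                <field column="plainText" name="text" />
            </entity>
        </entity>
    </document>
</dataConfig>
```

Note that in a pattern such as `.*\.(doc)|(pdf)`, the `|` separates whole alternatives, so only the first one is tied to the dot. Grouping the extensions as `(doc|pdf)` keeps the extension test intact.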

Next, we need to define the fields in schema.xml:

<field name="userName" type="string" indexed="true" stored="true" omitNorms="true"/>    
   <field name="sex" type="boolean" indexed="true" stored="true" omitNorms="true"/>    
   <field name="birth" type="cndate" indexed="true" stored="true" omitNorms="true"/>   
   <field name="salary" type="int" indexed="true" stored="true" omitNorms="true"/>  
  
   <field name="text" type="text_ik" indexed="true" stored="true" omitNorms="true" multiValued="false"/>  
   <field name="author" type="string" indexed="true" stored="true" />  
   <field name="title" type="string" indexed="true" stored="true" />  
  
   <field name="fileName" type="string" indexed="true" stored="true" />  
   <field name="filePath" type="string" indexed="true" stored="true" required="true" multiValued="false" />  
   <field name="size" type="long" indexed="true" stored="true" />  
   <field name="lastModified" type="cndate" indexed="true" stored="true" />  
   <!-- Only remove the "id" field if you have a very good reason to. While not strictly  
     required, it is highly recommended. A <uniqueKey> is present in almost all Solr   
     installations. See the <uniqueKey> declaration below where <uniqueKey> is set to "id".  
     Do NOT change the type and apply index-time analysis to the <uniqueKey> as it will likely   
     make routing in SolrCloud and document replacement in general fail. Limited _query_ time  
     analysis is possible as long as the indexing process is guaranteed to index the term  
     in a compatible way. Any analysis applied to the <uniqueKey> should _not_ produce multiple  
     tokens  
   -->     
   <field name="id" type="string" indexed="true" stored="true" required="false" multiValued="false" />   
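
The data-config.xml above never populates `id`, so if the schema's `<uniqueKey>` still points at `id`, the imported file documents would have no key. One possible arrangement, offered as an assumption rather than the setup used in this post, is to key documents on the file path, which the `filePath` field already stores as a required, single-valued string:

```xml
<!-- Assumption: each indexed file is identified by its absolute path.
     Re-importing the same file then replaces its earlier document instead of adding a duplicate. -->
<uniqueKey>filePath</uniqueKey>
```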

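That completes the configuration. Put a few txt files in the C:/docs directory for testing. Note: make sure the txt files are saved as UTF-8. On a Chinese-locale Windows system, txt files are saved as GBK by default, and this is a mistake many beginners make, so consider yourself warned!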
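
Rather than relying on defaults, the encoding can also be stated on the data source itself. A minimal sketch, assuming the `encoding` attribute of `FileDataSource` and that every file under baseDir shares one encoding:

```xml
<!-- Assumption: all files under baseDir are UTF-8.
     If the txt files were saved as GBK instead, encoding="GBK" could be tried here
     rather than re-saving the files. -->
<dataSource name="fileDataSource" type="FileDataSource" encoding="UTF-8" />
```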

Then restart your Tomcat and run the index import, as shown in the screenshot:

As usual, switch to the Query menu and run a test query, as shown in the screenshot:

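OK, all done! I'll shortly upload this post's sample configuration files and the txt files used for testing as an attachment below. (The jars are too large to include in the attachment; a package containing the complete set of jars will go on my Baidu Netdisk.)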