
How to Easily Search DOCX, DOC, PDF Documents in PHP Converting them to Plain Text - PHP Cl...

Author: User  Source: Internet  Time: 2017-12-01 14:23:13


Author: Ash Kiswany

Posted on: 2015-11-10

Categories: PHP Tutorials

Over the past few decades, the massive digitalization of processes has led companies and individuals to produce large numbers of rich text documents in the DOCX, DOC and PDF formats.

This creates a problem: to search these documents we need access to the text they contain, and that text is not stored as plain text inside the files.

Read this article to learn how to solve the problem of searching and indexing these documents using a PHP class that can easily extract the text contents.


Contents

Introduction
Extracting the Text from Document Files
Searching the Document Text
Conclusion

Introduction

To search DOCX, DOC and PDF files, we first need to extract the text they contain. We can then save that text to a database and let the database server perform the searches using query parameters, or we can search the text directly in PHP code.

One solution for extracting the text from these kinds of documents is the PHP DOC DOCX PDF to Text class. This class can extract text from PDF files as well as Microsoft Word files, including older versions that use a proprietary binary file format.


Extracting the Text from Document Files

With this example I will show you how easy it is to convert any document to plain text:

<?php
require("class.filetotext.php");

$docObj = new Filetotext("test.docx");
$return = $docObj->convertToText();
print $return;

As you can see, we include the class file and then create a new Filetotext object, which takes the file path as its parameter. Then we call the convertToText() method on the object, which returns the converted text.
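To give a feel for what such a conversion involves, here is a minimal sketch for the DOCX case only. It relies on the fact that a DOCX file is a ZIP archive whose main body is stored in word/document.xml. The helper name docxToPlainText() is hypothetical, and this is not necessarily how the Filetotext class works internally. It assumes the PHP zip extension is available and ignores headers, footers, tabs and explicit line breaks.

```php
<?php
// Sketch only: docxToPlainText() is a hypothetical helper,
// not part of the Filetotext class described in this article.
function docxToPlainText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        throw new RuntimeException("Cannot open $path as a ZIP archive");
    }

    // The main document body is stored in word/document.xml.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException("No word/document.xml found in $path");
    }

    // Add a newline at the end of each paragraph so that words
    // from adjacent paragraphs do not run together.
    $xml = str_replace('</w:p>', "\n</w:p>", $xml);

    // Remove the XML tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');

    return trim($text);
}
```

The older DOC format and PDF files are considerably harder to parse by hand, which is where a ready-made class saves the most work.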

Here are two more examples. The code is basically the same for any document in the supported formats.

<?php
require("class.filetotext.php");

$docObj = new Filetotext("test1.doc");
$return = $docObj->convertToText();
print $return;

<?php
require("class.filetotext.php");

$docObj = new Filetotext("test2.pdf");
$return = $docObj->convertToText();
print $return;

Searching the Document Text

With this class it is a piece of cake to convert any DOCX, DOC or PDF to plain text. The resulting text may not be suitably formatted for display to users, but it works well for searching purposes.

In some cases we can also modify the text and then save it back to a document file in the original format, using another class that can generate documents in the format we want.

Anyway, for searching purposes the text can be stored in any database, so we can perform searches of multiple documents with a single query. If you use a database like MySQL you can store the text in a field with an associated full text index.

CREATE TABLE documents (
    id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
    filename VARCHAR(255),
    contents TEXT(65535),
    FULLTEXT search_index (contents)
) ENGINE=InnoDB;

Then we can use the MATCH expression in SELECT queries to search for given text inside the document contents.

SELECT id, filename
FROM documents
WHERE MATCH (contents) AGAINST ('search keywords here' IN NATURAL LANGUAGE MODE);
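From PHP, the same statements could be run through PDO with prepared statements, so that file names, document text and search keywords are passed as bound parameters rather than concatenated into the SQL. The following is a sketch under assumptions: the function names storeDocument() and searchDocuments() are hypothetical, the table is the documents table created above, and the connection details in the usage comment are placeholders.

```php
<?php
// Sketch only: both function names are hypothetical helpers.

// Insert the extracted text of one document and return its new id.
function storeDocument(PDO $pdo, string $filename, string $contents): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO documents (filename, contents) VALUES (:filename, :contents)'
    );
    $stmt->execute([':filename' => $filename, ':contents' => $contents]);

    return (int) $pdo->lastInsertId();
}

// Run a MySQL full text search and return the matching rows.
function searchDocuments(PDO $pdo, string $keywords): array
{
    $stmt = $pdo->prepare(
        'SELECT id, filename FROM documents
         WHERE MATCH (contents) AGAINST (:keywords IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute([':keywords' => $keywords]);

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Example usage (DSN, user name and password are placeholders):
// require("class.filetotext.php");
// $pdo = new PDO('mysql:host=localhost;dbname=docs;charset=utf8mb4', 'user', 'secret',
//     [PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION]);
// $docObj = new Filetotext("test.docx");
// storeDocument($pdo, "test.docx", $docObj->convertToText());
// print_r(searchDocuments($pdo, 'search keywords here'));
```

Keeping the conversion step separate from the storage step means the same two functions work for DOCX, DOC and PDF files alike, since they only ever see plain text.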

If you just want to search the text of a single document, you can use, for instance, the PHP strpos function to look for keywords in the text.
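Keep in mind that strpos is case-sensitive and looks for the given string as one exact phrase. When the keywords may appear in any order or in different letter case, a small helper along the following lines could be used instead. The name containsAllKeywords() is hypothetical and the code is only a sketch.

```php
<?php
// Sketch only: containsAllKeywords() is a hypothetical helper name.
function containsAllKeywords(string $contents, array $keywords): bool
{
    foreach ($keywords as $keyword) {
        // Skip empty keywords, which would otherwise match trivially or trigger a warning.
        if ($keyword === '') {
            continue;
        }
        // stripos() ignores ASCII letter case; mb_stripos() is the
        // multibyte-aware alternative when the mbstring extension is loaded.
        if (stripos($contents, $keyword) === false) {
            return false;
        }
    }

    return true;
}

$contents = 'The Quick Brown Fox';
var_dump(containsAllKeywords($contents, ['quick', 'fox']));  // bool(true)
var_dump(containsAllKeywords($contents, ['quick', 'wolf'])); // bool(false)
```

For a single exact, case-sensitive phrase, a plain strpos check remains the simplest option: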

if (strpos($contents, 'search keywords here') !== false) {
    echo 'Keywords found!';
}

Conclusion

The PHP DOC DOCX PDF to Text class can make any document easy to search. It is a very good tool for creating a big online database of books, where the text of these documents can be crawled by search engines.

If you liked this article or have questions about extracting and searching text from document files, post a comment here.
