Quarkus Tika
This guide explains how your Quarkus application can use Apache Tika to parse the documents.
Apache Tika is a content analysis toolkit which is used to parse the documents in PDF, Open Document, Excel and many other well known binary and text formats using a simple uniform API. Both the document text and properties (metadata) are available once the document has been parsed.
If you are planning to run the application as a native executable and parse documents that may have been created with charsets different than the standard ones supported in Java such as
|
Docker
When building native images in Docker using the standard Quarkus Docker configuration files some additional features need to be
installed to support Apache POI. Specifically font information is not included in Red Hat’s ubi-minimal images. To install it
simply add these lines to your DockerFile.native
file:
FROM registry.access.redhat.com/ubi8/ubi-minimal:8.9
######################### Set up environment for POI ##########################
RUN microdnf update && microdnf install freetype fontconfig && microdnf clean all
######################### Set up environment for POI ##########################
WORKDIR /work/
RUN chown 1001 /work \
&& chmod "g+rwX" /work \
&& chown 1001:root /work
# Shared objects to be dynamically loaded at runtime as needed,
COPY --chown=1001:root target/*.properties target/*.so /work/
COPY --chown=1001:root target/*-runner /work/application
# Permissions fix for Windows
RUN chmod "ugo+x" /work/application
EXPOSE 8080
USER 1001
CMD ["./application", "-Dquarkus.http.host=0.0.0.0"]
Make sure .dockerignore does not exclude .so files!
|
Prerequisites
To complete this guide, you need:
-
less than 20 minutes
-
an IDE
-
JDK 11+ installed with
JAVA_HOME
configured appropriately -
Apache Maven {maven-version}
-
Docker
Solution
We recommend that you follow the instructions in the next sections and create the application step by step. However, you can go right to the completed example.
Clone the Git repository: git clone https://github.com/quarkusio/quarkus-quickstarts.git
, or download an archive.
The solution is located in the tika-quickstart
directory.
The provided solution contains a few additional elements such as tests and testing infrastructure. |
Creating the Maven Project
First, we need a new project. Create a new project with the following command:
mvn io.quarkus.platform:quarkus-maven-plugin:3.8.4:create \
-DprojectGroupId=org.acme.example \
-DprojectArtifactId=tika-quickstart \
-DclassName="org.acme.tika.TikaParserResource" \
-Dpath="/parse" \
-Dextensions="tika,resteasy-reactive,awt"
cd tika-quickstart
This command generates a Maven project, importing the tika
and resteasy-reactive
extensions.
We also import the AWT extension as some of the Tika operations done in this guide require it.
If you already have your Quarkus project configured, you can add the extensions to your project by running the following command in your project base directory:
./mvnw quarkus:add-extension -Dextensions="tika,resteasy-reactive,awt"
This will add the following to your pom.xml
:
<dependency>
<groupId>io.quarkiverse.tika</groupId>
<artifactId>quarkus-tika</artifactId>
<version>2.0.4</version>
</dependency>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-resteasy-reactive</artifactId>
</dependency>
<dependency>
<groupId>io.quarkus</groupId>
<artifactId>quarkus-awt</artifactId>
</dependency>
Examine the generated JAX-RS resource
Open the src/main/java/org/acme/tika/TikaParserResource.java
file and see the following content:
package org.acme.tika;
import jakarta.ws.rs.GET;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
@Path("/parse")
public class TikaParserResource {
@GET
@Produces(MediaType.TEXT_PLAIN)
public String hello() {
return "hello";
}
}
Update the JAX-RS resource
Next update TikaParserResource
to accept and parse PDF and OpenDocument format documents:
package org.acme.tika;
import java.io.InputStream;
import java.time.Duration;
import java.time.Instant;
import jakarta.inject.Inject;
import jakarta.ws.rs.Consumes;
import jakarta.ws.rs.POST;
import jakarta.ws.rs.Path;
import jakarta.ws.rs.Produces;
import jakarta.ws.rs.core.MediaType;
import io.quarkus.tika.TikaParser;
import org.jboss.logging.Logger;
@Path("/parse")
public class TikaParserResource {
private static final Logger log = Logger.getLogger(TikaParserResource.class);
@Inject
TikaParser parser;
@POST
@Path("/text")
@Consumes({"application/pdf", "application/vnd.oasis.opendocument.text"})
@Produces(MediaType.TEXT_PLAIN)
public String extractText(InputStream stream) {
Instant start = Instant.now();
String text = parser.getText(stream);
Instant finish = Instant.now();
log.info(Duration.between(start, finish).toMillis() + " mls have passed");
return text;
}
}
As you can see the JAX-RS resource method was renamed to extractText
, @GET
annotation was replaced with POST
and @Path(/text)
annotation was added, and @Consumes
annotation shows that PDF and OpenDocument media type formats can now be accepted. An injected TikaParser
is used to parse the documents and report the extracted text. It also measures how long does it take to parse a given document.
Run the application
Now we are ready to run our application. Use:
./mvnw compile quarkus:dev
and you should see output similar to:
$ ./mvnw clean compile quarkus:dev
[INFO] Scanning for projects...
[INFO]
INFO] --------------------< org.acme.example:apache-tika >--------------------
[INFO] Building apache-tika 1.0.0-SNAPSHOT
[INFO] --------------------------------[ jar ]---------------------------------
...
Listening for transport dt_socket at address: 5005
2019-10-15 14:23:26,442 INFO [io.qua.dep.QuarkusAugmentor] (main) Beginning quarkus augmentation
2019-10-15 14:23:26,991 INFO [io.qua.dep.QuarkusAugmentor] (main) Quarkus augmentation completed in 549ms
2019-10-15 14:23:27,637 INFO [io.quarkus] (main) Quarkus 999-SNAPSHOT started in 1.361s. Listening on: http://0.0.0.0:8080
2019-10-15 14:23:27,638 INFO [io.quarkus] (main) Profile dev activated. Live Coding activated.
2019-10-15 14:23:27,639 INFO [io.quarkus] (main) Installed features: [cdi, resteasy-reactive, tika]
Now that the REST endpoint is running, we can get it to parse PDF and OpenDocument documents using a command line tool like curl:
$ curl -X POST -H "Content-type: application/pdf" --data-binary @target/classes/quarkus.pdf http://localhost:8080/parse/text
Hello Quarkus
and
$ curl -X POST -H "Content-type: Content-type: application/vnd.oasis.opendocument.text" --data-binary @target/classes/quarkus.odt http://localhost:8080/parse/text
Hello Quarkus
Building a native executable
You can build a native executable with the usual command ./mvnw package -Dnative
.
Running it is as simple as executing ./target/tika-quickstart-1.0.0-SNAPSHOT-runner
.
Configuration Reference
Configuration property fixed at build time - All other configuration properties are overridable at runtime
Type |
Default |
|
---|---|---|
The resource path within the application artifact to the Environment variable: |
string |
|
Comma separated list of the parsers which must be supported. Most of the document formats recognized by Apache Tika are supported by default but it affects the application memory and native executable sizes. One can list only the required parsers in Either the abbreviated or full parser class names can be used. Only PDF and OpenDocument format parsers can be listed using the reserved 'pdf' and 'odf' abbreviations. Custom class name abbreviations have to be used for all other parsers. For example:
This property will have no effect if the `tikaConfigPath' property has been set. Environment variable: |
string |
|
Controls how the content of the embedded documents is parsed. By default it is appended to the main document content. Setting this property to false makes the content of each of the embedded documents available separately. Environment variable: |
boolean |
|
Configuration of the individual parsers. For example:
Environment variable: |
|
|
Full parser class name for a given parser abbreviation. For example:
Environment variable: |
|