TailFile

TailFile 2.5.0

Bundle: org.apache.nifi | nifi-standard-nar
Description: "Tails" a file, or a list of files, ingesting data from the file as it is written to the file. The file is expected to be textual. Data is ingested only when a new line is encountered (carriage return or new-line character or combination). If the file to tail is periodically "rolled over", as is generally the case with log files, an optional Rolling Filename Pattern can be used to retrieve data from files that have rolled over, even if the rollover occurred while NiFi was not running (provided that the data still exists upon restart of NiFi). It is generally advisable to set the Run Schedule to a few seconds, rather than running with the default value of 0 secs, as this Processor will consume a lot of resources if scheduled very aggressively. At this time, this Processor does not support ingesting files that have been compressed when 'rolled over'.
Tags: file, log, source, tail, text
Input Requirement: FORBIDDEN
Supports Sensitive Dynamic Properties: false

Additional Details for TailFile 2.5.0
TailFile

Introduction

This processor offers a powerful capability, allowing the user to periodically look at a file that is actively being written to by another process. When the file changes, the new lines are ingested. This Processor assumes that data in the file is textual.

Tailing a file from a filesystem is a seemingly simple but notoriously difficult task. This is because we are periodically checking the contents of a file that is being written to. The file may be constantly changing, or it may rarely change. The file may be “rolled over” (i.e., renamed) and it’s important that even after restarting the application (NiFi, in this case), we are able to pick up where we left off. Other additional complexities also come into play. For example, NFS mounted drives may indicate that data is readable but then return NUL bytes (Unicode 0) when attempting to read, as the actual bytes are not yet known (see the property), and file systems have different timestamp granularities.

This Processor is designed to handle all of these different cases. This can lead to slightly more complex configuration, but this document should provide you with all you need to get started!

Modes

This processor is used to tail a file or multiple files, depending on the configured mode. The mode to choose depends on the logging pattern followed by the file(s) to tail. In any case, if there is a rolling pattern, the rolling files must be plain text files (compression is not supported at the moment).
- Single file: the processor will tail the file with the path given in ‘File(s) to tail’ property.
- Multiple files: the processor will look for files into the ‘Base directory’. It will look for file recursively according to the ‘Recursive lookup’ property and will tail all the files matching the regular expression provided in the ‘File(s) to tail’ property.
Rolling filename pattern

In case the ‘Rolling filename pattern’ property is used, when the processor detects that the file to tail has rolled over, the processor will look for possible missing messages in the rolled file. To do so, the processor will use the pattern to find the rolling files in the same directory as the file to tail.

In order to keep this property available in the ‘Multiple files’ mode when multiples files to tail are in the same directory, it is possible to use the ${filename} tag to reference the name (without extension) of the file to tail. For example, if we have:

/my/path/directory/my-app.log.1 /my/path/directory/my-app.log /my/path/directory/application.log.1 /my/path/directory/application.log

the ‘rolling filename pattern’ would be ${filename}.log.*.

Descriptions for different modes and strategies

The ‘Single file’ mode assumes that the file to tail has always the same name even if there is a rolling pattern. Example:

/my/path/directory/my-app.log.2 /my/path/directory/my-app.log.1 /my/path/directory/my-app.log

and new log messages are always appended in my-app.log file.

In case recursivity is set to ’true’. The regular expression for the files to tail must embrace the possible intermediate directories between the base directory and the files to tail. Example:

/my/path/directory1/my-app1.log /my/path/directory2/my-app2.log /my/path/directory3/my-app3.log

Base directory: /my/path Files to tail: directory[1-3]/my-app[1-3].log Recursivity: true

If the processor is configured with ‘Multiple files’ mode, two additional properties are relevant:
- Lookup frequency: specifies the minimum duration the processor will wait before listing again the files to tail.
- Maximum age: specifies the necessary minimum duration to consider that no new messages will be appended in a file regarding its last modification date. If the amount of time that has elapsed since the file was modified is larger than this period of time, the file will not be tailed. For example, if a file was modified 24 hours ago and this property is set to 12 hours, the file will not be tailed. But if this property is set to 36 hours, then the file will continue to be tailed.
It is necessary to pay attention to ‘Lookup frequency’ and ‘Maximum age’ properties, as well as the frequency at which the processor is triggered, in order to achieve high performance. It is recommended to keep ‘Maximum age’ > ‘Lookup frequency’ > processor scheduling frequency to avoid missing data. It also recommended not to set ‘Maximum Age’ too low because if messages are appended in a file after this file has been considered “too old”, all the messages in the file may be read again, leading to data duplication.

If the processor is configured with ‘Multiple files’ mode, the ‘Rolling filename pattern’ property must be specific enough to ensure that only the rolling files will be listed and not other currently tailed files in the same directory ( this can be achieved using ${filename} tag).

Handling Multi-Line Messages

Most of the time, when we tail a file, we are happy to receive data periodically, however it was written to the file. There are scenarios, though, where we may have data written in such a way that multiple lines need to be retained together. Take, for example, the following lines of text that might be found in a log file:
```
2021-07-09 14:12:19,731 INFO [main] org.apache.nifi.NiFi Launching NiFi... 
2021-07-09 14:12:19,915 INFO [main] o.a.n.p.AbstractBootstrapPropertiesLoader Determined default application properties path to be '/Users/mpayne/devel/nifi/nifi-assembly/target/nifi-1.14.0-SNAPSHOT-bin/nifi-1.14.0-SNAPSHOT/./conf/nifi.properties' 
2021-07-09 14:12:19,919 INFO [main] o.a.nifi.properties.NiFiPropertiesLoader Loaded 199 properties from /Users/mpayne/devel/nifi/nifi-assembly/target/nifi-1.14.0-SNAPSHOT-bin/nifi-1.14.0-SNAPSHOT/./conf/nifi.properties 
2021-07-09 14:12:19,925 WARN Line 1 of Log Message 			Line 2: This is an important warning. 			Line 3: Please do not ignore this warning. 			Line 4: These lines of text make sense only in the context of the original message. 
2021-07-09 14:12:19,941 INFO [main] Final message in log file
```
In this case, we may want to ensure that the log lines are not ingested in such a way that our multi-line log message is not broken up into Lines 1 and 2 in one FlowFile and Lines 3 and 4 in another. To accomplish this, the Processor exposes the property. If we set this Property to a value of \d{4}-\d{2}-\d{2}, then we are telling the Processor that each message should begin with 4 digits, followed by a dash, followed by 2 digits, a dash, and 2 digits. I.e., we are telling it that each message begins with a timestamp in yyyy-MM-dd format. Because of this, even if the Processor runs and sees only Lines 1 and 2 of our multiline log message, it will not ingest the data yet. It will wait until it sees the next message, which starts with a timestamp.

Note that, because of this, the last message that the Processor will encounter in the above situation is the “Final message in log file” line. At this point, the Processor does not know whether the next line of text it encounters will be part of this line or a new message. As such, it will not ingest this data. It will wait until either another message is encountered (that matches our regex) or until the file is rolled over (renamed). Because of this, there may be some delay in ingesting the last message in the file, if the process that writes to the file just stops writing at this point.

Additionally, we run the chance of the Regular Expression not matching the data in the file. This could result in buffering all the file’s content, which could cause NiFi to run out of memory. To avoid this, the property limits the amount of data that can be buffered. If this amount of data is buffered, it will be flushed to the FlowFile, even if another message hasn’t been encountered.

Properties

State Location
Specifies where the state is located either local or cluster so that state can be stored appropriately in order to ensure that all data is consumed without duplicating data upon restart of NiFi
Display Name

State Location

Description

Specifies where the state is located either local or cluster so that state can be stored appropriately in order to ensure that all data is consumed without duplicating data upon restart of NiFi

API Name

File Location

Default Value

Local

Allowable Values
- Local
- Remote
Expression Language Scope

Not Supported

Sensitive

false

Required

true
File(s) to Tail
Path of the file to tail in case of single file mode. If using multifile mode, regular expression to find files to tail in the base directory. In case recursivity is set to true, the regular expression will be used to match the path starting from the base directory (see additional details for examples).

Display Name

File(s) to Tail

Description

Path of the file to tail in case of single file mode. If using multifile mode, regular expression to find files to tail in the base directory. In case recursivity is set to true, the regular expression will be used to match the path starting from the base directory (see additional details for examples).

API Name

File to Tail

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

true
Initial Start Position
When the Processor first begins to tail data, this property specifies where the Processor should begin reading data. Once data has been ingested from a file, the Processor will continue from the last point from which it has received data.
Display Name

Initial Start Position

Description

When the Processor first begins to tail data, this property specifies where the Processor should begin reading data. Once data has been ingested from a file, the Processor will continue from the last point from which it has received data.

API Name

Initial Start Position

Default Value

Beginning of File

Allowable Values
- Beginning of Time
- Beginning of File
- Current Time
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Line Start Pattern
A Regular Expression to match against the start of a log line. If specified, any line that matches the expression, and any following lines, will be buffered until another line matches the Expression. In doing this, we can avoid splitting apart multi-line messages in the file. This assumes that the data is in UTF-8 format.
Display Name

Line Start Pattern

Description

A Regular Expression to match against the start of a log line. If specified, any line that matches the expression, and any following lines, will be buffered until another line matches the Expression. In doing this, we can avoid splitting apart multi-line messages in the file. This assumes that the data is in UTF-8 format.

API Name

Line Start Pattern

Expression Language Scope

Not Supported

Sensitive

false

Required

false

Dependencies
- Tailing mode is set to any of [Single file]
Max Buffer Size
When using the Line Start Pattern, there may be situations in which the data in the file being tailed never matches the Regular Expression. This would result in the processor buffering all data from the tailed file, which can quickly exhaust the heap. To avoid this, the Processor will buffer only up to this amount of data before flushing the buffer, even if it means ingesting partial data from the file.
Display Name

Max Buffer Size

Description

When using the Line Start Pattern, there may be situations in which the data in the file being tailed never matches the Regular Expression. This would result in the processor buffering all data from the tailed file, which can quickly exhaust the heap. To avoid this, the Processor will buffer only up to this amount of data before flushing the buffer, even if it means ingesting partial data from the file.

API Name

Max Buffer Size

Default Value

64 KB

Expression Language Scope

Not Supported

Sensitive

false

Required

true

Dependencies
- Line Start Pattern is set to any value specified
Post-Rollover Tail Period
When a file is rolled over, the processor will continue tailing the rolled over file until it has not been modified for this amount of time. This allows for another process to rollover a file, and then flush out any buffered data. Note that when this value is set, and the tailed file rolls over, the new file will not be tailed until the old file has not been modified for the configured amount of time. Additionally, when using this capability, in order to avoid data duplication, this period must be set longer than the Processor's Run Schedule, and the Processor must not be stopped after the file being tailed has been rolled over and before the data has been fully consumed. Otherwise, the data may be duplicated, as the entire file may be written out as the contents of a single FlowFile.

Display Name

Post-Rollover Tail Period

Description

When a file is rolled over, the processor will continue tailing the rolled over file until it has not been modified for this amount of time. This allows for another process to rollover a file, and then flush out any buffered data. Note that when this value is set, and the tailed file rolls over, the new file will not be tailed until the old file has not been modified for the configured amount of time. Additionally, when using this capability, in order to avoid data duplication, this period must be set longer than the Processor's Run Schedule, and the Processor must not be stopped after the file being tailed has been rolled over and before the data has been fully consumed. Otherwise, the data may be duplicated, as the entire file may be written out as the contents of a single FlowFile.

API Name

Post-Rollover Tail Period

Default Value

0 sec

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Pre-Allocated Buffer Size
Sets the amount of memory that is pre-allocated for each tailed file.

Display Name

Pre-Allocated Buffer Size

Description

Sets the amount of memory that is pre-allocated for each tailed file.

API Name

pre-allocated-buffer-size

Default Value

65536 B

Expression Language Scope

Not Supported

Sensitive

false

Required

true
Reread when NUL encountered
If this option is set to 'true', when a NUL character is read, the processor will yield and try to read the same part again later. (Note: Yielding may delay the processing of other files tailed by this processor, not just the one with the NUL character.) The purpose of this flag is to allow users to handle cases where reading a file may return temporary NUL values. NFS for example may send file contents out of order. In this case the missing parts are temporarily replaced by NUL values. CAUTION! If the file contains legitimate NUL values, setting this flag causes this processor to get stuck indefinitely. For this reason users should refrain from using this feature if they can help it and try to avoid having the target file on a file system where reads are unreliable.
Display Name

Reread when NUL encountered

Description

If this option is set to 'true', when a NUL character is read, the processor will yield and try to read the same part again later. (Note: Yielding may delay the processing of other files tailed by this processor, not just the one with the NUL character.) The purpose of this flag is to allow users to handle cases where reading a file may return temporary NUL values. NFS for example may send file contents out of order. In this case the missing parts are temporarily replaced by NUL values. CAUTION! If the file contains legitimate NUL values, setting this flag causes this processor to get stuck indefinitely. For this reason users should refrain from using this feature if they can help it and try to avoid having the target file on a file system where reads are unreliable.

API Name

reread-on-nul

Default Value

false

Allowable Values
- true
- false
Expression Language Scope

Not Supported

Sensitive

false

Required

false
Rolling Filename Pattern
If the file to tail "rolls over" as would be the case with log files, this filename pattern will be used to identify files that have rolled over so that if NiFi is restarted, and the file has rolled over, it will be able to pick up where it left off. This pattern supports wildcard characters * and ?, it also supports the notation ${filename} to specify a pattern based on the name of the file (without extension), and will assume that the files that have rolled over live in the same directory as the file being tailed. The same glob pattern will be used for all files.

Display Name

Rolling Filename Pattern

Description

If the file to tail "rolls over" as would be the case with log files, this filename pattern will be used to identify files that have rolled over so that if NiFi is restarted, and the file has rolled over, it will be able to pick up where it left off. This pattern supports wildcard characters * and ?, it also supports the notation ${filename} to specify a pattern based on the name of the file (without extension), and will assume that the files that have rolled over live in the same directory as the file being tailed. The same glob pattern will be used for all files.

API Name

Rolling Filename Pattern

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Base directory
Base directory used to look for files to tail. This property is required when using Multifile mode.

Display Name

Base directory

Description

Base directory used to look for files to tail. This property is required when using Multifile mode.

API Name

tail-base-directory

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

false
Tailing mode
Mode to use: single file will tail only one file, multiple file will look for a list of file. In Multiple mode the Base directory is required.
Display Name

Tailing mode

Description

Mode to use: single file will tail only one file, multiple file will look for a list of file. In Multiple mode the Base directory is required.

API Name

tail-mode

Default Value

Single file

Allowable Values
- Single file
- Multiple files
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Lookup frequency
Only used in Multiple files mode. It specifies the minimum duration the processor will wait before listing again the files to tail.

Display Name

Lookup frequency

Description

Only used in Multiple files mode. It specifies the minimum duration the processor will wait before listing again the files to tail.

API Name

tailfile-lookup-frequency

Default Value

10 minutes

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Maximum age
Only used in Multiple files mode. It specifies the necessary minimum duration to consider that no new messages will be appended in a file regarding its last modification date. This should not be set too low to avoid duplication of data in case new messages are appended at a lower frequency.

Display Name

Maximum age

Description

Only used in Multiple files mode. It specifies the necessary minimum duration to consider that no new messages will be appended in a file regarding its last modification date. This should not be set too low to avoid duplication of data in case new messages are appended at a lower frequency.

API Name

tailfile-maximum-age

Default Value

24 hours

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Recursive lookup
When using Multiple files mode, this property defines if files must be listed recursively or not in the base directory.
Display Name

Recursive lookup

Description

When using Multiple files mode, this property defines if files must be listed recursively or not in the base directory.

API Name

tailfile-recursive-lookup

Default Value

false

Allowable Values
- true
- false
Expression Language Scope

Not Supported

Sensitive

false

Required

true

State Management

Scopes	Description
CLUSTER, LOCAL	Stores state about where in the Tailed File it left off so that on restart it does not have to duplicate data. State is stored either local or clustered depend on the <File Location> property.

Restrictions

Required Permission	Explanation
read filesystem	Provides operator the ability to read from any file that NiFi has access to.

Relationships

Name	Description
success	All FlowFiles are routed to this Relationship.

Writes Attributes

Name	Description
tailfile.original.path	Path of the original file the flow file comes from.

TailFile 2.5.0

TailFile

Introduction

Modes

Rolling filename pattern

Descriptions for different modes and strategies

Handling Multi-Line Messages