ListGoogleDrive

ListGoogleDrive 2.5.0

Bundle: org.apache.nifi | nifi-gcp-nar
Description: Lists the concrete files (shortcuts are ignored) in a Google Drive folder. If the 'Record Writer' property is set, a single output FlowFile is created and each file in the listing is written to it as one record. Otherwise, an individual FlowFile is created for each file in the listing, with the file's metadata written as FlowFile attributes. This Processor is designed to run on the Primary Node only in a cluster. If the Primary Node changes, the new Primary Node picks up where the previous node left off without duplicating all of the data. Please see Additional Details to set up access to Google Drive.
Tags: drive, google, storage
Input Requirement: FORBIDDEN
Supports Sensitive Dynamic Properties: false

Additional Details for ListGoogleDrive 2.5.0

Accessing Google Drive from NiFi

This processor uses Google Cloud credentials to authenticate to Google Drive. The following steps are required to prepare the Google Cloud and Google Drive accounts for the processor:
1. Enable the Google Drive API in Google Cloud
  - Follow the instructions at https://developers.google.com/workspace/guides/enable-apis and search for ‘Google Drive API’.
2. Grant access to the Google Drive folder
  - In the Google Cloud Console, navigate to IAM & Admin -> Service Accounts.
  - Note the email address of the service account you are going to use.
  - Navigate to the folder to be listed in Google Drive.
  - Right-click the folder -> Share.
  - Enter the service account email address.
3. Find the Folder ID
  - Navigate to the folder to be listed in Google Drive and open it. The ID is the last part of the URL shown in your browser. For example, if the URL is https://drive.google.com/drive/folders/1trTraPVCnX5_TNwO8d9P_bz278xWOmGm, the Folder ID is 1trTraPVCnX5_TNwO8d9P_bz278xWOmGm.
4. Set the Folder ID in the ‘Folder ID’ property
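
Step 3 can be sketched as a small helper that pulls the Folder ID out of a browser URL. This is only an illustration: the function name folder_id_from_url is hypothetical and not part of NiFi, and it assumes the URL has the /folders/<id> shape shown in the example above.

```python
from urllib.parse import urlparse


def folder_id_from_url(url: str) -> str:
    """Return the Folder ID, i.e. the path segment that follows 'folders'.

    Hypothetical helper for illustration; assumes a URL shaped like the
    example in step 3 (.../drive/folders/<id>).
    """
    # urlparse separates the path from any query string such as ?usp=sharing
    path = urlparse(url).path
    segments = [segment for segment in path.split("/") if segment]
    if len(segments) < 2 or segments[-2] != "folders":
        raise ValueError(f"not a Google Drive folder URL: {url}")
    return segments[-1]


if __name__ == "__main__":
    print(folder_id_from_url(
        "https://drive.google.com/drive/folders/1trTraPVCnX5_TNwO8d9P_bz278xWOmGm"
    ))
```

The returned value is what goes into the ‘Folder ID’ property.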

Properties

Connect Timeout

Display Name

Connect Timeout

Description

Maximum wait time for connection to Google Drive service.

API Name

connect-timeout

Default Value

20 sec

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

true
Entity Tracking Initial Listing Target
Display Name

Entity Tracking Initial Listing Target

Description

Specifies how the initial listing should be handled. Used by the 'Tracking Entities' strategy.

API Name

et-initial-listing-target

Default Value

all

Allowable Values
- Tracking Time Window
- All Available
Expression Language Scope

Not Supported

Sensitive

false

Required

false

Dependencies
- Listing Strategy is set to any of [entities]
Entity Tracking State Cache
Display Name

Entity Tracking State Cache

Description

Listed entities are stored in the specified cache storage so that this processor can resume listing after a NiFi restart or a primary node change. The 'Tracking Entities' strategy requires tracking information for all entities listed within the last 'Tracking Time Window'. To support a large number of entities, the strategy uses DistributedMapCache instead of managed state. The cache key format is 'ListedEntities::{processorId}(::{nodeId})'. If listed entities are tracked per node, the optional '::{nodeId}' part is added so that state is managed separately for each node. For example, a cluster-wide cache key is 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b' and a per-node cache key is 'ListedEntities::8dda2321-0164-1000-50fa-3042fe7d6a7b::nifi-node3'. The stored cache content is a gzipped JSON string. The cache key is deleted when the target listing configuration is changed. Used by the 'Tracking Entities' strategy.

API Name

et-state-cache

Service Interface

org.apache.nifi.distributed.cache.client.DistributedMapCacheClient

Service Implementations

org.apache.nifi.hazelcast.services.cacheclient.HazelcastMapCacheClient

org.apache.nifi.distributed.cache.client.MapCacheClientService

org.apache.nifi.redis.service.RedisDistributedMapCacheClientService

org.apache.nifi.redis.service.SimpleRedisDistributedMapCacheClientService

Expression Language Scope

Not Supported

Sensitive

false

Required

false

Dependencies
- Listing Strategy is set to any of [entities]
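
The cache key format and the gzipped JSON content described for 'Entity Tracking State Cache' can be illustrated with a short sketch. The function names and the shape of the JSON payload are assumptions made for the example. Only the key format and the fact that the content is gzipped JSON come from the description above.

```python
import gzip
import json
from typing import Optional


def listed_entities_cache_key(processor_id: str, node_id: Optional[str] = None) -> str:
    """Build a key in the form 'ListedEntities::{processorId}(::{nodeId})'."""
    key = f"ListedEntities::{processor_id}"
    if node_id:
        # The node part is only present when entities are tracked per node.
        key += f"::{node_id}"
    return key


def encode_cache_value(entities: dict) -> bytes:
    """Serialize to JSON and gzip it (the payload layout is hypothetical)."""
    return gzip.compress(json.dumps(entities).encode("utf-8"))


def decode_cache_value(blob: bytes) -> dict:
    """Reverse of encode_cache_value."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


if __name__ == "__main__":
    processor_id = "8dda2321-0164-1000-50fa-3042fe7d6a7b"
    print(listed_entities_cache_key(processor_id))
    print(listed_entities_cache_key(processor_id, "nifi-node3"))
```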
Entity Tracking Time Window
Display Name

Entity Tracking Time Window

Description

Specifies how long this processor should track already-listed entities. The 'Tracking Entities' strategy can pick any entity whose timestamp is inside the specified time window. For example, if set to '30 minutes', any entity with a timestamp within the last 30 minutes is a listing target when this processor runs. A listed entity is considered 'new/updated', and a FlowFile is emitted, if any of the following conditions is met: 1. it does not exist in the already-listed entities, 2. it has a newer timestamp than the cached entity, 3. it has a different size than the cached entity. If a cached entity's timestamp becomes older than the specified time window, that entity is removed from the cached already-listed entities. Used by the 'Tracking Entities' strategy.

API Name

et-time-window

Default Value

3 hours

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

false

Dependencies
- Listing Strategy is set to any of [entities]
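
The three 'new/updated' conditions and the eviction rule described for 'Entity Tracking Time Window' can be sketched as follows. This is a simplified model, not the processor's actual code: the Entity class, the in-memory dict standing in for the cache, and the millisecond timestamps are all assumptions, and the 'Entity Tracking Initial Listing Target' setting is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Entity:
    """Hypothetical stand-in for a listed file."""
    identifier: str
    timestamp_ms: int
    size: int


def is_new_or_updated(entity: Entity, cached: Optional[Entity]) -> bool:
    if cached is None:
        return True  # 1. not among the already-listed entities
    if entity.timestamp_ms > cached.timestamp_ms:
        return True  # 2. newer timestamp than the cached entity
    return entity.size != cached.size  # 3. different size than the cached entity


def list_once(listing: Iterable[Entity], cache: Dict[str, Entity],
              now_ms: int, window_ms: int) -> List[Entity]:
    """Return the entities to emit and update the cache in place."""
    window_start = now_ms - window_ms
    emitted = []
    for entity in listing:
        if entity.timestamp_ms < window_start:
            continue  # outside the tracking time window, not a listing target
        if is_new_or_updated(entity, cache.get(entity.identifier)):
            emitted.append(entity)
            cache[entity.identifier] = entity
    # Drop cached entities whose timestamp fell out of the time window.
    for key in [k for k, e in cache.items() if e.timestamp_ms < window_start]:
        del cache[key]
    return emitted
```

Running list_once twice over an unchanged listing emits nothing the second time, which is the de-duplication the strategy is meant to provide.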
Folder ID

Display Name

Folder ID

Description

The ID of the folder whose files are to be listed. Please see Additional Details to set up access to Google Drive and obtain the Folder ID. WARNING: Unauthorized access to the folder is treated as if the folder were empty. As a result, the processor creates no outgoing FlowFiles, and no additional error message is provided.

API Name

folder-id

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

true
GCP Credentials Provider Service

Display Name

GCP Credentials Provider Service

Description

The Controller Service used to obtain Google Cloud Platform credentials.

API Name

gcp-credentials-provider-service

Service Interface

org.apache.nifi.gcp.credentials.service.GCPCredentialsService

Service Implementations

org.apache.nifi.processors.gcp.credentials.service.GCPCredentialsControllerService

Expression Language Scope

Not Supported

Sensitive

false

Required

true
Listing Strategy
Display Name

Listing Strategy

Description

Specifies how to determine new/updated entities. See each strategy's description for details.

API Name

listing-strategy

Default Value

timestamps

Allowable Values
- Tracking Timestamps
- Tracking Entities
- Time Window
- No Tracking
Expression Language Scope

Not Supported

Sensitive

false

Required

true
Minimum File Age

Display Name

Minimum File Age

Description

The minimum age a file must be in order to be considered; any files younger than this will be ignored.

API Name

min-age

Default Value

0 sec

Expression Language Scope

Not Supported

Sensitive

false

Required

true
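
The 'Minimum File Age' rule amounts to a simple age comparison, sketched below. The function name and the use of millisecond timestamps are assumptions, and which of a file's timestamps the processor actually compares is not stated in this description.

```python
def is_old_enough(file_timestamp_ms: int, now_ms: int, min_age_ms: int) -> bool:
    """Return True when the file is at least min_age_ms old.

    Hypothetical helper: files younger than the minimum age are ignored.
    """
    return now_ms - file_timestamp_ms >= min_age_ms
```

With the default of '0 sec', every file whose timestamp is not in the future passes the check.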
Proxy Configuration Service

Display Name

Proxy Configuration Service

Description

Specifies the Proxy Configuration Controller Service to proxy network requests.

API Name

proxy-configuration-service

Service Interface

org.apache.nifi.proxy.ProxyConfigurationService

Service Implementations

org.apache.nifi.proxy.StandardProxyConfigurationService

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Read Timeout

Display Name

Read Timeout

Description

Maximum wait time for response from Google Drive service.

API Name

read-timeout

Default Value

60 sec

Expression Language Scope

Environment variables defined at JVM level and system properties

Sensitive

false

Required

true
Record Writer

Display Name

Record Writer

Description

Specifies the Record Writer to use for creating the listing. If not specified, one FlowFile will be created for each entity that is listed. If the Record Writer is specified, all entities will be written to a single FlowFile instead of adding attributes to individual FlowFiles.

API Name

record-writer

Service Interface

org.apache.nifi.serialization.RecordSetWriterFactory

Service Implementations

org.apache.nifi.avro.AvroRecordSetWriter

org.apache.nifi.csv.CSVRecordSetWriter

org.apache.nifi.text.FreeFormTextRecordSetWriter

org.apache.nifi.json.JsonRecordSetWriter

org.apache.nifi.lookup.RecordSetWriterLookup

org.apache.nifi.record.script.ScriptedRecordSetWriter

org.apache.nifi.xml.XMLRecordSetWriter

Expression Language Scope

Not Supported

Sensitive

false

Required

false
Search Recursively
Display Name

Search Recursively

Description

When 'true', files from concrete sub-folders are also listed (shortcuts are ignored). Otherwise, only files whose direct parent is the folder specified by 'Folder ID' are returned. WARNING: The listing may fail if there are too many sub-folders (500+).

API Name

recursive-search

Default Value

true

Allowable Values
- true
- false
Expression Language Scope

Not Supported

Sensitive

false

Required

true

State Management

Scopes	Description
CLUSTER	The processor stores the data needed to keep track of which files have already been listed. What exactly is stored depends on the 'Listing Strategy'. State is stored across the cluster so that this Processor can run on the Primary Node only and, if a new Primary Node is selected, the new node can pick up where the previous node left off without duplicating the data.

Relationships

Name	Description
success	All FlowFiles that are received are routed to success

Writes Attributes

Name	Description
drive.id	The id of the file
filename	The name of the file
mime.type	The MIME type of the file
drive.size	The size of the file. Set to 0 when the file size is not available (e.g. externally stored files).
drive.size.available	Indicates if the file size is known / available
drive.timestamp	The last modified time or created time of the file, whichever is later. Both are considered because the original modified date of a file is preserved when it is uploaded to Google Drive, while 'created time' is the time of the upload. Uploaded files can, however, still be modified later.
drive.created.time	The file's creation time
drive.modified.time	The file's last modification time
drive.path	The path of the file's directory from the base directory. The path contains the folder names in URL encoded form because Google Drive allows special characters in file names, including '/' (slash) and '\' (backslash). The URL encoded folder names are separated by '/' in the path.
drive.owner	The owner of the file
drive.last.modifying.user	The last modifying user of the file
drive.web.view.link	Web view link to the file
drive.web.content.link	Web content link to the file
drive.parent.folder.id	The id of the file's parent folder
drive.parent.folder.name	The name of the file's parent folder
drive.listed.folder.id	The id of the base folder that was listed
drive.listed.folder.name	The name of the base folder that was listed
drive.shared.drive.id	The id of the shared drive (if the file is located on a shared drive)
drive.shared.drive.name	The name of the shared drive (if the file is located on a shared drive)
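
Two of the attributes above are derived values: drive.timestamp is the later of the created and modified times, and drive.path joins URL-encoded folder names with '/'. The sketch below illustrates both under stated assumptions. The function names are hypothetical, and Python's percent-encoding is used as a stand-in, so the processor's encoder may differ in details (for example in how spaces are encoded or whether the path has a leading slash).

```python
from datetime import datetime, timezone
from typing import List
from urllib.parse import quote, unquote


def drive_timestamp(created: datetime, modified: datetime) -> datetime:
    """Return whichever of the two times is later."""
    return max(created, modified)


def encode_drive_path(folder_names: List[str]) -> str:
    # safe="" makes quote() encode '/' too, so a slash inside a folder
    # name cannot be confused with the path separator.
    return "/".join(quote(name, safe="") for name in folder_names)


def decode_drive_path(path: str) -> List[str]:
    """Split on the separator first, then decode each folder name."""
    return [unquote(segment) for segment in path.split("/") if segment]


if __name__ == "__main__":
    uploaded = datetime(2024, 5, 1, tzinfo=timezone.utc)
    original_edit = datetime(2023, 1, 15, tzinfo=timezone.utc)
    print(drive_timestamp(created=uploaded, modified=original_edit))
    print(encode_drive_path(["reports", "2024/Q1", "a\\b"]))
```

Decoding must split on '/' before unquoting each segment. Unquoting the whole path first would turn an encoded slash back into a separator.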
