Importing data from flat files
Tips and tricks around importing data from flat text files from disk or ftp, using Flow as the extraction engine
In a situation where the only source of the data you need is flat text files (or XML files) and you need to import it into, for example, a database, it can be a challenge to figure out which tool to use and how to use it. There are many options out there, some of which exist more or less for exactly this purpose (Microsoft SSIS, Data Factory, etc.) and are more powerful for it. Still, there are situations where Flow is not a bad option. For example, if you already use Flow as the extraction tool for many data sources, you can keep using the same tool for your text/XML files. The ease with which Flow lets you modify the data and/or logic during the import, combined with the fact that you and your colleagues do not need to learn and monitor yet another tool, can be reasons to use Flow. If you end up going with Flow, this page gives some useful tips.
There are some prerequisites for the file itself for using Flow as your extraction engine:
Flow can only handle simple flat files, meaning plain text files, not .xls/.xlsx or other complex formats. Different delimiter characters (the character that splits the columns, e.g. tab, comma or semicolon) and different code pages are supported, though.
The number of columns must always be the same on every row. As we will see below, you can use your own headers, but the column count itself has to be consistent.
If any of the above prerequisites are not fulfilled, consider using another tool to import your data.
You also need to consider how the file will be uploaded to Flow. If it comes from a manual user step in a user Flow, no extra license is needed. However, if the file is on an FTP/SFTP server you need a separate license for that connector, and if it is on a file share you will need the File Connector license.
Create a user step with the File Gallery control, asking for a file to upload. It is always good to have a simple user Flow for troubleshooting, so creating one next to your more complex automated Flow is a good idea. To follow along in the script step below, create a plain text file with three columns separated by semicolons.
The next step will be our CSVFillTable Flow Script step, which we will take a closer look at.
If you are not familiar with these functions, please search for them in our help section where the latest functionality will be explained.
We will be using CSVFill (a Flow Script command) to split the delimited data into columns.
A nifty feature of Flow Script to use in conjunction with CSVFill is the handling of files whose columns have no headers, or whose headers contain illegal characters (including spaces). For those cases, create your own header string and concatenate it with your data using the &-sign, like this (for simplicity we are pretending our file consists of three columns: KeyCol, ExprCol1 and ExprCol2):
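A minimal sketch of what that concatenation could look like in the Flow Script step (the csv variable is assumed to hold the raw file content fetched earlier, and the exact assignment syntax may differ slightly in your Flow version):

    StringWithHeaders = "KeyCol;ExprCol1;ExprCol2" & chr(13) & csv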
We are creating our own three headers here, adding a carriage return (chr(13)) and then appending the file data that is in our csv variable (or whatever we have called it in the connector that fetches the file). Even if the file has simple column header names, it can be a good idea to use your own anyway, in case you copy the Flow for a file where you are not as lucky, or something changes in the import file.
As described in the help section, CSVFill also needs a prototype table:
and then the Result variable is assigned the result of the CSVFill function:
If the file has headers of its own you do not want to import them (if it does not, this step can be ignored, but it can still be good to leave a 0 at the end so the parameter is there if you copy this Flow as a template for another file import that does have headers):
So our entire Flow Script step will look something like this, returning our data in a table variable:
In a real-world scenario there could be quite a few columns to import, so for readability you may row-break the StringWithHeaders and the Prototype. That makes it much easier to spot a column missing between the two, or an extra delimiter character that has snuck its way in. Like this example:
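As a sketch (again assuming the three example columns and the csv variable from above, and that Flow Script accepts line breaks inside the expression), the row-broken header string could look like:

    StringWithHeaders = "KeyCol;" &
                        "ExprCol1;" &
                        "ExprCol2"
                        & chr(13) & csv

The Prototype would then list the same columns, in the same order, one per row.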
What if the columns in the text file are separated by tabs? Just typing a tab between the columns in the string will not work. Instead, use another character and then replace it with chr(9) (which is a tab):
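Since chr(9) is just another expression, a variation (a sketch with the same example columns) is to skip the replace and concatenate chr(9) directly where each tab should go:

    StringWithHeaders = "KeyCol" & chr(9) & "ExprCol1" & chr(9) & "ExprCol2" & chr(13) & csv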
Continue working with the user Flow created above and add some steps before the CSVFill script step. Although the goal is to create a machine Flow that runs on a schedule, it is good to have the user Flow for troubleshooting, and it makes building up the logic easier.
In the examples below we are assuming the file is on a path on the server (UNC-paths on fileservers are supported), but the logic is similar if the source is an FTP server.
We will start by setting some variables. Paths: where the file is fetched from and where it should be moved after completion. Search pattern: if multiple different files end up in the same directory, we can use wildcards to pick out only the files this Flow should deal with. Having these as variables makes it easy to duplicate the Flow for the next file type, or to create a more complex loop going over several patterns in the same Flow.
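As an illustration (the variable name PathProcessed and the actual paths and pattern are just examples; UNC paths on file servers are supported, as mentioned below), the assignments could look something like:

    Path          = "\\fileserver\import\incoming\"
    PathProcessed = "\\fileserver\import\processed\"
    SearchPattern = "sales_*.txt"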
So let us start by checking if there are any files in our Path, using the File connector (the same logic applies to FTP, as stated above).
Right after the step that fetches the file(s), add a decision step to exit the Flow if there are no files (and optionally update the log about it, and/or send an email).
If any files match our search pattern, loop over them one by one.
Let's create some variables extracted from the file (singular file, since we are inside the loop):
· FileName
· FileDate (sometimes the file name contains the creation date; here it is fetched with a mid function. If there is no date in the file name, create this variable after the next step, where we check the file modification metadata instead).
· FullPath, the path + the file name
· FullPathProcessed
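A sketch of those assignments, assuming FileName comes from the loop over the files found above, a file name like sales_20240401.txt where the date starts at position 7, and a mid function taking start position and length (adjust to your own naming convention):

    FileDate          = mid(FileName, 7, 8)
    FullPath          = Path & FileName
    FullPathProcessed = PathProcessed & FileName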
Using the File System connector again, it is possible to get some other metadata about the file, like creation date and last updated. Check this and store it in a FileMods record variable, and also store the FileDate in the database for each row fetched from the file, both for troubleshooting and to avoid duplicates.
If the text file does not have a unique name (a creation date or some other identifier), we can use the creation time as part of the name of the destination file when moving it later. That way there are copies of the imported files in the processed folder, which can be matched with the records in the database since this variable is saved there as well. Assigning FullPathProcessed then has to be done right after the Get File Modification step.
Text string variables can contain characters that are bad for our interpretation, for instance line breaks or double quotation marks, which will break the insert-into-table step later if not taken care of. This can be dealt with in different ways, but here a replace step is done before the CSVFill step. The example below replaces double quotation marks (") with an empty string (""). Chr(xx) can be used, since just typing a double quotation mark will not work in the Flow script; chr(34) is a double quotation mark.
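A sketch of that replace, assuming a string-replace function is available in Flow Script (the exact function name and argument order are an assumption; check the help section):

    csv = replace(csv, chr(34), "")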
This is where we put our knowledge about CSVFill, as described above and in the help section, to use. Create a Flow Script step that interprets your file variable and splits it into columns in a table variable.
For the user Flow (for building and testing), just create a user step after the step above with a data grid to present your data. If that works, create a machine Flow with almost the same name and copy over the non-user steps created so far. From now on we will be working in the machine Flow, with the goal of having it run automatically on a schedule.
In real life the data likely also needs to end up somewhere, e.g. an SQL database. When inserting into the destination table it is good practice to have a unique key column to work with, so that we know whether a row has been entered before.
If no such key exists, we have to work with the file's creation date or the file name, in combination with other columns in the file, to know whether the row has been entered before. Being able to re-run files that have gone wrong, or need re-running for other reasons, without having to worry about duplicates is very useful.
Apart from the column headers in the file, I add SourceFileCreationDT, SourceFileName and RowCreatedDt as columns in all my import tables (inserting the values from the File Modification step). RowCreatedDt defaults to the time of insert (getdate() in SQL Server) to automatically track when the line is added.
It can differ from case to case whether you want to import the values all the way to where they are supposed to end up, or work with an intermediate import table. If using an import table, it may be tempting to use a generic datatype that accepts all values (like nvarchar in SQL) for every column, to avoid issues at import time and handle conversions once the data is in the database. The downside is that anything can then get imported (perhaps a column has been added to the text file, shifting every row by one column), so it may be better to let the job fail than to import the wrong data. The recommendation is therefore to use some date/datetime and/or int/decimal columns, so that a misaligned column is caught at import.
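Sticking with the three example columns used earlier, an import table along those lines could look something like this in SQL Server (the table name and datatypes are just examples; match them to your actual file):

    CREATE TABLE dbo.Import_MyFile (
        KeyCol               int            NOT NULL,  -- specific datatype catches column misalignment
        ExprCol1             nvarchar(100)  NULL,
        ExprCol2             decimal(18,2)  NULL,
        SourceFileCreationDT datetime2(0)   NULL,      -- from the Get File Modification step
        SourceFileName       nvarchar(255)  NULL,
        RowCreatedDt         datetime2(0)   NOT NULL DEFAULT (getdate())  -- set automatically at insert
    );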
Creating the import tables is easy if you only have a few columns. If there are many columns in the file, the SQL Server import tool can be handy for generating the table and importing the file the first time:
Once the table is created with the right column datatypes and names, you can drag the Columns folder into the query window of SQL Server Management Studio and, with some copying and pasting, create three lists of your columns: comma separated, semicolon separated, and with an @-sign prefix for the variables.
Make sure there are no spaces in the middle chunk, where the semicolons are, as the spaces would otherwise become part of the column names.
You will need the first two in the CSVFill script step, and the first and the last for the SQL insert script.
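For the small example table above, the three lists would look something like this:

    KeyCol, ExprCol1, ExprCol2
    KeyCol;ExprCol1;ExprCol2
    @KeyCol, @ExprCol1, @ExprCol2

and the first and the last then slot into an insert statement along these lines (a sketch, assuming the Flow variables are passed as parameters named the same as the columns, with the metadata columns from the File Modification step added at the end):

    INSERT INTO dbo.Import_MyFile (KeyCol, ExprCol1, ExprCol2, SourceFileCreationDT, SourceFileName)
    VALUES (@KeyCol, @ExprCol1, @ExprCol2, @SourceFileCreationDT, @SourceFileName);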
Pro tip: highlighting one or more rows before pressing Ctrl+H (search and replace) and selecting "Selection" will only change text within the highlighted row(s).
(An alternative way to create the table, if we already have the columns in a list, is to use the Transpose function in Excel.)
Normal Excel functions can then be used to fill commas down below the first column and to manage the datatypes for the columns. After a couple of files you will have honed the process that fits you best; these are just some options.
Sidenote: if you are migrating data from Oracle into an MS SQL database you can script the tables from Oracle, but remember to change some of the datatypes. For example, replace these words in the script (being careful to only change datatypes, not column names):
"Number(" -> "Decimal(" (be sure to include the parenthesis for this one)
"Number" -> "Bigint" (as Number in Oracle can be larger than Int, Bigint is safer)
"Date" -> "Datetime2(0)"
"Varchar2(" -> "Nvarchar("
Another piece of helpful SQL is to generate an update statement from the sysobjects tables in SQL Server (this requires permission to read the system tables). It could look like this (the result has to be modified slightly, but if you have many columns and have named your variables the same as the columns, it saves time):
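A sketch of such a generator (the table name is the example one from above; each output row is one line of the SET part, and the trailing comma on the last line has to be removed by hand):

    SELECT sc.name + ' = @' + sc.name + ','
    FROM   sysobjects so
    JOIN   syscolumns sc ON sc.id = so.id
    WHERE  so.name = 'Import_MyFile'
    ORDER  BY sc.colorder;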
And similarly, something like this for the null handling:
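What the null handling looks like depends on the table, but one common variant is to turn empty strings from the text file into proper NULLs; a generator for that could be sketched as:

    SELECT sc.name + ' = NULLIF(@' + sc.name + ', ''''),'
    FROM   sysobjects so
    JOIN   syscolumns sc ON sc.id = so.id
    WHERE  so.name = 'Import_MyFile'
    ORDER  BY sc.colorder;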
To be able to monitor the import jobs, let us use a log table. Note the scope_identity() used to get the LogId as an output variable in the step at the start of the Flow that logs that the job has started:
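A sketch of what that could look like (the log table and its columns are example names):

    -- Log table, created once
    CREATE TABLE dbo.ImportJobLog (
        LogId     int IDENTITY(1,1) PRIMARY KEY,
        JobName   nvarchar(100) NOT NULL,
        StartTime datetime2(0)  NOT NULL DEFAULT (getdate()),
        EndTime   datetime2(0)  NULL,
        DurationS int           NULL,
        Status    nvarchar(20)  NOT NULL DEFAULT ('Started')
    );

    -- Step at the start of the Flow: log that the job has started and return the id
    INSERT INTO dbo.ImportJobLog (JobName) VALUES (@JobName);
    SELECT SCOPE_IDENTITY() AS LogId;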
Then, at the end of the job, create an update step that logs the time difference between the start time and the end time, and updates the status of the row to 'Ok':
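Continuing the sketch above, the update step could be something like:

    -- Step at the end of the Flow: log the duration and mark the job as Ok
    UPDATE dbo.ImportJobLog
    SET    EndTime   = getdate(),
           DurationS = DATEDIFF(second, StartTime, getdate()),
           Status    = 'Ok'
    WHERE  LogId = @LogId;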
A Kanban portlet is useful to see that all jobs have gone OK; you get the count of jobs at the top of each lane, which helps you catch a job that did not start at all. By sorting on status you can put failed jobs at the top and give them a different color.
The Scheduler portlet is good for making sure the jobs do not run into each other, and it gives a visualization of what is going on over the day.
And then perhaps a Table portlet, to be able to sort and search in the log table.
And of course, for important jobs an email can be sent out with the email connector in the Flows, containing information about failed and/or successful jobs.