I’m going to provide a sample text file and a sample Excel VBA macro file for you to test. You can find the download below.
Structured Data is Nice but Not Always Possible
It would be nice if every system could export data in more structured formats, like XML or JSON, rather than CSV.
But, alas, that is not the case.
This is a real-world solution we’re using at a client right now.
Our client's system, PowerSchool, is a student/class management solution. We are building a transportation analysis/notification system.
It requires a daily, and possibly multiple-times-a-day, export of data. This data is read into an application that flags changes, provides specialized reporting based on certain conditions, and determines the distance from each student’s home address to their school using the Google Maps Distance Matrix API.
The challenge with exporting and then importing CSV files occurs with text fields. These fields can often contain commas, quotation marks, tabs, and other common delimiters for a CSV file.
The result can be misaligned fields and even ghost records: data created when part of a record is seen as a new record due to line breaks and other anomalies.
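To see the failure mode concretely, here is a minimal VBA sketch (the record and field values below are illustrative, not from the client data) showing how a naive comma split miscounts fields when a field itself contains a comma:

```vba
Sub DemoNaiveSplit()
    ' A record whose "parent" field contains an embedded comma.
    Dim strLine As String
    Dim aryFields() As String
    strLine = "90154,Sara Smith,Mary Smith, aunt:,555-310-3333"
    aryFields = Split(strLine, ",")
    ' We expect 4 fields, but the embedded comma produces 5,
    ' shifting every field after "parent" one position to the right.
    Debug.Print UBound(aryFields) + 1   ' prints 5, not 4
End Sub
```

Line breaks inside a field are worse still: each fragment arrives as a separate line, which is where the ghost records come from.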
I’m going to show a sampling of data that presents problems to our import process. Let’s assume we are exporting the following fields: student_id, fullname, parent, home_phone.
The actual data is a bit different but for our example, this will work.
In the system, the parent and home_phone fields are freeform. Normally, the parent field would contain only a name and home_phone only a number, but we discovered examples where additional text had been entered into either field.
For instance, under parent the data might be:
Mary Smith, aunt:
Home phone: 555-1212, “do not call after 6pm”
cell phone: 555-2212
Similar problems were found in the home_phone field.
In these cases, the export might look like:
Original text export
Student_id, fullname, parent, home_phone
90990,Mark Jones, Fred Jones, 555-555-4455
90382,Blaine Bradshaw, Jane Bradshaw, 555-444-2222
90154,Sara Smith, Mary Smith, aunt:
Home phone: 555-1212, “do not call after 6pm”, 555-310-3333
90218,Patricia Thompson, Craig Thompson, 555-765-6655
In the third record above, Sara Smith, you can see that the line breaks after “Mary Smith, aunt:” because that is what was entered in the parent field. This is problematic with CSV files.
Why XML works
In XML, data has a superior, more well-defined structure: rows are wrapped in a row element and each field in its own field element. For instance, the parent field from the problem record above might be stored as:
<parent>Mary Smith, aunt:
Home phone: 555-1212, “do not call after 6pm”</parent>
But I digress. While PowerSchool has the ability to export certain reports as XML, their scheduled export routine does not seem to have this ability.
Use a Non-standard Delimiter
Rather than use the standard delimiters for our CSV, we used a string of characters that was HIGHLY unlikely to ever appear in our file: [F]. The export then looks like this:
Text file with non-standard delimiter
student_id[F]fullname[F]parent[F]home_phone[F]student_id
90990[F]Mark Jones[F]Fred Jones[F]555-555-4455[F]90990
90382[F]Blaine Bradshaw[F]Jane Bradshaw[F]555-444-2222[F]90382
90154[F]Sara Smith[F]Mary Smith, aunt:
Home phone: 555-1212, "do not call after 6pm"[F]555-310-3333[F]90154
90218[F]Patricia Thompson[F]Craig Thompson[F]555-765-6655[F]90218
So each record ends with a line break, but for records that contain internal line breaks, the [F] delimiter lets us determine the actual end of the record. I also include the student_id field again at the end of the record. That field NEVER contains a line break and serves as a checksum of sorts.
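The delimiter test relies on simple string arithmetic: compare the length of the line with and without the delimiter. As a sketch, it can be isolated into a small helper (CountDelims is my own name for illustration; the full code later in the article inlines the same Len/Replace arithmetic):

```vba
' Counts how many times strDelim occurs in strLine by comparing the
' string's length with and without the delimiter, divided by the
' delimiter's own length.
Function CountDelims(strLine As String, strDelim As String) As Long
    CountDelims = (Len(strLine) - Len(Replace(strLine, strDelim, ""))) _
                  / Len(strDelim)
End Function

Sub DemoCountDelims()
    ' A complete 5-field record has exactly 4 delimiters.
    Debug.Print CountDelims("90990[F]Mark Jones[F]Fred Jones[F]555-555-4455[F]90990", "[F]")   ' prints 4
    ' A fragment cut off by an embedded line break has fewer.
    Debug.Print CountDelims("90154[F]Sara Smith[F]Mary Smith, aunt:", "[F]")                   ' prints 2
End Sub
```

A count below 4 means the record continues on the next line, which is the signal to keep reading and concatenating.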
Reading vs. Importing Using the FileSystemObject
Additionally, because we are not only importing but also checking each record against what is currently in the database, I do not import the data directly; rather, I read the file and process each record. To do so, I use the FileSystemObject from the Microsoft Scripting Runtime.
- I read each line into a string variable and count the number of times “[F]” appears in the string. In the above example, it should appear 4 times (5 fields per record). If I do not find 4 instances of the “[F]” delimiter, I read the next line, concatenate it to the previous line, and check again for the number of “[F]” delimiters in the string variable.
- I repeat this until 4 instances are found in the concatenated string.
- Once I’ve determined that I have a full record in my string variable, I perform a SPLIT function to create an array of the values.
- With that array, I then determine whether the student_id is in the current database. If not, I add it.
- If it is already in the database, I compare critical fields and set certain flags based on changes.
Remember, I’m only showing a small sampling above of the data we are evaluating. We check for changes in street address, the status of a student’s IEP, the school they attend, and several other fields. Also, I’m doing this with an Access front-end and a SQL Express database, but I’ve included an Excel macro sheet with two text file examples.
Below is the VBA code I use to read through a file.
Sub SelectFileandRun()
    ' Run this from a button on the control sheet.
    Dim strFile As String
    With Application.FileDialog(msoFileDialogFilePicker)
        .AllowMultiSelect = False
        .Filters.Add "Import files", "*.csv; *.txt"
        If .Show = -1 Then   ' only read the selection if the user didn't cancel
            strFile = .SelectedItems.Item(1)
        End If
    End With
    If strFile <> "" Then
        ParseTextFile strFile
    End If
End Sub

Sub ParseTextFile(inFile As String)
    ' inFile is the text file you are reading.
    Dim wb As Workbook, ws1 As Worksheet, ws2 As Worksheet
    Dim fso As FileSystemObject
    Dim stfile As TextStream
    Dim strDelim As String, strLine As String
    Dim aryHdr() As String, aryData() As String
    Dim intLines As Long, intTotRows As Long, intLoop As Long
    Dim intCtflds As Long, chc As Long, x As Long
    Dim strStudent_number As String, strFullname As String
    Dim strParent As String, strHome_phone As String

    Set fso = New FileSystemObject
    Set stfile = fso.OpenTextFile(inFile, ForReading)

    ' I do this just to get a count of total lines. It takes almost no time and
    ' allows me to provide status updates in MS Access using the command meter
    ' or in Excel using the status bar.
    Do While stfile.AtEndOfStream <> True
        stfile.SkipLine
    Loop
    intLines = stfile.Line

    chc = MsgBox("You are about to import " & intLines & " lines of student data." & vbCrLf & _
                 "Do you want to continue?", vbYesNo, "Student Data Import")
    stfile.Close

    ' We'll use this later. We could make this a constant or even define it during an import.
    strDelim = "[F]"

    If chc <> vbYes Then
        MsgBox "Process cancelled!"
    Else
        Set wb = ActiveWorkbook
        Set ws1 = wb.Sheets("Control Sheet")
        Set ws2 = wb.Sheets.Add(after:=ws1)
        ws2.Activate   ' go to that sheet

        ' Re-read file from start.
        Set stfile = fso.OpenTextFile(inFile, ForReading)

        ' Read first line and get headers. I don't use them in my client
        ' process but will with a new sheet in Excel.
        aryHdr = Split(stfile.ReadLine, strDelim)
        ws2.Range("A1").Select
        For x = 0 To UBound(aryHdr)
            ActiveCell.Offset(0, x).Value = aryHdr(x)
        Next

        ' Where I updated the meter in Access:
        'SysCmd acSysCmdInitMeter, "importing...", intLines

        ' Process all the lines of data.
        intTotRows = 0
        intLoop = 0   ' used for DoEvents and updating the status meter periodically
        Do While stfile.AtEndOfStream <> True
            intTotRows = intTotRows + 1
            intLoop = intLoop + 1
            If intLoop >= 50 Then
                ' This causes Access/Excel to repaint properly and avoid a "not responding" message.
                DoEvents
                intLoop = 0
                'SysCmd acSysCmdUpdateMeter, intTotRows
            End If

            ' Read a line. If the number of delimiters is less than 4, read the next line as well and concatenate.
            strLine = stfile.ReadLine

            ' This calculates the number of times my delimiting string is found in a string.
            ' It takes the total characters in the string and subtracts the total characters
            ' in the string with the delimiter removed. It divides that by the length of the
            ' delimiter, giving you the number of times that delimiter appears in the string.
            ' For instance:
            '   90990[F]Mark Jones[F]Fred Jones[F]555-555-4455[F]90990
            ' is 54 characters. Remove the [F] and it is 42 characters: a 12-character difference.
            ' Divide 12 by the length of "[F]" (3) and you get 4.
            ' So we will ensure that a record has 4 delimiters before we process it.
            intCtflds = (Len(strLine) - Len(Replace(strLine, strDelim, ""))) / Len(strDelim)
            Do While intCtflds < 4
                ' If the line we are evaluating has fewer than 4 delimiters, read the next
                ' line and check again. Do this until you reach 4 delimiters.
                strLine = strLine & vbCrLf & stfile.ReadLine
                intCtflds = (Len(strLine) - Len(Replace(strLine, strDelim, ""))) / Len(strDelim)
            Loop

            ' Once you have 4 delimiters, create an array with the fields separated.
            aryData = Split(strLine, strDelim)

            ' LEGEND
            ' 0 - Student_number
            ' 1 - fullname
            ' 2 - parent
            ' 3 - home_phone

            ' Retrieve all data from the import file record.
            strStudent_number = Trim(aryData(0))
            strFullname = Trim(aryData(1))
            strParent = Trim(aryData(2))
            strHome_phone = Trim(aryData(3))

            ' In our database code, we check to see if the student ID exists in the database.
            ' If not, we create a new record with the retrieved values.
            ' If it does, we check the existing values against the new values, flag records
            ' that have changed, and update those values.
            ' I also write a log file with each change.
            ' In this example, I am simply writing the data to the new sheet created above.
            ActiveCell.Offset(intTotRows, 0).Value = strStudent_number
            ActiveCell.Offset(intTotRows, 1).Value = strFullname
            ActiveCell.Offset(intTotRows, 2).Value = strParent
            ActiveCell.Offset(intTotRows, 3).Value = strHome_phone
            ActiveCell.Offset(intTotRows, 4).Value = strStudent_number
        Loop

        ws1.Range("B5").Value = inFile
        ws1.Range("B6").Value = intTotRows
        ws1.Range("B7").Value = Now()

        ' I normally update an import log table with the number of records
        ' imported, added, changed, and with no change:
        'rsI("TotalRows") = intTotRows
        'rsI("TotalAdds") = intTotadds
        'rsI("TotalEdits") = intTotEdits
        'rsI("TotalNC") = intTotNC
        'rsI.Update
        ' Instead, I update the control sheet as shown above.
        'SysCmd acSysCmdRemoveMeter
        MsgBox "Import complete!"
    End If

    Set stfile = Nothing
    Set fso = Nothing
End Sub
Make sure to include a reference to the Microsoft Scripting Runtime or, if using late binding:
Dim fso As Object
Set fso = CreateObject("Scripting.FileSystemObject")
If this was helpful to you:
- Let me know. I’d love to hear from you.
- Consider sharing on your social network of choice.
- Find out about my other activities.
- Refer or hire me as an automation consultant.
- Most important: enjoy, learn and automate.