Solving line breaks and other unstructured data when importing CSV files

I’m going to provide a sample text file and a sample Excel VBA macro file for you to test. You can find the download below.

Structured Data is Nice but Not Always Possible

It would nice if every system had the ability to export data in more structured formats, like XML or json, rather than CSV.

But, alas, that is not the case.

This is a real-world solution we’re using at a client right now.

Our client's system, PowerSchool, is a student/class management solution. We are building a transportation analysis/notification system.

It requires a daily - and possibly even multiple times a day - export of data. This data is read into an application that flags changes and based on certain conditions provides specialized reporting plus determines distances from a student’s home address to their school using Google Maps Distance Matrix API.

The challenge with exporting and then importing CSV files occurs with text fields. These fields can often contain commas, quotation marks, tabs, and other common delimiters for a CSV file. 

The result can be mis-aligned fields and even ghost records - data created when part of a record is seen as a new record due to line breaks and other anomalies.

I’m going to show a sampling of data that presents problems to our import process. Let’s assume we are exporting the following fields: student_id, fullname, parent, home_phone. 

The actual data is a bit different but for our example, this will work.

In the system, the parent & home_phone fields are freeform. Normally, the parent field would have a name only and home_phone a number only, what we discovered were examples were additional text was entered into either field.

For instance, under parent the data might be:

Mary Smith, aunt:
Home phone: 555-1212, “do not call after 6pm”
cell phone: 555-2212

Similar problems were found in the home_phone field. 

In these cases, the export might look like:

Original text export


Student_id, fullname, parent, home_phone
90990,Mark Jones, Fred Jones, 555-555-4455
90382,Blaine Bradshaw, Jane Bradshaw, 555-444-2222
90154,Sara Smith, Mary Smith, aunt: 
Home phone: 555-1212, “do not call after 6pm”, 555-310-3333
90218,Patricia Thompson, Craig Thompson, 555-765-6655

In the third record above, Sara Smith, you can see that the line breaks after “Mary Smith,aunt:” because that was what was entered in the “parent” field. 

This is problematic with csv files.

Why XML works

In XML, data has a superior (more well-defined) structure. Rows are contained in some type of row identifier and fields contained in a field identifier. For instance, the same row of data might be stored as:

<record>

   <student_id>90154</student_id>

   <fullname>Sara Smith</fullname>

   <parent>Mary Smith, aunt:

Home phone: 555-1212, “do not call after 6pm”</parent>

   <home_phone>555-310-3333</home_phone>

</record>

But I digress. While PowerSchool has the ability to export certain reports as XML, their scheduled export routine does not seem to have this ability.

Use a Non-standard Delimiter

Rather than use the standard delimiters for our CSV, we used a set of characters that were HIGHLY unlikely to ever appear in our file. In our case, we used:

[F]

Text file with non-standard delimiter


student_id[F]fullname[F]parent[F]home_phone[F]student_id
90990[F]Mark Jones[F]Fred Jones[F]555-555-4455[F]90990
90382[F]Blaine Bradshaw[F]Jane Bradshaw[F]555-444-2222[F]90382
90154[F]Sara Smith[F]Mary Smith, aunt:
Home phone: 555-1212, "do not call after 6pm"[F]555-310-3333[F]90154
90218[F]Patricia Thompson[F]Craig Thompson[F]555-765-6655[F]90218

 

So, each record ends with a line break but for those records where a line break is found inside a record, the [F] delimiter ensures we can determine the actual end of the record. I also include the student_id field at the end of the record. That field NEVER contains a line break and serves as a checksum of sorts.

Reading vs. Importing using the File Scripting Object

Additionally, because we are not only importing, but checking each record against what is currently in the database, I do not import the data but rather read the file and process each record. 

To do so, I use the File Scripting Object.

  • I read each line into a string variable and count of the number of times that “[F]” appears in the string. In the above example, it should appear 4 times (for 5 fields per record). If I do not find 4 instances of the “[F]” delimiter, I read the next line and concatenate it to the previous line and check again for the number of “[F]” delimiters in the string variable.
  • I repeat this until 4 instances are found in the concatenated string.
  • Once I’ve determined that I have a full record in my string variable, I perform a SPLIT function to create an array of the values. 
  • With that array, I then determine whether the student_id is in the current database. If not, I add it.
  • If it is already in the database, I compare critical fields and set certain flags based on changes.

Remember, I’m only showing a small sampling above of the data we are evaluating. We check for changes in string address, the status of a student’s IEP, the school they attend, and several other fields. Also, I’m doing this with an Access front-end and a SQL Express database but I’ve included an Excel macro sheet with two text file examples. 

Below is the VBA code I use to read through a file.

</pre>
Sub SelectFileandRun()
' run this from a button on the control sheet.
Dim strFile As String

With Application.FileDialog(msoFileDialogFilePicker)
.AllowMultiSelect = False
.Filters.Add "Import files", "*.csv; *.txt"
.Show
strFile = .SelectedItems.Item(1)
End With

If strFile <> "" Then
ParseTextFile (strFile)
End If

End Sub

Sub ParseTextFile(inFile)
'inFile is the text file you are reading.
Dim wb As Workbook, ws1 As Worksheet, ws2 As Worksheet
Dim fso As FileSystemObject
Dim stfile As TextStream

Set fso = New FileSystemObject
Set stfile = fso.OpenTextFile(inFile, ForReading)

'I do this just go get a count of total lines. It takes almost no time and allows me to provide status updates in MS Access using the command meter or in Excel using the status bar.
Do While stfile.AtEndOfStream <> True
stfile.SkipLine

Loop
intLines = stfile.Line

chc = MsgBox("You are about to lines " & intLines & " of student data." & vbCrLf & _
"Do you want to continue?", vbYesNo, "Student Data Import")
stfile.Close

'we'll use this later. We could make this a constant or even define it during an import.
strDelim = "[F]"

If chc <> 6 Then
MsgBox "Processed cancelled!"
Else
Set wb = ActiveWorkbook
Set ws1 = wb.Sheets("Control Sheet")
Set ws2 = wb.Sheets.Add(after:=ws1)
ws2.Activate 'go to that sheet

're-read file from start.
Set stfile = fso.OpenTextFile(inFile, ForReading)

' read first line and get headers.. I don't use them in my client process but will with a new sheet in Excel
aryHdr = Split(stfile.ReadLine, strDelim)
ws2.Range("A1").Select
For x = 0 To UBound(aryHdr)
ActiveCell.Offset(0, x).Value = aryHdr(x)
Next
'where I updated the meter in Access
'SysCmd acSysCmdInitMeter, "importing...", intLines

'process all the lines of data
intTotRows = 0
intLoop = 0 'use for doevents and updating status meter periodically
Do While stfile.AtEndOfStream <> True
intTotRows = intTotRows + 1
intLoop = intLoop + 1

If intLoop >= 50 Then 'this is to try to cause access/excel to repaint properly and avoid a not responsding message
DoEvents
intLoop = 0
'SysCmd acSysCmdUpdateMeter, intTotRows
End If

' read a line. If they number of fields is less than 22, read the next line as well and concatenate
strLine = stfile.ReadLine

' this calculates the number of times my delimiting string is found in a string
' it takes the total characters in the string and subracts the total characters in the string with the delimeter removed. It divides that by the length of the
' the delimter, giving you the numer of times that delimter appears in the string.
' for instance:
' 90990[F]Mark Jones[F]Fred Jones[F]555-555-4455[F]90990
' is 54 characters. Remove the [F] and it is 42 characters. 12 character difference.
' divide 12 by the length of "[F]" (3) and you get 4.
' So we will ensure that a record has 4 delimters before we process it.

intCtflds = (Len(strLine) - Len(Replace(strLine, strDelim, ""))) / Len(strDelim)
Do While intCtflds < 4
' if the line we are evaluating has fewer than 4 delimters, read the next line and check again. To this until you reach 4 delimeters.
strLine = strLine & vbCrLf & stfile.ReadLine
intCtflds = (Len(strLine) - Len(Replace(strLine, strDelim, ""))) / Len(strDelim)
Loop
'once you have 4 delimeters, create an array with the fields separated.
aryData = Split(strLine, strDelim)

'LEGEND
'0 - Student_number
'1 - fullname
'2 - parent
'3 - home_phone

'retrieve all data from import file record
strStudent_number = Trim(aryData(0))
strFullname = Trim(aryData(1))
strParent = Trim(aryData(2))
strHome_phone = Trim(aryData(3))

'in our database code, we check to see if the student ID exist in the database.
' if not, we create a new record with the retrieved values.
' If it does, we check the existing values againsg the new values, flag records that have changed, and update those values.
' I also write a log file with each change.

'in this example, I am simply writing the data to the new sheet created above.
ActiveCell.Offset(intTotRows, 0).Value = strStudent_number
ActiveCell.Offset(intTotRows, 1).Value = strFullname
ActiveCell.Offset(intTotRows, 2).Value = strParent
ActiveCell.Offset(intTotRows, 3).Value = strHome_phone
ActiveCell.Offset(intTotRows, 4).Value = strStudent_number

Loop
ws1.Range("B5").Value = inFile
ws1.Range("B6").Value = intTotRows
ws1.Range("B7").Value = Now()
'I normally update an import log table with the number of records imported, added, changed, and with no change.

'rsI("TotalRows") = intTotRows
'rsI("TotalAdds") = intTotadds
'rsI("TotalEdits") = intTotEdits
'rsI("TotalNC") = intTotNC
'rsI.Update

'instead I'll update the control sheet

' SysCmd acSysCmdRemoveMeter
MsgBox "Import complete!"

End If

Set stfile = Nothing
Set fso = Nothing

End Sub
<pre>

Make sure to include a reference to the Microsoft Scripting Runtime or, if using late binding:

As in:

Dim fso as Object
Set fso = CreateObject("Scripting.FileSytemObject")

Download


Download

If this was helpful to you:

  • Let me know. I’d love to hear from you.
  • Consider sharing on your social network of choice.
  • Find out about my other activities.
  • Refer or hire me as an automation consultant.
  • Most important: enjoy, learn and automate.

Thanks.
Matthew Moran

Posted in Consulting, Microsoft Access, Tips and Tools and tagged , , .

Leave a Reply

Your email address will not be published. Required fields are marked *