Skip to content

Commit 51e3af6

Browse files
committed
Improved CSVsubsetSplit method
Additionally, a bug with the delimiter sniffer that prevented dialects from being properly determined has been fixed.
1 parent e9f74af commit 51e3af6

12 files changed

+420
-12
lines changed

README.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -421,7 +421,7 @@ Visit [this site](https://ws-garcia.github.io/VBA-CSV-interface/limitations/csv_
421421

422422
## Licence
423423

424-
Copyright (C) 2020-2023 [W. García](https://github.com/ws-garcia/).
424+
Copyright (C) 2020-2024 [W. García](https://github.com/ws-garcia/).
425425

426426
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
427427

docs/api/methods/csvsubsetsplit.md

Lines changed: 10 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ Splits the CSV data into a set of files in which each piece has a related portio
1515

1616
## Syntax
1717

18-
*expression*.`CSVsubsetSplit`*(filePath, \[subsetColumns:= 1\], \[headers:= True\], \[repeatHeaders:= True\], \[streamSize:= 20])*
18+
*expression*.`CSVsubsetSplit`*(filePath, \[subsetColumns:= 1\], \[headers:= True\], \[repeatHeaders:= True\], \[streamSize:= 20\], \[oConfig:=Nothing\])*
1919

2020
### Parameters
2121

@@ -47,6 +47,10 @@ Splits the CSV data into a set of files in which each piece has a related portio
4747
<td style="text-align: left;"><em>streamSize</em></td>
4848
<td style="text-align: left;">Optional. Identifier specifying a <code>Long</code> Type variable representing the buffer size factor used to read the target CSV file.</td>
4949
</tr>
50+
<tr>
51+
<td style="text-align: left;"><em>oConfig</em></td>
52+
<td style="text-align: left;">Optional. Identifier specifying a <code>CSVparserConfig</code> Object variable holding all the configurations to parse the CSV file.</td>
53+
</tr>
5054
</tbody>
5155
</table>
5256

@@ -60,6 +64,11 @@ Splits the CSV data into a set of files in which each piece has a related portio
6064

6165
The `CSVsubsetSplit` method will create a file for each different value (data grouping) in the fields at the *subsetColumns* position, then all related data is appended to the respective file. Use the *headers* parameter to include a header record in each new CSV file. The *subsetColumns* parameter can be a single value or an array of `Long` values. When the CSV file has a header record and the user sets the *header* parameter to `False`, the header row is saved in a separate file and the rest of CSV files will have no header record. The user can control when to include the headers by using the *repeatHeaders* parameter.
6266

67+
>⚠️**Caution**
68+
>{: .text-grey-lt-000 .bg-green-000 }
69+
>The user shall verify that when subdividing a CSV file using a text field/column there are no records with special characters not supported by the file system of the operating system.
70+
{: .text-grey-dk-300 .bg-yellow-000 }
71+
6372
>📝**Note**
6473
>{: .text-grey-lt-000 .bg-green-000 }
6574
>The result subsets will be saved in a folder named [\*-WorkDir], where (\*) denotes the name of the source CSV file.

docs/home/getting_started.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -209,7 +209,7 @@ The images below shows the overall performance for the imports operations from t
209209
- The CSV syntax slow-down the performance. When the number of escaped fields are increased, the performance decrease.
210210

211211
## Licence
212-
Copyright (C) 2020-2023 [W. García](https://github.com/ws-garcia/).
212+
Copyright (C) 2020-2024 [W. García](https://github.com/ws-garcia/).
213213

214214
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
215215

src/Access_version.zip

170 Bytes
Binary file not shown.

src/All_Host_version.zip

125 Bytes
Binary file not shown.

src/CSVSniffer.cls

Lines changed: 6 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -486,19 +486,20 @@ End Function
486486
Private Function RecordScore(ByRef strArray As Variant) As Double
487487
Dim L0 As Long
488488
Dim tmpSUM As Double
489-
Dim FielCount As Long
489+
Dim SumLEN As Double
490+
Dim FieldsCount As Long
490491

491-
FielCount = 1 + UBound(strArray) - LBound(strArray)
492+
FieldsCount = 1 + UBound(strArray) - LBound(strArray)
492493
tmpSUM = 0
493494
For L0 = LBound(strArray) To UBound(strArray)
494495
Select Case DetectDataType(strArray(L0))
495496
Case FieldDataType.Known
496497
tmpSUM = tmpSUM + 100
497498
Case Else
498-
tmpSUM = tmpSUM + 20
499+
tmpSUM = tmpSUM + 0.1 '20
499500
End Select
500501
Next L0
501-
RecordScore = tmpSUM / FielCount
502+
RecordScore = (tmpSUM / FieldsCount)
502503
End Function
503504
''' <summary>
504505
''' Calculates a factor for table scoring based in the standard
@@ -540,7 +541,7 @@ Public Function TableScore(ByRef ArrayList As CSVArrayList) As Double
540541
For L0 = 0 To ArrayList.count - 1
541542
SumRecScores = SumRecScores + RecordScore(ArrayList(L0))
542543
Next L0
543-
TableScore = RecordsConsistencyFactor(ArrayList) * SumRecScores
544+
TableScore = RecordsConsistencyFactor(ArrayList) * SumRecScores / ArrayList.count
544545
End If
545546
End If
546547
End Function

src/CSVinterface.cls

Lines changed: 21 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -29,6 +29,8 @@ Private Const CHR_BACKSLASH As String = "\"
2929
Private Const CHR_DOUBLE_QUOTES As String = """"
3030
Private Const CHR_TILDE As String = "~"
3131
Private Const CHR_CARET As String = "^"
32+
Private Const CHR_LSQRB As String = "{"
33+
Private Const CHR_RSQRB As String = "}"
3234
'////////////////////////////////////////////////////////////////////////////////////////////
3335
'#
3436
'////////////////////////////////////////////////////////////////////////////////////////////
@@ -300,7 +302,8 @@ Public Function CSVsubsetSplit(filePath As String, _
300302
Optional subsetColumns As Variant = 1, _
301303
Optional Headers As Boolean = True, _
302304
Optional repeatHeaders As Boolean = True, _
303-
Optional streamSize As Long = 20) As Collection
305+
Optional streamSize As Long = 20, _
306+
Optional oConfig As CSVparserConfig = Nothing) As Collection
304307
Dim CreatedFiles As Collection
305308
Dim CSVhead As Variant
306309
Dim CSVreader As CSVinterface
@@ -322,6 +325,11 @@ Public Function CSVsubsetSplit(filePath As String, _
322325

323326
Set CreatedFiles = New Collection
324327
Set CSVreader = New CSVinterface
328+
If Not oConfig Is Nothing Then
329+
Set CSVreader.parseConfig = oConfig.CopyConfig
330+
Else
331+
Set CSVreader.parseConfig = Me.parseConfig.CopyConfig
332+
End If
325333
Set CSVwriter = New CSVinterface
326334
Set readerConf = CSVreader.parseConfig
327335
Set ExportSubSet = New CSVArrayList
@@ -348,6 +356,7 @@ Public Function CSVsubsetSplit(filePath As String, _
348356
'Sequential reading from file
349357
With CSVstream
350358
.endStreamOnLineBreak = True
359+
.unifiedLFOutput = readerConf.multiEndOfLineCSV
351360
.OpenStream filePath
352361
.utf8EncodedFile = readerConf.utf8EncodedFile
353362
.bufferSize = streamSize
@@ -3162,9 +3171,11 @@ Private Function SniffInString(ByRef confObject As CSVparserConfig, _
31623171
'@--------------------------------------------------------------------------------
31633172
'Save results with keys
31643173
If InStrB(1, TmpCSVstr, GetQuoteChar(QuoteChar(k))) Then
3165-
ScoreArray.AddIndexedItem DialectToString(.dialect), GuesserHelper.TableScore(ImportedTable)
3174+
ScoreArray.AddIndexedItem AppendIndexesToKey(DialectToString(.dialect), i, j), _
3175+
GuesserHelper.TableScore(ImportedTable)
31663176
Else
3167-
ScoreArray.AddIndexedItem DialectToString(.dialect) & CHR_CARET, GuesserHelper.TableScore(ImportedTable) / 2
3177+
ScoreArray.AddIndexedItem AppendIndexesToKey(DialectToString(.dialect) & CHR_CARET, i, j), _
3178+
GuesserHelper.TableScore(ImportedTable) '/ 2
31683179
End If
31693180
Next k
31703181
Next j
@@ -3173,7 +3184,7 @@ Private Function SniffInString(ByRef confObject As CSVparserConfig, _
31733184
With ScoreArray
31743185
'@--------------------------------------------------------------------------------
31753186
'Choose the maximum score
3176-
tmpResult = .keys()(MaxIndexVal(.indexedItems))
3187+
tmpResult = RemoveIndexesFromKey(.keys()(MaxIndexVal(.indexedItems)))
31773188
'@--------------------------------------------------------------------------------
31783189
'Returns
31793190
Set SniffInString = StringToDialect(tmpResult)
@@ -3218,6 +3229,12 @@ Private Function DialectToString(ByRef dialectObj As CSVdialect) As String
32183229
End With
32193230
DialectToString = Join$(tmpResult, "ii")
32203231
End Function
3232+
Private Function AppendIndexesToKey(ByRef aKey As String, idx1 As Long, idx2 As Long) As String
3233+
AppendIndexesToKey = CHR_LSQRB & idx1 & idx2 & CHR_RSQRB & aKey
3234+
End Function
3235+
Private Function RemoveIndexesFromKey(ByRef aKey As Variant) As String
3236+
RemoveIndexesFromKey = MidB(aKey, InStrB(1, aKey, CHR_RSQRB) + 2)
3237+
End Function
32213238
Private Function StringToDialect(ByRef dialectString As String) As CSVdialect
32223239
Dim tmpArr() As String
32233240
Dim idx As Long
-85.2 KB
Binary file not shown.
Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
=== Delimiters guessing test ===
2+
+ Mixed comma and semicolon
3+
+ File with multi-line field
4+
+ Optional quoted fields
5+
+ Mixed comma and semicolon - file B
6+
+ Geometric CSV
7+
+ Table embedded in the last record
8+
+ Table embedded in the second record
9+
+ Multiple commas in fields
10+
+ Uncommon char as field delimiter
11+
+ Wrong delimiters have been added to guessing operation
12+
+ FEC data - [clevercsv issue #15]
13+
+ Mixed comma and colon - [clevercsv issue #35]
14+
+ Json data type - [clevercsv issue #37]
15+
+ Undefined field delimiter
16+
X Rainbow CSV [issue #92]
17+
Expected: ([,] & [2])Actual: ([|] & [2])
18+
+ Pipe character is more frequent than the comma
19+
+ Pipe character is more frequent than the semicolon
20+
+ Short pipe separated table embedded
21+
= FAIL (1 of 18 failed) = 18/1/2024 11:28:32 p.�m. =
22+
Lines changed: 163 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,163 @@
1+
=== StreamCSVimport ===
2+
+ Bad comments value specified
3+
+ Comment with non-default character
4+
+ Commented line at beginning
5+
+ Commented line at end
6+
+ Commented line in middle
7+
+ Entire file is comment lines
8+
+ Input is just a string (a single field)
9+
+ Input is just empty fields
10+
+ Input is just the delimiter (2 empty fields)
11+
+ Input with only a commented line and blank line after
12+
+ Input with only a commented line, without comments enabled
13+
+ Input without comments with line starting with whitespace
14+
+ Line ends with quoted field
15+
+ Line starts with quoted field
16+
+ Misplaced quotes in data, not as opening quotes
17+
+ Multiple consecutive empty fields
18+
+ Multiple rows, one column (no delimiter found)
19+
+ One column input with empty fields
20+
+ One Row
21+
+ Pipe delimiter
22+
+ Quoted field at end of row (but not at EOF) has quotes
23+
+ Quoted field has no closing quote
24+
+ Quoted field with 5 quotes in a row and a delimiter
25+
+ Quoted field with delimiter
26+
+ Quoted field with escaped quotes at boundaries
27+
+ Quoted field with escaped quotes
28+
+ Quoted field with extra whitespace on edges
29+
+ Quoted field with line break
30+
+ Quoted field with quotes around delimiter
31+
+ Quoted field with quotes on left side of delimiter
32+
+ Quoted field with quotes on right side of delimiter
33+
+ Quoted field with Unix escaped quotes at boundaries
34+
+ Quoted field with whitespace around quotes
35+
+ Quoted field
36+
+ Quoted fields at end of row with delimiter and line break
37+
X Quoted fields with line breaks
38+
Expected 3 fields and 1 record
39+
+ Row with enough fields but blank field at end
40+
+ Row with too few fields
41+
+ Row with too many fields
42+
+ Skip empty lines, with empty input
43+
+ Skip empty lines, with first line only whitespace
44+
+ Skip empty lines, with newline at end of input
45+
+ Tab delimiter
46+
+ Three comment lines consecutively at beginning of file
47+
+ Two comment lines consecutively at end of file
48+
+ Two comment lines consecutively
49+
+ Two rows
50+
+ Unquoted field with quotes at end of field
51+
+ Whitespace at edges of unquoted field
52+
+ Complex CSV syntax
53+
= FAIL (1 of 50 failed) = 18/1/2024 8:40:43 p.�m. =
54+
55+
=== StringCSVimport ===
56+
+ Bad comments value specified
57+
+ Comment with non-default character
58+
+ Commented line at beginning
59+
+ Commented line at end
60+
+ Commented line in middle
61+
+ Entire file is comment lines
62+
+ Input is just a string (a single field)
63+
+ Input is just empty fields
64+
+ Input is just the delimiter (2 empty fields)
65+
+ Input with only a commented line and blank line after
66+
+ Input with only a commented line, without comments enabled
67+
+ Input without comments with line starting with whitespace
68+
+ Line ends with quoted field
69+
+ Line starts with quoted field
70+
+ Misplaced quotes in data, not as opening quotes
71+
+ Multiple consecutive empty fields
72+
+ Multiple rows, one column (no delimiter found)
73+
+ One column input with empty fields
74+
+ One Row
75+
+ Pipe delimiter
76+
+ Quoted field at end of row (but not at EOF) has quotes
77+
+ Quoted field has no closing quote
78+
+ Quoted field with 5 quotes in a row and a delimiter
79+
+ Quoted field with delimiter
80+
+ Quoted field with escaped quotes at boundaries
81+
+ Quoted field with escaped quotes
82+
+ Quoted field with extra whitespace on edges
83+
+ Quoted field with line break
84+
+ Quoted field with quotes around delimiter
85+
+ Quoted field with quotes on left side of delimiter
86+
+ Quoted field with quotes on right side of delimiter
87+
+ Quoted field with Unix escaped quotes at boundaries
88+
+ Quoted field with whitespace around quotes
89+
+ Quoted field
90+
+ Quoted fields at end of row with delimiter and line break
91+
X Quoted fields with line breaks
92+
Expected 3 fields and 1 record
93+
+ Row with enough fields but blank field at end
94+
+ Row with too few fields
95+
+ Row with too many fields
96+
+ Skip empty lines, with empty input
97+
+ Skip empty lines, with first line only whitespace
98+
+ Skip empty lines, with newline at end of input
99+
+ Tab delimiter
100+
+ Three comment lines consecutively at beginning of file
101+
+ Two comment lines consecutively at end of file
102+
+ Two comment lines consecutively
103+
+ Two rows
104+
+ Unquoted field with quotes at end of field
105+
+ Whitespace at edges of unquoted field
106+
+ Complex CSV syntax
107+
= FAIL (1 of 50 failed) = 18/1/2024 8:40:47 p.�m. =
108+
109+
=== SequentialCSVimport ===
110+
+ Bad comments value specified
111+
+ Comment with non-default character
112+
+ Commented line at beginning
113+
+ Commented line at end
114+
+ Commented line in middle
115+
+ Entire file is comment lines
116+
+ Input is just a string (a single field)
117+
+ Input is just empty fields
118+
+ Input is just the delimiter (2 empty fields)
119+
+ Input with only a commented line and blank line after
120+
X Input with only a commented line, without comments enabled
121+
Expected 1 records with 1 fields
122+
+ Input without comments with line starting with whitespace
123+
+ Line ends with quoted field
124+
+ Line starts with quoted field
125+
+ Misplaced quotes in data, not as opening quotes
126+
+ Multiple consecutive empty fields
127+
+ Multiple rows, one column (no delimiter found)
128+
+ One column input with empty fields
129+
+ One Row
130+
+ Pipe delimiter
131+
+ Quoted field at end of row (but not at EOF) has quotes
132+
+ Quoted field has no closing quote
133+
+ Quoted field with 5 quotes in a row and a delimiter
134+
+ Quoted field with delimiter
135+
+ Quoted field with escaped quotes at boundaries
136+
+ Quoted field with escaped quotes
137+
+ Quoted field with extra whitespace on edges
138+
+ Quoted field with line break
139+
+ Quoted field with quotes around delimiter
140+
+ Quoted field with quotes on left side of delimiter
141+
+ Quoted field with quotes on right side of delimiter
142+
+ Quoted field with Unix escaped quotes at boundaries
143+
+ Quoted field with whitespace around quotes
144+
+ Quoted field
145+
+ Quoted fields at end of row with delimiter and line break
146+
X Quoted fields with line breaks
147+
Expected 3 fields and 1 record
148+
+ Row with enough fields but blank field at end
149+
+ Row with too few fields
150+
+ Row with too many fields
151+
+ Skip empty lines, with empty input
152+
+ Skip empty lines, with first line only whitespace
153+
+ Skip empty lines, with newline at end of input
154+
+ Tab delimiter
155+
+ Three comment lines consecutively at beginning of file
156+
+ Two comment lines consecutively at end of file
157+
+ Two comment lines consecutively
158+
+ Two rows
159+
+ Unquoted field with quotes at end of field
160+
+ Whitespace at edges of unquoted field
161+
+ Complex CSV syntax
162+
= FAIL (2 of 50 failed) = 18/1/2024 8:40:50 p.�m. =
163+

0 commit comments

Comments
 (0)