Commit d3c02d9

Merge pull request #104 from aws-samples/94-update-cqlreplicatorscala-for-memorydb-parquet-and-opensearch
Expended the stats
2 parents 04e072e + 97c1c38 commit d3c02d9

6 files changed, with 399 additions and 134 deletions

glue/README.MD

Lines changed: 44 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -81,12 +81,15 @@ After running the below command, you should find in the AWS account:
8181
## Start migration process
8282

8383
To operate CQLReplicator on AWS Glue, you need to use `--state run` command, followed by a series of parameters. The
84-
precise
85-
configuration of these is primarily determined by your unique migration requirements. For instance, these settings may
86-
vary
87-
if you choose to replicate TTLs, updates, or offloading objects exceeding 1MB to Amazon S3.
84+
precise configuration of these is primarily determined by your unique migration requirements. For instance, these
85+
settings may
86+
vary if you choose to replicate TTLs, updates, or offloading objects exceeding 1MB to Amazon S3.
8887

89-
Let's run the following command to replicate the workload from the Cassandra cluster to Amazon Keyspaces.
88+
Estimate the number of tiles, each tile (concurrent glue job) can handle up to 74M primary keys with 299 DPUs (G.025X)
89+
maximum,
90+
for example, 8 tiles can handle up to 598M primary keys.
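
The tile estimate above amounts to a ceiling division of the total primary-key count by the per-tile capacity. A minimal sketch, assuming the rounded 74M figure quoted in the README; `estimate_tiles` is a hypothetical helper and not part of `cqlreplicator`. Note that 598M over 8 tiles works out to about 74.75M per tile, so the constant should be treated as approximate.

```shell
#!/usr/bin/env bash
# Assumption: per-tile capacity taken from the README text (rounded to 74M).
KEYS_PER_TILE=74000000

# Hypothetical helper: smallest number of tiles that covers total_keys.
estimate_tiles() {
  local total_keys="$1"
  # Integer ceiling division, no floating point needed.
  echo $(( (total_keys + KEYS_PER_TILE - 1) / KEYS_PER_TILE ))
}

estimate_tiles 500000000   # prints 7
estimate_tiles 74000000    # prints 1
```

The result would then be passed as the `--tiles` value of the `--state run` command.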
91+
92+
Let's run the following command to replicate the workload from the Cassandra cluster to Amazon Keyspaces:
9093
9194
```shell
9295
cqlreplicator --state run --tiles 8 --landing-zone s3://cql-replicator-1234567890-us-west-2 --region us-west-2 \
@@ -146,14 +149,49 @@ In order, to restart failed CQLReplicator jobs, you need to re-run `--state run`
146149
147150
## Get migration stats
148151
149-
To obtain the number of rows replicated during the back filling phase, run the following command:
152+
To obtain the number of replicated rows during the back filling phase, run the following command:
150153
151154
```shell
152155
cqlreplicator --state stats --landing-zone s3://cql-replicator-1234567890-us-west-2 \
153156
--src-keyspace ks_test_cql_replicator --src-table test_cql_replicator \
154157
--region us-west-2
155158
```
156159
160+
To obtain the number of replicated rows after the back filling phase, run the following command:
161+
162+
```shell
163+
cqlreplicator --state stats --landing-zone s3://cql-replicator-1234567890-us-west-2 \
164+
--src-keyspace ks_test_cql_replicator --src-table test_cql_replicator \
165+
--region us-west-2 --replication-stats-enabled
166+
```
167+
168+
```
169+
___ ___ _ ____ _ _ _
170+
/ ___/ _ \| | | _ \ ___ _ __ | (_) ___ __ _| |_ ___ _ __
171+
| | | | | | | | |_) / _ \ '_ \| | |/ __/ _` | __/ _ \| '__|
172+
| |__| |_| | |___| _ < __/ |_) | | | (_| (_| | || (_) | |
173+
\____\__\_\_____|_| \_\___| .__/|_|_|\___\__,_|\__\___/|_|
174+
|_|
175+
·······································································
176+
: __ _______ _____ _____ :
177+
: /\ \ / / ____| | __ \ / ____| :
178+
: / \ \ /\ / / (___ | |__) | __ ___| (___ ___ _ ____ _____ :
179+
: / /\ \ \/ \/ / \___ \ | ___/ '__/ _ \\___ \ / _ \ '__\ \ / / _ \:
180+
: / ____ \ /\ / ____) | | | | | | (_) |___) | __/ | \ V / __/:
181+
:/_/ \_\/ \/ |_____/ |_| |_| \___/_____/ \___|_| \_/ \___|:
182+
·······································································
183+
[2024-02-02T16:50:18-05:00] OS: Linux
184+
+------------------------------------------------------------------------+
185+
| Tile | Inserts | Updates | Deletes | Timestamp |
186+
+------------------------------------------------------------------------+
187+
| 0 | 4 | 1 | 0 | "2024-02-02T21:50:12.888" |
188+
+------------------------------------------------------------------------+
189+
| 1 | 2 | 2 | 1 | "2024-02-02T21:50:25.454" |
190+
+------------------------------------------------------------------------+
191+
[2024-02-02T16:50:39-05:00] Discovered rows in casereview.casedetails is 70270
192+
[2024-02-02T16:50:39-05:00] Replicated rows in casereview.casedetails_2 is 70270
193+
```
194+
157195
## Cost optimization
158196
159197
In order to reduce AWS Glue costs after the historical workload moved to the target storage:

glue/bin/cqlreplicator

Lines changed: 74 additions & 70 deletions
Original file line number | Diff line number | Diff line change
@@ -10,10 +10,10 @@ JOB_NAME=CQLReplicator
1010
TILES=2
1111
PROCESS_TYPE_DISCOVERY=discovery
1212
PROCESS_TYPE_REPLICATION=replication
13-
SOURCE_KS=ks_test_cql_replicator
14-
SOURCE_TBL=test_cql_replicator
15-
TARGET_KS=ks_test_cql_replicator
16-
TARGET_TBL=test_cql_replicator
13+
SOURCE_KS=""
14+
SOURCE_TBL=""
15+
TARGET_KS=""
16+
TARGET_TBL=""
1717
WRITETIME_COLUMN="None"
1818
TTL_COLUMN="None"
1919
S3_LANDING_ZONE=""
@@ -41,6 +41,7 @@ SKIP_GLUE_CONNECTOR=false
4141
SKIP_KEYSPACES_LEDGER=false
4242
JSON_MAPPING=""
4343
REPLICATION_POINT_IN_TIME=0
44+
REPLICATION_STATS_ENABLED=false
4445
OS=$(uname -a | awk '{print $1}')
4546

4647
# Progress bar configuration
@@ -105,6 +106,25 @@ function check_input() {
105106
return 0
106107
}
107108

109+
print_stat_table() {
110+
# Assign the arguments to variables
111+
local tile=$1
112+
local inserts=$2
113+
local updates=$3
114+
local deletes=$4
115+
local timestamp=$5
116+
local head=$6
117+
if [[ $head == true ]]; then
118+
echo "+------------------------------------------------------------------------+"
119+
# Print the table header with a border
120+
printf "| %-8s | %-8s | %-8s | %-8s | %-20s |\n" "Tile" "Inserts" "Updates" "Deletes" "Timestamp"
121+
echo "+------------------------------------------------------------------------+"
122+
fi
123+
# Print the table data with a border
124+
printf "| %-8d | %-8d | %-8d | %-8d | %-20s |\n" $tile $inserts $updates $deletes "$timestamp"
125+
echo "+------------------------------------------------------------------------+"
126+
}
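
To see how the `head` argument controls the header, here is a condensed, standalone copy of `print_stat_table` called for two tiles. The sample values come from the README output in this commit; the `border` variable and the POSIX-style `[ ... ]` test are small liberties taken so the sketch runs on its own.

```shell
#!/usr/bin/env bash
# Condensed copy of print_stat_table, for illustration only.
print_stat_table() {
  local tile="$1"
  local inserts="$2"
  local updates="$3"
  local deletes="$4"
  local timestamp="$5"
  local head="$6"
  local border="+------------------------------------------------------------------------+"
  if [ "$head" = true ]; then
    # Header is printed only when the caller asks for it (first tile).
    echo "$border"
    printf "| %-8s | %-8s | %-8s | %-8s | %-20s |\n" "Tile" "Inserts" "Updates" "Deletes" "Timestamp"
    echo "$border"
  fi
  printf "| %-8d | %-8d | %-8d | %-8d | %-20s |\n" "$tile" "$inserts" "$updates" "$deletes" "$timestamp"
  echo "$border"
}

# Timestamps keep their JSON quotes because jq is called without -r in the script.
print_stat_table 0 4 1 0 '"2024-02-02T21:50:12.888"' true
print_stat_table 1 2 2 1 '"2024-02-02T21:50:25.454"' false
```

The first call emits five lines (border, header, border, row, border); later calls emit only the row and its closing border.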
127+
108128
function check_discovery_runs() {
109129
local rs
110130
local mode
@@ -176,7 +196,7 @@ function uploader_helper() {
176196
progress $curr_pos $final_pos "Uploading $artifact_name "
177197
if ls "$path_to_conf/$artifact_name" > /dev/null
178198
then
179-
progress $next_pos $final_pos "Uploading $artifact_name "
199+
progress $next_pos $final_pos "Uploading $artifact_name "
180200
aws s3 cp "$path_to_conf"/"$artifact_name" "$S3_LANDING_ZONE"/artifacts/"$artifact_name" > /dev/null
181201
else
182202
log "ERROR: $path_to_conf/$artifact_name not found"
@@ -305,7 +325,7 @@ function Init {
305325
sleep 25
306326
if ls "$path_to_scala"/CQLReplicator.scala
307327
then
308-
progress 3 5 "Uploading CQLReplicator.scala "
328+
progress 3 5 "Uploading CQLReplicator.scala "
309329
aws s3 cp "$path_to_scala"/CQLReplicator.scala "$glue_bucket_artifacts"/scripts/CQLReplicator.scala > /dev/null
310330
else
311331
log "ERROR: $path_to_scala/CQLReplicator.scala not found"
@@ -316,7 +336,7 @@ function Init {
316336
# Create Glue Connector
317337
local glue_conn_name
318338
if [[ $SKIP_GLUE_CONNECTOR == false ]]; then
319-
progress 3 5 "Creating Glue connector and CQLReplicator job "
339+
progress 3 5 "Creating Glue connector and CQLReplicator job "
320340
glue_conn_name=$(echo cql-replicator-"$(uuidgen)" | tr ' [:upper:]' ' [:lower:]')
321341
aws glue create-connection --connection-input '{
322342
"Name":"'$glue_conn_name'",
@@ -395,76 +415,30 @@ function Init {
395415
fi
396416

397417
if [[ $SKIP_KEYSPACES_LEDGER == true ]]; then
398-
progress 4 5 "Skipping CQLReplicator's internal keyspace"
399-
progress 5 5 "Skipping CQLReplicator's internal table"
418+
progress 4 5 "Skipping CQLReplicator's internal keyspace "
419+
progress 5 5 "Skipping CQLReplicator's internal table "
400420
fi
401421

402422
if [[ $SKIP_KEYSPACES_LEDGER == false ]]; then
403-
progress 4 5 "Creating CQLReplicator's internal keyspace and table"
423+
progress 4 5 "Creating CQLReplicator's internal resources "
404424
# Create a keyspace - migration
405425
aws keyspaces create-keyspace --keyspace-name migration --region "$AWS_REGION" > /dev/null
406426
sleep 20
407427

408428
# Create a table - ledger
409429
aws keyspaces create-table --keyspace-name migration --table-name ledger --schema-definition '{
410-
"allColumns": [
411-
{
412-
"name": "ks",
413-
"type": "text"
414-
},
415-
{
416-
"name": "tbl",
417-
"type": "text"
418-
},
419-
{
420-
"name": "tile",
421-
"type": "int"
422-
},
423-
{
424-
"name": "ver",
425-
"type": "text"
426-
},
427-
{
428-
"name": "dt_load",
429-
"type": "timestamp"
430-
},
431-
{
432-
"name": "dt_offload",
433-
"type": "timestamp"
434-
},
435-
{
436-
"name": "load_status",
437-
"type": "text"
438-
},
439-
{
440-
"name": "location",
441-
"type": "text"
442-
},
443-
{
444-
"name": "offload_status",
445-
"type": "text"
446-
}
447-
],
448-
"partitionKeys": [
449-
{
450-
"name": "ks"
451-
},
452-
{
453-
"name": "tbl"
454-
}
455-
],
456-
"clusteringKeys": [
457-
{
458-
"name": "tile",
459-
"orderBy": "ASC"
460-
},
461-
{
462-
"name": "ver",
463-
"orderBy": "ASC"
464-
}
465-
]
466-
}' --region "$AWS_REGION" > /dev/null
467-
progress 5 5 "Creating CQLReplicator's internal keyspace and table"
430+
"allColumns": [ { "name": "ks", "type": "text" },
431+
{ "name": "tbl", "type": "text" },
432+
{ "name": "tile", "type": "int" },
433+
{ "name": "ver", "type": "text" },
434+
{ "name": "dt_load", "type": "timestamp" },
435+
{ "name": "dt_offload", "type": "timestamp" },
436+
{ "name": "load_status", "type": "text" },
437+
{ "name": "location", "type": "text" },
438+
{ "name": "offload_status", "type": "text" } ],
439+
"partitionKeys": [ { "name": "ks" }, { "name": "tbl" } ],
440+
"clusteringKeys": [ { "name": "tile", "orderBy": "ASC" }, { "name": "ver", "orderBy": "ASC" } ] }' --region "$AWS_REGION" > /dev/null
441+
progress 5 5 "Creating CQLReplicator's internal resources "
468442
fi
469443

470444
log "Deploy is completed"
@@ -514,7 +488,7 @@ function Start_Discovery {
514488
function Start_Replication {
515489
cnt=0
516490
KEYS_PER_TILE=$(aws s3 cp "$S3_LANDING_ZONE"/"$SOURCE_KS"/"$SOURCE_TBL"/stats/discovery/"$cnt"/count.json - | head | jq '.primaryKeys')
517-
log "Average primary keys per tile is $KEYS_PER_TILE"
491+
log "Sampled primary keys per tile is $KEYS_PER_TILE"
518492
local workers=$(( 2 + KEYS_PER_TILE/ROWS_PER_WORKER ))
519493
while [ $cnt -lt $TILES ]
520494
do
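
The log wording changes to "Sampled" because `KEYS_PER_TILE` is read from tile 0's `count.json` only (`cnt=0` at that point), and that single value sizes the workers for every tile. A small sketch of the worker arithmetic follows; `ROWS_PER_WORKER` is defined outside the hunks shown here, so the value below is an illustrative assumption, and `workers_for` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Assumption: illustrative value; the real ROWS_PER_WORKER is set elsewhere in the script.
ROWS_PER_WORKER=250000

# Hypothetical helper mirroring: workers=$(( 2 + KEYS_PER_TILE/ROWS_PER_WORKER ))
workers_for() {
  local keys_per_tile="$1"
  # Integer division truncates, so small tiles still get the base of 2 workers.
  echo $(( 2 + keys_per_tile / ROWS_PER_WORKER ))
}

workers_for 1000000   # prints 6
workers_for 100000    # prints 2
```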
@@ -727,6 +701,10 @@ while (( "$#" )); do
727701
SKIP_KEYSPACES_LEDGER=true
728702
shift 1
729703
;;
704+
--replication-stats-enabled)
705+
REPLICATION_STATS_ENABLED=true
706+
shift 1
707+
;;
730708
--)
731709
shift
732710
break
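
The new option is a plain boolean switch: it takes no value, so the parser shifts by one. A reduced sketch of that pattern, assuming a catch-all branch that skips other arguments (the real script handles each of its options explicitly); `parse_flags` is a hypothetical wrapper used only to make the snippet testable.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the while/case parsing pattern used by the script.
parse_flags() {
  REPLICATION_STATS_ENABLED=false
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --replication-stats-enabled)
        # Boolean switch: no value follows, consume one argument.
        REPLICATION_STATS_ENABLED=true
        shift 1
        ;;
      --)
        shift
        break
        ;;
      *)
        # Assumption for this sketch: ignore everything else.
        shift 1
        ;;
    esac
  done
}

parse_flags --state stats --replication-stats-enabled
echo "$REPLICATION_STATS_ENABLED"   # prints true
```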
@@ -768,6 +746,21 @@ function Gather_Stats() {
768746
then
769747
total_per_tile=$(aws s3 cp "$S3_LANDING_ZONE"/"$SOURCE_KS"/"$SOURCE_TBL"/stats/"$process_type"/"$tile"/count.json - | head | jq '.primaryKeys') && REPLICATED_TOTAL=$(( REPLICATED_TOTAL + total_per_tile ))
770748
fi
749+
if [[ $REPLICATION_STATS_ENABLED == true ]]; then
750+
local inserted=0
751+
inserted=$(aws s3 cp "$S3_LANDING_ZONE"/"$SOURCE_KS"/"$SOURCE_TBL"/stats/"$process_type"/"$tile"/count.json - | head | jq '.insertedPrimaryKeys')
752+
local updated=0
753+
updated=$(aws s3 cp "$S3_LANDING_ZONE"/"$SOURCE_KS"/"$SOURCE_TBL"/stats/"$process_type"/"$tile"/count.json - | head | jq '.updatedPrimaryKeys')
754+
local deleted=0
755+
deleted=$(aws s3 cp "$S3_LANDING_ZONE"/"$SOURCE_KS"/"$SOURCE_TBL"/stats/"$process_type"/"$tile"/count.json - | head | jq '.deletedPrimaryKeys')
756+
local timestamp=""
757+
timestamp=$(aws s3 cp "$S3_LANDING_ZONE"/"$SOURCE_KS"/"$SOURCE_TBL"/stats/"$process_type"/"$tile"/count.json - | head | jq '.updatedTimestamp')
758+
local header=true
759+
if [[ $tile != 0 ]]; then
760+
header=false
761+
fi
762+
print_stat_table "$tile" "$inserted" "$updated" "$deleted" "$timestamp" "$header"
763+
fi
771764
fi
772765
fi
773766
}
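
The block added to `Gather_Stats` downloads the same `count.json` four times per tile, once per field. One possible alternative, sketched under assumptions, fetches the object once and extracts all four fields in a single `jq` call. `fetch_count_json` is a hypothetical stand-in for the `aws s3 cp ... count.json -` command, and its sample payload assumes a flat JSON object with the field names used in the diff. Because `jq -r` is used here, the timestamp comes back without its surrounding quotes, unlike the output of the committed code.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for:
#   aws s3 cp "$S3_LANDING_ZONE/$SOURCE_KS/$SOURCE_TBL/stats/$process_type/$tile/count.json" -
fetch_count_json() {
  echo '{"insertedPrimaryKeys":4,"updatedPrimaryKeys":1,"deletedPrimaryKeys":0,"updatedTimestamp":"2024-02-02T21:50:12.888"}'
}

# Single fetch, single jq pass: emit the four fields space-separated on one line.
read -r inserted updated deleted timestamp <<EOF
$(fetch_count_json | jq -r '[.insertedPrimaryKeys, .updatedPrimaryKeys, .deletedPrimaryKeys, .updatedTimestamp] | map(tostring) | join(" ")')
EOF

echo "$inserted $updated $deleted $timestamp"
```

Splitting on spaces is safe here only because none of the four values contains whitespace.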
@@ -786,6 +779,10 @@ if [[ $STATE == stats ]]; then
786779
check_input "$SOURCE_TBL" "ERROR: source table name is empty, must be provided"
787780
check_input "$S3_LANDING_ZONE" "ERROR: landing zone must be provided"
788781
check_input "$AWS_REGION" "ERROR: region must be provided"
782+
check_input "$SOURCE_KS" "ERROR: source keyspace name is empty, must be provided"
783+
check_input "$SOURCE_TBL" "ERROR: source table name is empty, must be provided"
784+
check_input "$TARGET_TBL" "ERROR: target table name is empty, must be provided"
785+
check_input "$TARGET_KS" "ERROR: target keyspace name is empty, must be provided"
789786
# the barrier without checking if the discovery job is running
790787
barrier "false"
791788
tile=0
@@ -795,7 +792,14 @@ if [[ $STATE == stats ]]; then
795792
Gather_Stats $tile "replication"
796793
((tile++))
797794
done
798-
799795
log "Discovered rows in" "$SOURCE_KS"."$SOURCE_TBL" is "$DISCOVERED_TOTAL"
800796
log "Replicated rows in" "$TARGET_KS"."$TARGET_TBL" is "$REPLICATED_TOTAL"
797+
if [[ $REPLICATION_STATS_ENABLED == true ]]; then
798+
t=0
799+
while [ $t -lt "$TILES" ]
800+
do
801+
Gather_Stats $t "detailed-replication"
802+
((t++))
803+
done
804+
fi
801805
fi
