README.md: 5 additions & 5 deletions
@@ -35,21 +35,21 @@ The dataset used in this experiment is based on the the multilingual parallel ED
3 new datasets were derived from the EDGe corpus:
- EDGe_Zipped_Sizes.xlsx: contains the sizes of all the files in EDGe when they are zipped
--morph_zipped_all.xlsx: contains the sizes of all the files in EDGe when they are morphologically distorted and then zipped over 1000 iterations
--synt_zipped_all.xlsx: contains the sizes of all the files in EDGe when they are syntactically distorted and then zipped over 1000 iterations
+-EDGe_Morph_Zipped.xlsx: contains the sizes of all the files in EDGe when they are morphologically distorted and then zipped over 1000 iterations
+-EDGe_Synt_zipped.xlsx: contains the sizes of all the files in EDGe when they are syntactically distorted and then zipped over 1000 iterations
### Workflow & code
#### Step 1: create a file with all the file sizes of the zipped files in EDGe - EDGe_Zipped_Sizes.xlsx
In order to create EDGe_Zipped_Sizes.xlsx first all the files in the dataset need to be zipped. This is done by running gzip_files.py. The second step is retrieving all the file sizes of the zipped files. This is done by running file_size.py on the newly created zipped files.
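The two operations described in this step (compress every file, then record the compressed size) can be sketched as follows. This is a minimal illustration, not the contents of gzip_files.py or file_size.py: the function names `gzipped_size` and `zipped_sizes` are hypothetical, the compression is done in memory instead of writing `.gz` files to disk, and writing the result to an .xlsx file is left out.

```python
import gzip
from pathlib import Path


def gzipped_size(path: Path) -> int:
    """Return the size in bytes of the gzip-compressed contents of `path`."""
    # Assumption: compressing in memory; a .gz file written to disk may
    # differ by a few header bytes (for example a stored file name).
    return len(gzip.compress(path.read_bytes()))


def zipped_sizes(corpus_dir: Path) -> dict[str, int]:
    """Map each file name directly inside `corpus_dir` to its compressed size."""
    return {
        p.name: gzipped_size(p)
        for p in sorted(corpus_dir.iterdir())
        if p.is_file()
    }
```

Calling `zipped_sizes(Path("EDGe"))` (the directory name is an assumption) would give one size per file, which could then be exported to a spreadsheet.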
-In this step all files are first morphologically distorted and subsequently zipped. Morphological distortion is achieved as described above, by randomly deleting 10% of all characters in the file. For each file this is done 1000 times and each time the size of the file is stored in morph_zipped_all.xlsx. This step requires morphological_distortion_pipeline.py.
+In this step all files are first morphologically distorted and subsequently zipped. Morphological distortion is achieved as described above, by randomly deleting 10% of all characters in the file. For each file this is done 1000 times and each time the size of the file is stored in EDGe_Morph_Zipped.xlsx. This step requires morphological_distortion_pipeline.py.
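A rough sketch of the morphological distortion loop described here: delete 10% of the characters at random, compress the result, and repeat. It is not the code of morphological_distortion_pipeline.py. The names `delete_characters` and `distorted_sizes` are hypothetical, the fixed seed is an assumption added for reproducibility, and every character (including whitespace) is treated as equally likely to be deleted, which the text does not specify.

```python
import gzip
import random


def delete_characters(text: str, fraction: float, rng: random.Random) -> str:
    """Remove a randomly chosen `fraction` of the characters in `text`."""
    n_delete = round(len(text) * fraction)
    # Sample distinct positions so exactly n_delete characters are removed.
    doomed = set(rng.sample(range(len(text)), n_delete))
    return "".join(ch for i, ch in enumerate(text) if i not in doomed)


def distorted_sizes(text: str, iterations: int = 1000,
                    fraction: float = 0.10, seed: int = 0) -> list[int]:
    """Return the gzip-compressed size of `iterations` independent distortions."""
    rng = random.Random(seed)  # assumption: seeded for repeatable runs
    return [
        len(gzip.compress(delete_characters(text, fraction, rng).encode("utf-8")))
        for _ in range(iterations)
    ]
```

Each call to `distorted_sizes` yields one list of sizes per file, matching the "1000 sizes per file" layout the step describes.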
-In this step all files are first syntactically distorted and subsequently zipped. Syntactic distortion is achieved as described above, by randomly deleting 10% of all words in the file. For each file this is done 1000 times and each time the size of the file is stored in synt_zipped_all.xlsx. This step requires syntactic_distortion_pipeline.py.
+In this step all files are first syntactically distorted and subsequently zipped. Syntactic distortion is achieved as described above, by randomly deleting 10% of all words in the file. For each file this is done 1000 times and each time the size of the file is stored in EDGe_Synt_zipped.xlsx. This step requires syntactic_distortion_pipeline.py.
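For the word-level variant, a single distortion pass might look like the sketch below. It is an illustration only, not the code of syntactic_distortion_pipeline.py. The names `delete_words` and `syntactically_distorted_size` are hypothetical, and splitting on whitespace is an assumed definition of "word"; the actual script may tokenize differently.

```python
import gzip
import random


def delete_words(text: str, fraction: float, rng: random.Random) -> str:
    """Remove a randomly chosen `fraction` of the whitespace-separated words."""
    words = text.split()
    n_delete = round(len(words) * fraction)
    doomed = set(rng.sample(range(len(words)), n_delete))
    # Note: joining with single spaces discards the original line breaks.
    return " ".join(w for i, w in enumerate(words) if i not in doomed)


def syntactically_distorted_size(text: str, rng: random.Random,
                                 fraction: float = 0.10) -> int:
    """Return the gzip-compressed size of one word-deleted copy of `text`."""
    return len(gzip.compress(delete_words(text, fraction, rng).encode("utf-8")))
```

Calling `syntactically_distorted_size` 1000 times with the same `random.Random` instance would produce the per-file series of sizes that this step stores.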
#### Step 4: statistical analysis in R
-The statistical analysis of the created datasets (input = EDGe_Zipped_Sizes.xlsx, morph_zipped_all.xlsx and synt_zipped_all.xlsx) is done by running complexity_analysis.R. The script calculates the morphological and syntactic complexity as described above. The output of this script are graphs in .png format.
+The statistical analysis of the created datasets (input = EDGe_Zipped_Sizes.xlsx, EDGe_Morph_Zipped.xlsx and EDGe_Synt_zipped.xlsx) is done by running complexity_analysis.R. The script calculates the morphological and syntactic complexity as described above. The output of this script are graphs in .png format.
### Result