Commit 570d674

Update README.md
1 parent c4c7f53 commit 570d674

1 file changed (+5 -5 lines changed)

README.md

Lines changed: 5 additions & 5 deletions
@@ -35,21 +35,21 @@ The dataset used in this experiment is based on the the multilingual parallel ED
35 35

36 36
3 new datasets were derived from the EDGe corpus:
37 37
- EDGe_Zipped_Sizes.xlsx: contains the sizes of all the files in EDGe when they are zipped
38-
- morph_zipped_all.xlsx: contains the sizes of all the files in EDGe when they are morphologically distorted and then zipped over 1000 iterations
39-
- synt_zipped_all.xlsx: contains the sizes of all the files in EDGe when they are syntactically distorted and then zipped over 1000 iterations
38+
- EDGe_Morph_Zipped.xlsx: contains the sizes of all the files in EDGe when they are morphologically distorted and then zipped over 1000 iterations
39+
- EDGe_Synt_zipped.xlsx: contains the sizes of all the files in EDGe when they are syntactically distorted and then zipped over 1000 iterations
40 40

41 41
### Workflow & code
42 42
#### Step 1: create a file with all the file sizes of the zipped files in EDGe - EDGe_Zipped_Sizes.xlsx
43 43
In order to create EDGe_Zipped_Sizes.xlsx first all the files in the dataset need to be zipped. This is done by running gzip_files.py. The second step is retrieving all the file sizes of the zipped files. This is done by running file_size.py on the newly created zipped files.
44 44

45 45
#### Step 2: morphological distortion - morph_zipped_all.xlsx
46-
In this step all files are first morphologically distorted and subsequently zipped. Morphological distortion is achieved as described above, by randomly deleting 10% of all characters in the file. For each file this is done 1000 times and each time the size of the file is stored in morph_zipped_all.xlsx. This step requires morphological_distortion_pipeline.py.
46+
In this step all files are first morphologically distorted and subsequently zipped. Morphological distortion is achieved as described above, by randomly deleting 10% of all characters in the file. For each file this is done 1000 times and each time the size of the file is stored in EDGe_Morph_Zipped.xlsx. This step requires morphological_distortion_pipeline.py.
47 47

48 48
#### Step 3: syntactic distortion - synt_zipped_all.xlsx
49-
In this step all files are first syntactically distorted and subsequently zipped. Syntactic distortion is achieved as described above, by randomly deleting 10% of all words in the file. For each file this is done 1000 times and each time the size of the file is stored in synt_zipped_all.xlsx. This step requires syntactic_distortion_pipeline.py.
49+
In this step all files are first syntactically distorted and subsequently zipped. Syntactic distortion is achieved as described above, by randomly deleting 10% of all words in the file. For each file this is done 1000 times and each time the size of the file is stored in EDGe_Synt_zipped.xlsx. This step requires syntactic_distortion_pipeline.py.
50 50

51 51
#### Step 4: statistical analysis in R
52-
The statistical analysis of the created datasets (input = EDGe_Zipped_Sizes.xlsx, morph_zipped_all.xlsx and synt_zipped_all.xlsx) is done by running complexity_analysis.R. The script calculates the morphological and syntactic complexity as described above. The output of this script are graphs in .png format.
52+
The statistical analysis of the created datasets (input = EDGe_Zipped_Sizes.xlsx, EDGe_Morph_Zipped.xlsx and EDGe_Synt_zipped.xlsx) is done by running complexity_analysis.R. The script calculates the morphological and syntactic complexity as described above. The output of this script are graphs in .png format.
53 53

54 54
### Result
55 55
![Syntactic vs morphological complexity ratio](https://user-images.githubusercontent.com/107923146/212687027-2c4eaac4-89a9-45b5-b8bf-000191aa7c16.png)
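
The distortion and measurement steps described in Steps 1 to 3 of the README text can be sketched as follows. This is only an illustration under stated assumptions, not the contents of gzip_files.py, morphological_distortion_pipeline.py or syntactic_distortion_pipeline.py, which this diff does not show. All function names are hypothetical. The sketch assumes that exactly 10% of the units are removed per run, that whitespace counts as a character, that words are whitespace-separated tokens, and that sizes are measured on in-memory gzip output rather than on files on disk (so they may differ slightly from the sizes of zipped files).

```python
import gzip
import random


def delete_random_characters(text, fraction=0.10, rng=None):
    """Morphological distortion (assumed form): drop `fraction` of all characters."""
    if rng is None:
        rng = random.Random()
    n_delete = round(len(text) * fraction)
    # Pick the positions to drop without replacement.
    drop = set(rng.sample(range(len(text)), n_delete))
    return "".join(ch for i, ch in enumerate(text) if i not in drop)


def delete_random_words(text, fraction=0.10, rng=None):
    """Syntactic distortion (assumed form): drop `fraction` of all words.

    Words are whitespace-separated tokens; original line breaks are not
    preserved, which is a simplification of this sketch.
    """
    if rng is None:
        rng = random.Random()
    words = text.split()
    n_delete = round(len(words) * fraction)
    drop = set(rng.sample(range(len(words)), n_delete))
    return " ".join(w for i, w in enumerate(words) if i not in drop)


def gzipped_size(text):
    """Size in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def distorted_sizes(text, distort, iterations=1000, seed=0):
    """Distort and compress `text` repeatedly, returning one size per iteration."""
    rng = random.Random(seed)
    return [gzipped_size(distort(text, rng=rng)) for _ in range(iterations)]


if __name__ == "__main__":
    sample = "the quick brown fox jumps over the lazy dog " * 50
    print("original:", gzipped_size(sample))
    print("morphological:", distorted_sizes(sample, delete_random_characters, iterations=3))
    print("syntactic:", distorted_sizes(sample, delete_random_words, iterations=3))
```

Passing a seeded `random.Random` keeps the 1000 runs reproducible. Writing the resulting sizes to the .xlsx files named in the README is left out here.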
