343 | 343 | include_url_regex='/shop/'
344 | 344 | )
345 | 345 |
| 346 | +Customizing the Columns Included in the Output File
| 347 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
| 348 | +
| 349 | +By default, the crawler extracts all the available elements, as described in the documentation. If you want to customize the columns included in the output file, you can use the ``keep_columns`` and ``discard_columns`` parameters.
| 350 | +
| 351 | +1. The ``keep_columns`` parameter allows you to specify which columns should always be included in the output file.
| 352 | +2. The ``discard_columns`` parameter lets you specify which columns should be excluded.
| 353 | +
| 354 | +The filtering is applied in the following order: first the ``keep_columns`` regex patterns, then the ``discard_columns`` patterns. You cannot discard the ``url`` and ``errors`` columns, as they are always kept by default.
| 355 | +
| 356 | +For example, if we execute the following code snippet:
| 357 | +
| 358 | +>>> adv.crawl(
| 359 | +...     "http://example.com",
| 360 | +...     "output_file.jl",
| 361 | +...     keep_columns=["url", "title"],
| 362 | +... )
| 363 | +
| 364 | +
| 365 | +Our ``output_file.jl`` will contain the ``url``, ``title``, and ``errors`` columns. If we execute the following:
| 366 | +
| 367 | +>>> adv.crawl(
| 368 | +...     "http://example.com",
| 369 | +...     "output_file.jl",
| 370 | +...     discard_columns=["title"],
| 371 | +... )
| 372 | +
| 373 | +Our ``output_file.jl`` file will contain all columns except ``title``.
| 374 | +
346 | 375 | Spider Custom Settings and Additional Functionality
347 | 376 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
348 | 377 |
@@ -522,7 +551,7 @@ def _filter_crawl_dict(d, keep_columns=None, discard_columns=None):
522 | 551 |     Returns:
523 | 552 |     - dict: A filtered dictionary.
524 | 553 |     """
525 | | -    always_include = {"url", "errors", "jsonld_errors"}
| 554 | +    always_include = {"url", "errors"}
526 | 555 |
527 | 556 |     def matches_any(patterns, key):
528 | 557 |         return any(re.search(pattern, key) for pattern in patterns)
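The hunk above shows only two fragments of ``_filter_crawl_dict``. For context, a plausible sketch of how those pieces fit together, assuming the keep-then-discard order documented earlier; the loop body below is a reconstruction, not the library's verbatim code:

import re

def _filter_crawl_dict(d, keep_columns=None, discard_columns=None):
    # "url" and "errors" survive any filter, per the docs added above.
    always_include = {"url", "errors"}

    def matches_any(patterns, key):
        return any(re.search(pattern, key) for pattern in patterns)

    filtered = {}
    for key, value in d.items():
        if key in always_include:
            filtered[key] = value
        elif keep_columns is not None and not matches_any(keep_columns, key):
            continue  # keep_columns runs first: non-matching keys are dropped
        elif discard_columns is not None and matches_any(discard_columns, key):
            continue  # discard_columns runs second, within the kept set
        else:
            filtered[key] = value
    return filtered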
@@ -1072,7 +1101,7 @@ def crawl(
1072 | 1101 |         A list of regex patterns for the columns to discard in the output. If not
1073 | 1102 |         specified, all columns are kept. If both ``keep_columns`` and
1074 | 1103 |         ``discard_columns`` are specified, the columns will be filtered based on the
1075 | | -        ``keep_columns`` regex patterns first, and then the ``discard_columns``.
| 1104 | +        ``keep_columns`` regex patterns first, and then the ``discard_columns``. You cannot discard the ``url`` and ``errors`` columns, as they are always kept by default.
1076 | 1105 |     Examples
1077 | 1106 |     --------
1078 | 1107 |     Crawl a website and let the crawler discover as many pages as available
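As a quick illustration of that last point (a hypothetical call, not part of this diff): a discard pattern that matches only ``url`` is effectively ignored, since the column is re-added via ``always_include``:

>>> adv.crawl(
...     "http://example.com",
...     "output_file.jl",
...     discard_columns=["^url$", "title"],  # "^url$" has no effect; url is always kept
... )

The output file would still contain ``url`` and ``errors``, with only ``title`` (and any other column matching that regex) removed.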