Bash Loop Through Array of Website Links Read

For and Read-While Loops in Bash

The loop is one of the most fundamental and powerful constructs in computing, because it allows us to repeat a set of commands, as many times as we want, upon a list of items of our choosing. Much of computational thinking involves taking one task and solving it in a way that can be applied repeatedly to all other similar tasks, and the for loop is how we make the computer do that repetitive work:

    for item in $items
    do
      task $item
    done

Unlike most of the code we've written so far at the interactive prompt, a for-loop doesn't execute as soon as we hit Enter:

    user@host:~$ for item in $items

We can write out as many commands as we want in the block between the do and done keywords:

    do
      command_1
      command_2
      # another for loop just for fun
      for a in $things; do command_3 "$a"; done
      command_4
    done

Only when we reach done, and hit Enter, does the for-loop do its work.

This is fundamentally different from the line-by-line command-and-response we've experienced so far at the prompt. And it presages how we will be programming further on: less emphasis on executing commands with each line, and more emphasis on planning the functionality of a program, and then executing it later.

Basic syntax

The syntax for for loops can be confusing, so here are some basic examples to prep/refresh your comprehension of them:

    for animal in dog cat 'fruit bat' elephant ostrich
    do
      echo "I want a $animal for a pet"
    done

Here's a more elaborate version using variables:

    for thing in $collection_of_things
    do
      some_program $thing
      another_program $thing >> data.txt
      # as many commands as we want
    done

A command substitution can be used to generate the items that the for loop iterates across:

    for var_name in $(seq 1 100); do
      echo "Counting $var_name..."
    done

If you need to read a list of lines from a file, and are absolutely sure that none of the lines contain a space within them:

    for url in $(cat list_of_urls.txt); do
      curl "$url" >> everywebpage_combined.html
    done

A read-while loop is a variation of the above, but is safer for reading lines from a file:

    while read url
    do
      curl "$url" >> everywebpage_combined.html
    done < list_of_urls.txt
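A slightly more defensive form of the read-while loop is sketched below. It is not from the original examples: `IFS=` and `-r` keep read from trimming leading whitespace or interpreting backslashes, and the `|| [ -n "$line" ]` test picks up a last line that has no trailing newline. The function name print_lines and the sample data are made up for illustration.

```shell
# Hypothetical helper: print every line of a file exactly as written
print_lines() {
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s\n' "$line"
  done < "$1"
}

# Illustrative sample: leading spaces, backslashes, and no final newline
sample=$(mktemp)
printf '  indented line\npath\\with\\backslashes\nlast line without newline' > "$sample"

print_lines "$sample"
```

For a list of plain URLs the simpler `while read url` form behaves the same; the extra options only matter when lines contain unusual characters.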

Constructing a basic for loop

Let's start from the beginning, with a very minimal for loop, and then build it into something more elaborate, to help us get an understanding of their purpose.

The simplest loop

This is about as simple as you can make a for loop:

    user@host:~$ for x in 1
    > do
    >   echo Hi
    > done
    Hi

Did that seem pretty worthless? Yes, it should have. I wrote four lines of code to do what it takes a single line to do, echo 'Hi'.

More elements in the collection

It's hard to tell, but a "loop" did execute. It just executed once. OK, so how do we make it execute more than one time? Add more (space-separated) elements to the right of the in keyword. Let's add three more 1's, for four in total:

    user@host:~$ for x in 1 1 1 1
    > do
    >   echo Hi
    > done
    Hi
    Hi
    Hi
    Hi

OK, not very exciting, but the program definitely seemed to at least loop: four 1's resulted in four echo commands being executed.

What happens when we replace those four 1's with different numbers? And maybe a couple of words?

    user@host:~$ for x in Q Zebra 999 Smithsonian
    > do
    >   echo Hi
    > done
    Hi
    Hi
    Hi
    Hi

And…nothing. So the loop doesn't automatically do anything specific to the collection of values we gave it. Not yet anyway.

Refer to the loop variable

Let's look to the left of the in keyword, and at that x. What's the point of that x? A lowercase x isn't the name of a keyword or command that we've encountered so far (and executing it alone at the prompt will throw an error). So maybe it's a variable? Let's try referencing it in the echo statement:

    user@host:~$ for x in Q Zebra 999 Smithsonian
    > do
    >   echo Hi $x
    > done
    Hi Q
    Hi Zebra
    Hi 999
    Hi Smithsonian

Bingo. This is pretty much the fundamental workings of a for loop:

  - Get a collection of items/values (Q Zebra 999 Smithsonian)
  - Pass them into a for loop construct
  - Using the loop variable (x) as a placeholder, write commands between the do/done block.
  - When the loop executes, the loop variable, x, takes the value of each of the items in the list (Q, Zebra, 999, Smithsonian) and the block of commands between do and done is then executed. This sequence repeats once for every item in the list.
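The same steps can also be sketched with a Bash array, which the title of this page alludes to but the examples here do not use. This is an illustrative variation, not the author's method: the array name pets and its contents are assumptions. Quoting "${pets[@]}" keeps an item such as 'fruit bat' together as one value.

```shell
# Illustrative collection; the array keeps 'fruit bat' as a single item
pets=(dog cat 'fruit bat' elephant)

# The loop variable `pet` takes each value in turn
for pet in "${pets[@]}"; do
  echo "I want a $pet for a pet"
done
```

With a plain space-separated string instead of an array, 'fruit bat' would need its own quoting on the for line, as in the earlier animal example.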

The do/done block can contain any sequence of commands, even another for-loop:

    user@host:~$ for x in $(seq 1 3); do
    >   for y in A B C; do
    >     echo "$x:$y"
    >   done
    > done
    1:A
    1:B
    1:C
    2:A
    2:B
    2:C
    3:A
    3:B
    3:C

Loops-inside-loops is a common construct in programming. For the most part, I'm going to try to avoid assigning problems that would involve this kind of logic, as it can be tricky to untangle during debugging.

Read a file, line-by-line, reliably with read-while

Because cat prints a file line-by-line, the following for loop seems sensible:

    user@host:~$ for line in $(cat list-of-dirs.txt)
    > do
    >   echo "$line"
    > done

However, the shell splits the output of the command substitution into words at every space, not just at line breaks. If list-of-dirs.txt contains the following:

    Apples
    Oranges
    Documents and Settings

The output of the for loop will be this:

    Apples
    Oranges
    Documents
    and
    Settings

A read-while loop will preserve the words inside a line:

    user@host:~$ while read line
    do
      echo "$line"
    done < list-of-dirs.txt
    Apples
    Oranges
    Documents and Settings
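To see the difference in numbers, the sketch below counts how many times each kind of loop runs over the same three-line file. It recreates the sample file in a temporary location; the variable names are made up for illustration.

```shell
# Recreate the sample file from the text in a temporary location
dirs_file=$(mktemp)
printf 'Apples\nOranges\nDocuments and Settings\n' > "$dirs_file"

# for + command substitution: one iteration per whitespace-separated word
for_count=0
for item in $(cat "$dirs_file"); do
  for_count=$((for_count + 1))
done

# read-while: one iteration per line
while_count=0
while IFS= read -r line; do
  while_count=$((while_count + 1))
done < "$dirs_file"

echo "for loop iterations:   $for_count"
echo "while loop iterations: $while_count"
```

The for loop runs five times (once per word) while the read-while loop runs three times (once per line).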

We can also read from the result of a command by enclosing it in <( and ):

    user@host:~$ while read line
    do
      echo "Word count per line: $(echo "$line" | wc -w)"
    done < <(cat list-of-dirs.txt)
    Word count per line: 1
    Word count per line: 1
    Word count per line: 3

Pipes and loops

If you're coming from other languages, data streams may be unfamiliar to you. At least they are to me, as the syntax for working with them is far more direct and straightforward in Bash than in Ruby or Python.

However, if you're new to programming in any language, what might also be unclear is how working with data streams is different from working with loops.

For example, the following snippet:

    user@host:~$ echo "hello world i am here" | \
    > tr '[:lower:]' '[:upper:]' | tr ' ' '\n'
    HELLO
    WORLD
    I
    AM
    HERE

– produces the same output as this loop:

    for word in hello world i am here; do
      echo $word | tr '[:lower:]' '[:upper:]'
    done

And depending on your mental model of things, it does seem that in both examples, each word, e.g. hello, world, is passed through a process of translation (via tr) and then echoed.
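A quick way to check that claim is to capture both outputs and compare them. This is only a sketch; the variable names sentence, piped and looped are made up for illustration.

```shell
sentence="hello world i am here"

# Pipeline version: the whole stream flows through each filter once
piped=$(echo "$sentence" | tr '[:lower:]' '[:upper:]' | tr ' ' '\n')

# Loop version: tr is started once per word
looped=$(for word in $sentence; do
  echo "$word" | tr '[:lower:]' '[:upper:]'
done)

if [ "$piped" = "$looped" ]; then
  echo "same output"
else
  echo "different output"
fi
```

The output matches, but the work is organized differently: the pipeline runs each program once over the whole stream, while the loop starts a new tr process for every word.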

Pipes and filters

Without getting into the fundamentals of the Unix system, in which a pipe operates fundamentally differently than a loop here, let me suggest a mental workaround:

Programs that read from stdin and write to stdout can usually be arranged as filters, in which a stream of data goes into a program, and comes out in a different format:

    # send the stream through a reverse filter
    user@host:~$ echo "hello world i am here" | rev
    ereh ma i dlrow olleh
    # filter out the first 2 characters
    user@host:~$ echo "hello world i am here" | cut -c 3-
    llo world i am here
    # filter out the spaces
    user@host:~$ echo "hello world i am here" | tr -d ' '
    helloworldiamhere
    # filter out words with less than 4 characters
    user@host:~$ echo "hello world i am here" | grep -oE '[a-z]{4,}'
    hello
    world
    here

For tasks that are more than just transforming data, from filter to filter, think about using a loop. What might such a task be? Given a list of URLs, download each, and email the downloaded data, with a customized body and subject:

    user@host:~$ while read url; do
      # download the page
      content=$(curl -Ls "$url")
      # count the words
      num_of_words=$(echo "$content" | wc -w)
      # extract the title
      title=$(echo "$content" | grep -oP '(?<=<title>)[^<]+')
      # send an email with the page's title and word count
      echo "$content" | mail -s "$title: $num_of_words words" whoever@stanford.edu
      echo "...Sending: $title: $num_of_words words"
    done < urls.txt

The data input source, each URL in urls.txt, isn't really being filtered here. Instead, a multi-step task is being done for each URL.

Piping into read-while

That said, a loop itself can be implemented as just one more filter among filters. Take this variation of the read-while loop, in which the result of echo | grep is piped, line by line, into the while loop, which prints to stdout using echo, which is redirected to the file named some.txt:

    echo 'hey you' | grep -oE '[a-z]+' | while read line; do
      echo "$line" | wc -c
    done >> some.txt

This is not a construct that you may need to use often, if at all, but hopefully it reinforces pipe usage in Unix.
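One caveat when piping into a while loop, sketched here as an aside rather than something the original covers: in Bash's default configuration each part of a pipeline runs in a subshell, so variables changed inside the piped loop are gone once it finishes. Feeding the loop through a redirect from <( ) keeps it in the current shell. The variable names are illustrative.

```shell
# Piped version: the while loop runs in a subshell, so changes to `count` are lost
count=0
echo 'hey you' | grep -oE '[a-z]+' | while read -r word; do
  count=$((count + 1))
done
piped_count=$count

# Redirect version: the loop runs in the current shell, so `count` survives
count=0
while read -r word; do
  count=$((count + 1))
done < <(echo 'hey you' | grep -oE '[a-z]+')
redirected_count=$count

echo "after pipe: $piped_count, after redirect: $redirected_count"
```

This matters only when the loop needs to hand a value back to the rest of the script; a loop that just prints or writes to a file works the same either way.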

Less interactive programming

The frequent use of for loops, and similar constructs, means that we're moving past the good ol' days of typing in one line of commands and having it execute right after we hit Enter. No matter how many commands we pack inside a for loop, nothing happens until we hit the done keyword.

Write once. Then loop it

With that loss of line-by-line interaction with the shell, we lose the main advantage of the interactive prompt: immediate feedback. And we still have all the disadvantages: if we make a typo earlier in the block of commands between do and done, we have to start all over.

So here's how we mitigate that:

Test your code, one case at a time

One of the biggest mistakes novices make with for loops is they think a for loop immediately solves their problem. So, if what they have to do is download 10,000 URLs, but they can't properly download just one URL, they think putting their flawed commands into a for loop is a step in the right direction.

Besides this being a fundamental misunderstanding of a for loop, the practical problem is that you are now running your broken code 10,000 times, which means you have to wait 10,000 times as long to find out that your code is, alas, still broken.

So pretend you've never heard of for loops. Pretend you have to download all 10,000 URLs, one command at a time. Can you write the command to do it for the first URL? How about the second? Once you're reasonably confident that no minor syntax errors are tripping you up, then it's time to think about how to find a general pattern for the 9,998 other URLs.
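One way to follow that advice is a dry run: print the command that would be executed for a single case, check it by eye, and only then wrap it in a loop. The sketch below assumes a made-up helper, download_cmd, and placeholder example.com URLs; neither comes from the original exercise.

```shell
# Hypothetical helper: print the curl command for one URL instead of running it
download_cmd() {
  local url=$1
  local fname=$2
  printf 'curl -s "%s" > "%s"\n' "$url" "$fname"
}

# Step 1: check a single case by eye
download_cmd "http://example.com/page1" "page1.html"

# Step 2: only then generalize to the rest (placeholder URLs)
for n in 1 2 3; do
  download_cmd "http://example.com/page$n" "page$n.html"
done
```

Once the printed commands look right, the printf can be swapped for the real curl call.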

Write scripts

The interactive command-line is great. It was fun to start out with, and it'll be fun throughout your computing career. But when you have a big task in front of you, involving more than ten lines of code, then it's time to put that code into a shell script. Don't trust your fallible human fingers to flawlessly retype code.


Use nano to work on loops and save them as shell scripts. For longer files, I'll work in my computer's text editor (Sublime Text) and then upload to the server.
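As a rough idea of what such a saved script might look like, here is a minimal sketch. The file name loop-lines.sh and the function number_lines are hypothetical; the script numbers each line of the file passed as its first argument.

```shell
#!/usr/bin/env bash
# loop-lines.sh (hypothetical name): number each line of the file given as $1

number_lines() {
  local n=0
  while IFS= read -r line; do
    n=$((n + 1))
    echo "$n: $line"
  done < "$1"
}

# Run only when a filename is supplied, e.g.: bash loop-lines.sh urls.txt
if [ "$#" -ge 1 ]; then
  number_lines "$1"
fi
```

Saving the loop in a file means a typo can be fixed in the editor and the whole thing rerun, instead of retyping every line at the prompt.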

Exercise with web scraping

Just to ground the syntax and workings of the for-loop, here's the thought process for turning a routine task into a loop:

For the numbers 1 through 10, use curl to download the Wikipedia entry for each number, and save it to a file named "wiki-number-(whatever the number is).html"

The old fashioned way

With just 10 URLs, we could copy-and-paste a curl command 10 times, making changes to each line:

    user@host:~$ curl http://en.wikipedia.org/wiki/1 > 'wiki-number-1.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/2 > 'wiki-number-2.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/3 > 'wiki-number-3.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/4 > 'wiki-number-4.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/5 > 'wiki-number-5.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/6 > 'wiki-number-6.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/7 > 'wiki-number-7.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/8 > 'wiki-number-8.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/9 > 'wiki-number-9.html'
    user@host:~$ curl http://en.wikipedia.org/wiki/10 > 'wiki-number-10.html'

And guess what? It works. For 10 URLs, it's not a bad solution, and it's significantly faster than doing it the old old-fashioned way (doing it from your web browser).

Reducing repetition

Even without thinking about a loop, we can still reduce repetition using variables: the base URL, http://en.wikipedia.org/wiki/, and the base-filename never change, so let's assign those values to variables that can be reused:

    user@host:~$ base_url=http://en.wikipedia.org/wiki
    user@host:~$ fname='wiki-number'
    user@host:~$ curl "$base_url/1" > "$fname-1.html"
    user@host:~$ curl "$base_url/2" > "$fname-2.html"
    user@host:~$ curl "$base_url/3" > "$fname-3.html"
    user@host:~$ curl "$base_url/4" > "$fname-4.html"
    user@host:~$ curl "$base_url/5" > "$fname-5.html"
    user@host:~$ curl "$base_url/6" > "$fname-6.html"
    user@host:~$ curl "$base_url/7" > "$fname-7.html"
    user@host:~$ curl "$base_url/8" > "$fname-8.html"
    user@host:~$ curl "$base_url/9" > "$fname-9.html"
    user@host:~$ curl "$base_url/10" > "$fname-10.html"

Applying the for-loop

At this point, we've simplified the pattern so far that we can see how little changes with each separate task. After learning about the for-loop, we can apply it without much thinking (we also add a sleep command so that we pause between web requests):

    user@host:~$ base_url=http://en.wikipedia.org/wiki
    user@host:~$ fname='wiki-number'
    user@host:~$ for x in 1 2 3 4 5 6 7 8 9 10
    > do
    >   curl "$base_url/$x" > "$fname-$x.html"
    >   sleep 2
    > done

Generating a list

In most situations, creating a for-loop is easy; it's the creation of the list that can be the hard work. What if we wanted to collect the pages for numbers 1 through 100? That's a lot of typing.

But if we let our laziness dictate our thinking, we can imagine that counting from x to y seems like an inherently computational task. And it is, and Unix has the seq utility for this:

    user@host:~$ base_url=http://en.wikipedia.org/wiki
    user@host:~$ fname='wiki-number'
    user@host:~$ for x in $(seq 1 100)
    > do
    >   curl "$base_url/$x" > "$fname-$x.html"
    >   sleep 2
    > done
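Besides seq, Bash has two built-in ways to count, shown here as an optional aside rather than part of the original exercise: brace expansion and the C-style for loop. The range 1 to 5 and the variable max are chosen only to keep the sketch short.

```shell
# Brace expansion: the shell itself expands {1..5} to 1 2 3 4 5
for x in {1..5}; do
  echo "brace: $x"
done

# C-style loop: handy when the upper bound is stored in a variable
max=5
for ((x = 1; x <= max; x++)); do
  echo "c-style: $x"
done
```

Brace expansion happens before variables are substituted, so {1..$max} does not produce a range; with a variable bound, seq or the C-style form is the usual choice.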

Generating a list of non-numbers for iteration

Many repetitive tasks aren't as simple as counting from x to y, and so the problem becomes how to generate a non-linear list of items. This is basically the art of data-collection and management. But let's make a simple scenario for ourselves:

For ten of the 10-letter (or more) words that appear at least once in a headline on the current NYTimes.com front page, fetch the Wiktionary page for that word

We break this task into two parts:

  1. Fetch a list of ten 10+-letter words from nytimes.com headlines
  2. Pass those words to our for-loop

Step 1: Using the pup utility (or command-line HTML parser of your choice):

    user@host:~$ words=$(curl -s http://www.nytimes.com | \
    > pup 'h2.story-heading text{}' | \
    > grep -oE '[[:alpha:]]{10,}' | sort | \
    > uniq | head -n 10)
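The word-extraction part of that pipeline can be tried offline before pointing it at a live page. In the sketch below, the headlines variable holds made-up text standing in for what curl and pup would return; everything after it mirrors the grep, sort, uniq and head steps from Step 1.

```shell
# Made-up headline text standing in for the downloaded page
headlines='Administration announces negotiations
Researchers celebrate extraordinary breakthrough
Fresh negotiations follow breakthrough'

# Words of 10+ letters, de-duplicated, first 10 only
words=$(echo "$headlines" | grep -oE '[[:alpha:]]{10,}' | sort | uniq | head -n 10)

# Show the Wiktionary URL that would be fetched for each word
for word in $words; do
  echo "https://en.wiktionary.org/wiki/$word"
done
```

Testing on fixed text makes it easier to confirm the list is right before any web requests are sent.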

Step 2 (assuming the words variable is being passed along):

    user@host:~$ base_url='https://en.wiktionary.org/wiki/'
    user@host:~$ fname='wiktionary-'
    user@host:~$ for word in $words
    > do
    >   echo $word
    >   curl -sL "$base_url$word" > "$fname$word.html"
    >   sleep 2
    > done

Check out Software Carpentry's excellent guide to for-loops in Bash

Source: http://www.compciv.org/topics/bash/loops
