Bash Loop Through Array of Website Links Read
For and Read-While Loops in Bash
The loop is one of the most fundamental and powerful constructs in computing, as it allows us to repeat a set of commands, as many times as we want, upon a list of items of our choosing. Much of computational thinking involves taking one task and solving it in a way that can be applied repeatedly to all other similar tasks, and the for loop is how we make the computer do that repetitive work:
for item in $items
do
  task $item
done

Unlike most of the code we've written so far at the interactive prompt, a for-loop doesn't execute as soon as we hit Enter:
user@host:~$ for item in $items

We can write out as many commands as we want in the block between the do and done keywords:
do
  command_1
  command_2
  # another for loop just for fun
  for a in $things; do
    command_3 $a
  done
  command_4
done

Only when we reach done, and hit Enter, does the for-loop do its work.
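To make this concrete, here is a minimal, runnable sketch of a multi-command do/done block; the item list and variable names (items, count, summary) are invented for illustration:

```shell
# The do/done block below holds two commands; neither runs until
# the closing done is reached.
items="alpha beta gamma"

count=0
summary=""
for item in $items
do
  count=$((count + 1))       # command 1: tally how many times the block ran
  summary="$summary$item;"   # command 2: accumulate the items into a string
done

echo "$count items: $summary"
```

Typing this at the prompt, nothing prints until the final done line is entered.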
This is fundamentally different from the line-by-line command-and-response we've experienced so far at the prompt. And it presages how we will be programming further on: less emphasis on executing commands with each line, and more emphasis on planning the functionality of a program, then executing it later.
Basic syntax
The syntax for for loops can be confusing, so here are some basic examples to prep/refresh your comprehension of them:
for animal in dog cat 'fruit bat' elephant ostrich
do
  echo "I want a $animal for a pet"
done

Here's a more elaborate version using variables:
for thing in $collection_of_things
do
  some_program $thing
  another_program $thing >> data.txt
  # as many commands as we want
done

A command substitution can be used to generate the items that the for loop iterates across:
for var_name in $(seq 1 100); do
  echo "Counting $var_name ..."
done

If you need to read a list of lines from a file, and are absolutely sure that none of the lines contain a space within them:
for url in $(cat list_of_urls.txt); do
  curl "$url" >> everywebpage_combined.html
done

A read-while loop is a variation of the above, but is safer for reading lines from a file:
while read url
do
  curl "$url" >> everywebpage_combined.html
done < list_of_urls.txt

Constructing a basic for loop
Let's start from the beginning, with a very minimal for loop, and then build it into something more elaborate, to help us get an understanding of their purpose.
The simplest loop
This is about as simple as you can make a for loop:
user@host:~$ for x in 1
> do
> echo Hello
> done
Hello

Did that seem pretty worthless? Yes, it should have. I wrote four lines of code to do what it takes a single line to do: echo 'Hello'.
More elements in the collection
It's hard to tell, but a "loop" did execute. It just executed once. OK, so how do we make it execute more than one time? Add more (space-separated) elements to the right of the in keyword. Let's add three more 1's:
user@host:~$ for x in 1 1 1 1
> do
> echo Hello
> done
Hello
Hello
Hello
Hello

OK, not very exciting, but the program definitely seemed to at least loop: four 1's resulted in four echo commands being executed.
What happens when we replace those four 1's with different numbers? And maybe a couple of words?
user@host:~$ for x in Q Zebra 999 Smithsonian
> do
> echo Hello
> done
Hello
Hello
Hello
Hello

And…nothing. So the loop doesn't automatically do anything specific to the collection of values we gave it. Not yet anyway.
Refer to the loop variable
Let's look to the left of the in keyword, and at that x. What's the point of that x? A lowercase x isn't the name of a keyword or command that we've encountered so far (and executing it alone at the prompt will throw an error). So maybe it's a variable? Let's try referencing it in the echo statement:
user@host:~$ for x in Q Zebra 999 Smithsonian
> do
> echo Hello $x
> done
Hello Q
Hello Zebra
Hello 999
Hello Smithsonian

Bingo. This is pretty much the key workings of a for loop:
- Get a collection of items/values (Q Zebra 999 Smithsonian)
- Pass them into a for loop construct
- Using the loop variable (x) as a placeholder, write commands between the do/done block.
- When the loop executes, the loop variable, x, takes the value of each of the items in the list – Q, Zebra, 999, Smithsonian – and the block of commands between do and done is then executed. This sequence repeats once for every item in the list.
The do/done block can contain any sequence of commands, even another for-loop:
user@host:~$ for x in $(seq 1 3); do
> for y in A B C; do
> echo "$x:$y"
> done
> done
1:A
1:B
1:C
2:A
2:B
2:C
3:A
3:B
3:C

Loops-inside-loops is a common construct in programming. For the most part, I'm going to try to avoid assigning problems that would involve this kind of logic, as it can be tricky to untwist during debugging.
Read a file, line-by-line, reliably with read-while
Because cat prints a file line-by-line, the following for loop seems sensible:
user@host:~$ for line in $(cat list-of-dirs.txt)
> do
> echo "$line"
> done

However, the command substitution will cause cat's output to be split on whitespace. If list-of-dirs.txt contains the following:
Apples
Oranges
Documents and Settings

The output of the for loop will be this:
Apples
Oranges
Documents
and
Settings

A read-while loop will preserve the words inside a line:
user@host:~$ while read line
> do
> echo "$line"
> done < list-of-dirs.txt
Apples
Oranges
Documents and Settings

We can also pipe from the result of a command by enclosing it in <( and ):
user@host:~$ while read line
> do
> echo "Word count per line: $(echo "$line" | wc -w)"
> done < <(cat list-of-dirs.txt)
Word count per line: 1
Word count per line: 1
Word count per line: 3

Pipes and loops
If you're coming from other languages, data streams may be unfamiliar to you. At least they were to me, as the syntax for working with them is far more direct and straightforward in Bash than in Ruby or Python.
However, if you're new to programming in any language, what might also be unclear is how working with data streams is different from working with loops.
For example, the following snippet:
user@host:~$ echo "hello world i am here" | \
> tr '[:lower:]' '[:upper:]' | tr ' ' '\n'
HELLO
WORLD
I
AM
HERE

– produces the same output as this loop:
for word in hello world i am here; do
  echo $word | tr '[:lower:]' '[:upper:]'
done

And depending on your mental model of things, it does seem that in both examples, each word, e.g. hello, world, is passed through a process of translation (via tr) and then echoed.
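The equivalence can be checked directly: this sketch runs both versions and compares their output (the variable names pipeline_out and loop_out are invented for the comparison):

```shell
# Version 1: a pure pipeline -- uppercase the whole stream, then split on spaces
pipeline_out=$(echo "hello world i am here" | tr '[:lower:]' '[:upper:]' | tr ' ' '\n')

# Version 2: a loop that transforms one word per iteration
loop_out=$(for word in hello world i am here; do
  echo "$word" | tr '[:lower:]' '[:upper:]'
done)

echo "$pipeline_out"
```

Both variables end up holding the same five uppercase lines, one word per line.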
Pipes and filters
Without getting into the fundamentals of the Unix system, in which a pipe operates fundamentally differently from a loop here, let me suggest a mental workaround:
Programs that pipe from stdin to stdout can usually be arranged as filters, in which a stream of data goes into a program, and comes out in a different format:
# send the stream through a reverse filter
user@host:~$ echo "hello world i am here" | rev
ereh ma i dlrow olleh

# filter out the first two characters
user@host:~$ echo "hello world i am here" | cut -c 3-
llo world i am here

# filter out the spaces
user@host:~$ echo "hello world i am here" | tr -d ' '
helloworldiamhere

# filter out words with fewer than four characters
user@host:~$ echo "hello world i am here" | grep -oE '[a-z]{4,}'
hello
world
here

For tasks that are more than just transforming data, from filter to filter, think about using a loop. What might such a task be? Given a list of URLs, download each, and email the downloaded data, with a customized body and subject:
user@host:~$ while read url; do
>   # download the page
>   content=$(curl -Ls $url)
>   # count the words
>   num_of_words=$(echo $content | wc -w)
>   # extract the title
>   title=$(echo $content | grep -oP '(?<=<title>)[^<]+')
>   # send an email with the page's title and word count
>   echo "$content" | mail whoever@stanford.edu -s "$title: $num_of_words words"
>   echo "...Sending: $title: $num_of_words words"
> done < urls.txt

The data input source, each URL in urls.txt, isn't really being filtered here. Instead, a multi-step task is being done for each URL.
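Here is a network-free sketch of the same multi-step-per-item pattern: instead of curl-ing real URLs and sending mail, we loop over canned lines of fake "page content" and build a summary string. The titles, the pages.txt file name, and the report variable are all invented for illustration, and the title extraction assumes GNU grep (for -P):

```shell
# fake "pages", one per line, standing in for downloaded content
printf '%s\n' '<title>Alpha</title> one two' '<title>Beta</title> three' > pages.txt

report=""
while read -r line; do
  # step 1: count the words in this "page"
  num_of_words=$(echo "$line" | wc -w | tr -d ' ')
  # step 2: extract the title (requires GNU grep for -P)
  title=$(echo "$line" | grep -oP '(?<=<title>)[^<]+')
  # step 3: record a summary instead of sending an email
  report="$report$title:$num_of_words "
done < pages.txt

echo "$report"
```

Each iteration performs the full multi-step task for one item, which is exactly what a single filter in a pipeline can't do.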
Piping into read-while
That said, a loop itself can be implemented as just one more filter among filters. Take this variation of the read-while loop, in which the result of echo | grep is piped, line by line, into the while loop, which prints to stdout using echo, which is redirected to the file named some.txt:
echo 'hey you' | grep -oE '[a-z]+' | while read line; do
  echo $line | wc -c
done >> some.txt

This is not a construct that you may need to use often, if at all, but hopefully it reinforces pipe usage in Unix.
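A fully runnable sketch of the same loop-as-filter idea, counting the characters of each word (printf -n style is used so the trailing newline isn't counted; tr -d ' ' strips wc's padding on some systems):

```shell
# grep emits one word per line; the while loop counts each word's
# characters; the results land in some.txt
echo 'hey you' | grep -oE '[a-z]+' | while read -r line; do
  printf '%s' "$line" | wc -c | tr -d ' '
done > some.txt

cat some.txt
```

One caveat worth knowing: because the while loop sits inside a pipeline, it runs in a subshell, so any variables assigned inside it won't survive once the pipeline ends.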
Less interactive programming
The frequent use of for loops, and similar constructs, means that we're moving past the good ol' days of typing in one line of commands and having it execute right after we hit Enter. No matter how many commands we pack inside a for loop, nothing happens until we hit the done keyword.
Write once. Then loop it
With that loss of line-by-line interaction with the shell, we lose the primary advantage of the interactive prompt: immediate feedback. And we still have all the disadvantages: if we make a typo earlier in the block of commands between do and done, we have to start all over.
So here's how we mitigate that:
Test your code, one case at a time
One of the biggest mistakes novices make with for loops is they think a for loop immediately solves their problem. So, if what they have to do is download 10,000 URLs, but they can't properly download just one URL, they think putting their flawed commands into a for loop is a step in the right direction.
Besides this being a fundamental misunderstanding of a for loop, the practical problem is that you are now running your broken code 10,000 times, which means you have to wait 10,000 times as long to find out that your code is, alas, still broken.
So pretend you've never heard of for loops. Pretend you have to download all 10,000 URLs, one command at a time. Can you write the command to do it for the first URL? How about the second? Once you're reasonably confident that no minor syntax errors are tripping you up, then it's time to think about how to find a general pattern for the 9,998 other URLs.
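One way to rehearse that general pattern is a "dry run": build the loop, but have it print each command instead of executing it. A sketch, where base_url is a made-up example and no network is needed:

```shell
base_url="http://example.com/page"   # hypothetical URL base

# the dry run: echo the curl commands we *would* run, without running them
dry_run=$(for x in $(seq 1 3); do
  echo "curl -s $base_url/$x > page-$x.html"
done)

echo "$dry_run"
```

If the printed commands look right for the first few cases, deleting the echo (and its quotes) turns the rehearsal into the real loop.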
Write scripts
The interactive command-line is great. It was fun to start out with, and it'll be fun throughout your computing career. But when you have a big task in front of you, involving more than 10 lines of code, then it's time to put that code into a shell script. Don't trust your fallible human fingers to flawlessly retype code.
Use nano to work on loops and save them as shell scripts. For longer files, I'll work in my computer's text editor (Sublime Text) then upload to the server.
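The move from prompt to script can be sketched end-to-end: write the loop to a file, then run it with bash. The file names count_words.sh and sample.txt are invented for the example:

```shell
# save a read-while loop as a script (normally you'd do this in nano)
cat > count_words.sh <<'EOF'
#!/bin/bash
# print the word count of each line of the file named as the first argument
while read -r line; do
  echo "$line" | wc -w | tr -d ' '
done < "$1"
EOF

# make some sample input and run the script
printf '%s\n' 'one two' 'three' > sample.txt
bash count_words.sh sample.txt
```

Once the code lives in a file, a typo means editing one line, not retyping the whole loop.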
Exercise with web scraping
Just to ground the syntax and workings of the for-loop, here's the thought process for turning a routine task into a loop:
For the numbers 1 through 10, use curl to download the Wikipedia entry for each number, and save it to a file named "wiki-number-(whatever the number is).html"
The old-fashioned way
With only ten URLs, we could set a couple of variables and then copy-and-paste a curl command, ten times, making changes to each line:
user@host:~$ curl http://en.wikipedia.org/wiki/1 > 'wiki-number-1.html'
user@host:~$ curl http://en.wikipedia.org/wiki/2 > 'wiki-number-2.html'
user@host:~$ curl http://en.wikipedia.org/wiki/3 > 'wiki-number-3.html'
user@host:~$ curl http://en.wikipedia.org/wiki/4 > 'wiki-number-4.html'
user@host:~$ curl http://en.wikipedia.org/wiki/5 > 'wiki-number-5.html'
user@host:~$ curl http://en.wikipedia.org/wiki/6 > 'wiki-number-6.html'
user@host:~$ curl http://en.wikipedia.org/wiki/7 > 'wiki-number-7.html'
user@host:~$ curl http://en.wikipedia.org/wiki/8 > 'wiki-number-8.html'
user@host:~$ curl http://en.wikipedia.org/wiki/9 > 'wiki-number-9.html'
user@host:~$ curl http://en.wikipedia.org/wiki/10 > 'wiki-number-10.html'

And guess what? It works. For 10 URLs, it's not a bad solution, and it's significantly faster than doing it the really old-fashioned way (doing it from your web browser).
Reducing repetition
Even without thinking about a loop, we can still reduce repetition using variables: the base URL, http://en.wikipedia.org/wiki/, and the base-filename never change, so let's assign those values to variables that can be reused:
user@host:~$ base_url=http://en.wikipedia.org/wiki
user@host:~$ fname='wiki-number'
user@host:~$ curl "$base_url/1" > "$fname-1"
user@host:~$ curl "$base_url/2" > "$fname-2"
user@host:~$ curl "$base_url/3" > "$fname-3"
user@host:~$ curl "$base_url/4" > "$fname-4"
user@host:~$ curl "$base_url/5" > "$fname-5"
user@host:~$ curl "$base_url/6" > "$fname-6"
user@host:~$ curl "$base_url/7" > "$fname-7"
user@host:~$ curl "$base_url/8" > "$fname-8"
user@host:~$ curl "$base_url/9" > "$fname-9"
user@host:~$ curl "$base_url/10" > "$fname-10"

Applying the for-loop
At this point, we've simplified the pattern so far that we can see how little changes with each separate task. After learning about the for-loop, we can apply it without much thinking (we also add a sleep command so that we pause between web requests):
user@host:~$ base_url=http://en.wikipedia.org/wiki
user@host:~$ fname='wiki-number'
user@host:~$ for x in 1 2 3 4 5 6 7 8 9 10
> do
>   curl "$base_url/$x" > "$fname-$x"
>   sleep 2
> done

Generating a list
In most situations, creating a for-loop is easy; it's the creation of the list that can be the hard work. What if we wanted to collect the pages for numbers 1 through 100? That's a lot of typing.
But if we let our laziness dictate our thinking, we can imagine that counting from x to y seems like an inherently computational task. And it is, and Unix has the seq utility for this:
user@host:~$ base_url=http://en.wikipedia.org/wiki
user@host:~$ fname='wiki-number'
user@host:~$ for x in $(seq 1 100)
> do
>   curl "$base_url/$x" > "wiki-number-$x"
>   sleep 2
> done

Generating a list of non-numbers for iteration
Many repetitive tasks aren't as simple as counting from x to y, and so the problem becomes: how to generate a non-linear list of items? This is basically the art of data-collection and management. But let's make a simple scenario for ourselves:
For 10 of the 10-letter (or longer) words that appear at least once in a headline on the current NYTimes.com front page, fetch the Wiktionary page for that word
We break this task into two parts:
- Fetch a list of ten 10+-letter words from nytimes.com headlines
- Pass those words to our for-loop
Step 1: Using the pup utility (or command-line HTML parser of your choice):
user@host:~$ words=$(curl -s http://www.nytimes.com | \
> pup 'h2.story-heading text{}' | \
> grep -oE '[[:alpha:]]{10,}' | sort | \
> uniq | head -n 10)

Step 2 (assuming the words variable is being passed along):
user@host:~$ base_url='https://en.wiktionary.org/wiki/'
user@host:~$ fname='wiktionary-'
user@host:~$ for word in $words
> do
> echo $word
> curl -sL "$base_url$word" > "$fname$word.html"
> sleep 2
> done

Check out Software Carpentry's excellent guide to for-loops in Bash.
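The word-extraction part of Step 1 can be rehearsed without the network or pup, by running the grep/sort/uniq/head tail of the pipeline on a canned block of text standing in for headlines (the headline text here is invented):

```shell
# fake "headlines", one per line, standing in for pup's output
headlines='Lawmakers Negotiate Budget
Scientists Celebrate Breakthrough
Rain Today'

# keep only words of 10 or more letters, deduplicated, at most 10 of them
words=$(echo "$headlines" | grep -oE '[[:alpha:]]{10,}' | sort | uniq | head -n 10)
echo "$words"
```

Only "Scientists" (10 letters) and "Breakthrough" (12 letters) survive the filter, which is exactly the kind of spot check to do before pointing the loop at the live page.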
Source: http://www.compciv.org/topics/bash/loops